Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.
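In C, the idea looks roughly like this (a sketch; the condition-bit
assignments are illustrative assumptions, not the actual Qupls4 encoding):

#include <stdint.h>

enum { CC_EQ = 0, CC_LT = 1, CC_LTU = 2 };  /* assumed bit positions */

uint64_t cmp_bits(int64_t a, int64_t b) {
    uint64_t r = 0;
    r |= (uint64_t)(a == b) << CC_EQ;                      /* equal */
    r |= (uint64_t)(a < b) << CC_LT;                       /* signed less-than */
    r |= (uint64_t)((uint64_t)a < (uint64_t)b) << CC_LTU;  /* unsigned less-than */
    return r;  /* the result vector lands in an ordinary GPR */
}

/* branch on bit-set is then just: if (cc & (1ull << CC_LT)) goto target; */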
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10, 50,
90, or 130 bits.
On 10/28/2025 8:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit
instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
Registers are named as if there were 32 GPRs, A0 (arg 0 register is
r1) and A0H (arg 0 high is r33). Same for other registers.
I assume the "high" registers are for handling 128 bit operations
without the need to specify another register name. Do you have 5 or 6
bit register numbers in the instructions? Five allows you to use the
high registers for 128 bit operations without needing another register
specifier, but then the high registers can only be used for 128 bit
operations, which seems a waste. If you have six bits, you can use all
64 registers for any operation, but how is the "upper" method
better than automatically using r(x+1)?
GPRs may contain either integer or floating-point values.
Going with a bit result vector in any GPR for compares, then a branch
on bit-set/clear for conditional branches. Might also include branch
true / false.
Using operand routing for immediate constants and an operation size
for the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be
10,50,90 or 130 bits.
Those seem like a call from the My 66000 playbook, which I like.
On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
On 10/28/2025 8:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit
instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
Registers are named as if there were 32 GPRs, A0 (arg 0 register is
r1) and A0H (arg 0 high is r33). Sameo for other registers.
I assume the "high" registers are for handling 128 bit operations
without the need to specify another register name. Do you have 5 or 6
bit register numbers in the instructions? Five allows you to use the
high registers for 128 bit operations without needing another register
specifier, but then the high registers can only be used for 128 bit
operations, which seems a waste. If you have six bits, you can use
all 64 registers for any operation, but how is the "upper" method
better than automatically using r(x+1)?
Yes, but it is just a suggested usage. The registers are GPRs that can
be used for anything, specified using a six-bit register number. I
suggested it that way because most of the time register values would be
passed around as 64-bit quantities and it keeps the same set of
registers for the same register type (argument, temp, saved). But since
it should be running mostly compiled code, it does not make much
difference.
Also, the high registers could be used as FP registers, maybe allowing
for saving only the low order 32 regs during a context switch.
GPRs may contain either integer or floating-point values.
Going with a bit result vector in any GPR for compares, then a branch
on bit-set/clear for conditional branches. Might also include branch
true / false.
Using operand routing for immediate constants and an operation size
for the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be
10,50,90 or 130 bits.
Those seem like a call from the My 66000 playbook, which I like.
Yup.
On 2025-10-29 8:41 a.m., Robert Finch wrote:
On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
On 10/28/2025 8:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit
instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
Registers are named as if there were 32 GPRs, A0 (arg 0 register is
r1) and A0H (arg 0 high is r33). Same for other registers.
I assume the "high" registers are for handling 128 bit operations
without the need to specify another register name. Do you have 5 or
6 bit register numbers in the instructions? Five allows you to use
the high registers for 128 bit operations without needing another
register specifier, but then the high registers can only be used for
128 bit operations, which seems a waste. If you have six bits, you
can use all 64 registers for any operation, but how is the "upper"
method better than automatically using r(x+1)?
Yes, but it is just a suggested usage. The registers are GPRs that can
be used for anything, specified using a six-bit register number. I
suggested it that way because most of the time register values would
be passed around as 64-bit quantities and it keeps the same set of
registers for the same register type (argument, temp, saved). But
since it should be using mostly compiled code, it does not make much
difference.
Also, the high registers could be used as FP registers. Maybe allowing
for saving only the low order 32 regs during a context switch.
GPRs may contain either integer or floating-point values.
Going with a bit result vector in any GPR for compares, then a
branch on bit-set/clear for conditional branches. Might also include
branch true / false.
Using operand routing for immediate constants and an operation size
for the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be
10,50,90 or 130 bits.
Those seem like a call from the My 66000 playbook, which I like.
Yup.
I should mention that the high registers are available only in user/app
mode. For other modes of operation only the low-order 32 registers are
available. I did this to reduce the number of logical registers in the
design. There are about 160 (64+32+32+32) logical registers then. They
are supported by 512 physical registers. My previous design had 224
logical registers, which eats up more hardware, probably for little
benefit.
On 10/28/2025 10:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or floating-point values.
OK.
I mostly stuck with 32-bit encodings, but 40 could maybe allow more
encoding space, with the drawback of being non-power-of-2.
But, yeah, occasionally dealing with 128-bit data is a major case for 64
GPRs and paired registers.
My case: 10/33/64.
No direct 128-bit constant, but can use two 64-bit constants whenever
128 bits is needed.
<snip>
Otherwise, goings on in my land:
ISA development is slow, and had mostly turned into bug hunting;
The longer term future is uncertain.
My ISAs can beat RISC-V in terms of code density and performance, but
when RISC-V is extended with similar features, it is harder to make
a case that it is "enough".
Doesn't seem like (within the ISA) there are many obvious ways left to
grab large general-case performance gains over what I have done already.
Some code benefits from lots of GPRs, but harder to make the case that
it reflects the general case.
Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
$240), made some curious observations:
It seems to slightly outperform my main PC in single-threaded performance;
Its RAM timings don't seem to match the expected values.
My main PC still wins at multi-threaded performance, and has the
advantage of 7x more RAM.
Robert Finch <robfi680@gmail.com> schrieb:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
I understand the temptation to go for more bits :-) What is your
instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?
Having register pairs does not make the compiler writer's life easier, unfortunately.
Going with a bit result vector in any GPR for compares, then a branch on
bit-set/clear for conditional branches. Might also include branch true /
false.
Having 64 registers and 64 bit registers makes life easier for that particular task :-)
If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
larger size of your instructions.
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
Those sizes are not really a good fit for constants from programs,
where quite a few constants tend to be 32 or 64 bits. Would a
64-bit FP constant leave 26 bits empty?
BGB <cr88192@gmail.com> posted:
On 10/28/2025 10:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
OK.
I mostly stuck with 32-bit encodings, but 40 could maybe allow more
encoding space, but the drawback of being non-power-of-2.
it is definitely an issue.
But, yeah, occasionally dealing with 128-bit data is a major case for 64
GPRs and paired registers.
There is always the DBLE pseudo-instruction.
DBLE Rd,Rs1,Rs2,Rs3
All DBLE does is provide more registers for the wide computation, in
such a way that the compiler is not forced to pair or share any
registers. The other thing DBLE does is tell the decoder that the next
instruction is 2× as wide as its OpCode states. In lower-end machines
(and in GPUs) DBLE is sequenced as if it were an instruction. In
higher-end machines, DBLE would be CoIssued with its mate.
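For reference, the widened operation itself is ordinary carry
propagation across two 64-bit halves; a C sketch of what the register
pair (or the DBLE-supplied registers) must compute for a 128-bit add:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;  /* one register pair */

u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
    return r;
}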
----------
My case: 10/33/64.
No direct 128-bit constant, but can use two 64-bit constants whenever
128 bits is needed.
{5, 16, 32, 64}-bit immediates.
<snip>
Otherwise, goings on in my land:
ISA development is slow, and had mostly turned into bug hunting;
The longer term future is uncertain.
My ISAs can beat RISC-V in terms of code density and performance, but
when RISC-V is extended with similar features, it is harder to make
a case that it is "enough".
I am still running at 70% of RISC-V's instruction count.
Doesn't seem like (within the ISA) there are many obvious ways left to
grab large general-case performance gains over what I have done already.
Fewer instructions, and/or instructions that take fewer cycles to execute.
Example, ENTER and EXIT instructions move 4 registers per cycle to/from
cache in a pipeline that has 1 result per cycle.
Some code benefits from lots of GPRs, but harder to make the case that
it reflects the general case.
There is very little to be gained with that many registers.
Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
$240), made some curious observations:
It seems to slightly outperform my main PC in single-threaded performance;
Its RAM timings don't seem to match the expected values.
My main PC still wins at multi-threaded performance, and has the
advantage of 7x more RAM.
My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
On 10/29/2025 11:47 AM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
snip
But, yeah, occasionally dealing with 128-bit data is a major case for 64
GPRs and paired registers.
There is always the DBLE pseudo-instruction.
DBLE Rd,Rs1,Rs2,Rs3
All DBLE does is to provide more registers for the wide computation
in such a way that compiler is not forced to pair or share any reg-
isters. The other thing DBLE does is to tell the decoder that the
next instruction is 2× as wide as its OpCode states. In lower end
machines (and in GPUs) DBLE is sequenced as if it were an instruction.
In higher end machines, DBLE would be CoIssued with its mate.
So if DBLE says the next instruction is double width, does that mean
that all "128 bit instructions" require 64 bits in the instruction
stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?
If so, I guess it is a tradeoff for not requiring register pairing, e.g.
Rn and Rn+1.
Robert Finch <robfi680@gmail.com> posted:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
Going with a bit result vector in any GPR for compares, then a branch on
bit-set/clear for conditional branches. Might also include branch true /
false.
I have both the bit-vector compare and branch, but also a compare to zero
and branch as a single instruction. I suggest you should too, if for no
other reason than:
if( p && p->next )
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.
With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.
Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:
CVTfd Rt,Rf
FMUL Rt,Rt,#1.425D0
CVTdf Rd,Rt
Which is subject to double rounding once at the FMUL and again at the
down conversion. I thought about the problem and it seems fairly easy
to gate the 24-bit fraction into the multiplier tree along with the
53-bit fraction of the constant, and then normalize and round the
result dropping out of the tree--avoiding the double rounding case.
Now, the compiler emits:
FMULf Rd,Rf,#1.425D0
saving 2 instructions along with the higher precision.
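In C, the two shapes being compared look like this (a sketch; 1.425
stands in for whatever constant the source used, and whether the
results actually differ depends on the operand):

float mul_two_roundings(float f) {
    double t = (double)f * 1.425;  /* rounded once, to double (the FMUL) */
    return (float)t;               /* rounded again, to float (the CVTdf) */
}

/* The FMULf form instead rounds the exact product once, directly to float. */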
Desktop PC:
8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
Rarely reaches turbo
pretty much only happens if just running a single thread...
With all cores running stuff in the background:
Idles around 3.6 to 3.8.
Laptop:
4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
If power set to performance, reaches turbo a lot more easily,
and with multi-core workloads.
But, puts out a lot of heat while doing so...
If set to Efficiency, mostly stays below 3 GHz.
As noted, the laptop is surprisingly speedy for how cheap it was.
At this point, the discussion is academic, as Robert has said he has 6
bit register specifiers in the instructions.
But my issue had nothing
to do with SIMD registers, as he said he supported 128 bit arithmetic
and the "high" registers were used for that.
<snip>
My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
<snip>
Desktop PC:
8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
Rarely reaches turbo
pretty much only happens if just running a single thread...
With all cores running stuff in the background:
Idles around 3.6 to 3.8.
Laptop:
4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
If power set to performance, reaches turbo a lot more easily,
and with multi-core workloads.
But, puts out a lot of heat while doing so...
If set to Efficiency, mostly stays below 3 GHz.
As noted, the laptop is surprisingly speedy for how cheap it was.
For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
was needed; my last machine only had 16GB, and I found it using about
20GB. I did not want to spring for a machine with even more RAM, as those
tended to be high-end machines.
On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
I understand the temptation to go for more bits :-) What is your
instruction alignment? Bytewise so 40 bits fit, or do you have some
alignment that the first instruction of a cache line is always aligned?
The 40-bit instructions are byte aligned. This does add more shifting in
the align stage. Once shifted, though, instructions are easily peeled off
from fixed positions. One consequence is that jump targets must be byte
aligned, or routines could be required to be 32-bit aligned, for instance.
If you have that many bits available, do you still go for a load-store
architecture, or do you have memory operations? This could offset the
larger size of your instructions.
It is load/store, with no memory ops excepting possibly atomic memory ops.
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
Those sizes are not really a good fit for constants from programs,
where quite a few constants tend to be 32 or 64 bits. Would a
64-bit FP constant leave 26 bits empty?
I found that 16-bit immediates could be encoded instead of 10-bit. So,
now there are 16, 56, 96 and 136 bit constants possible. The 56-bit
constant likely has enough range for most 64-bit ops. Otherwise, using
a 96-bit constant for 64-bit ops would leave the upper 32 bits of the
constant unused. 136-bit constants may not be implemented, but a size
code is reserved for that size.
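The arithmetic behind those sizes, as a sketch (assuming the base
40-bit parcel leaves 16 immediate bits and each extension parcel
contributes a full 40):

/* constant width for k trailing 40-bit extension parcels: 16, 56, 96, 136 */
int const_bits(int k) { return 16 + 40 * k; }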
Robert Finch <robfi680@gmail.com> schrieb:
On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
I understand the temptation to go for more bits :-) What is your
instruction alignment? Bytewise so 40 bits fit, or do you have some
alignment that the first instruction of a cache line is always aligned?
The 40-bit instructions are byte aligned. This does add more shifting in
the align stage. Once shifted though instructions are easily peeled off
from fixed positions. One consequence is jump targets must be byte
aligned, or routines could be required to be 32-bit aligned, for instance.
That raises an interesting question. If you want to align a branch
target on a 32-bit boundary, or even a cache line, how do you fill
up the rest? If all instructions are 40 bits, you cannot have a
NOP that is not 40 bits, so there would need to be a jump before
a gap that does not fit 40 bits.
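For what it's worth, since 5 is congruent to 1 (mod 4), 5-byte NOPs
alone can reach any 32-bit boundary in at most three instructions; a
sketch of the arithmetic (p being the current byte offset):

/* k NOPs advance the PC by 5k bytes, and 5k mod 4 == k mod 4,
   so we need k == -p (mod 4); at most 3 NOPs are ever required */
int nops_to_align4(unsigned p) { return (4 - (p % 4)) % 4; }

Padding to a 64-byte cache line is harder, since the remaining gap need
not be a multiple of 5 bytes; that is what the 80-byte-line suggestion
below sidesteps.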
On 2025-10-29 2:33 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
Going with a bit result vector in any GPR for compares, then a branch on
bit-set/clear for conditional branches. Might also include branch true /
false.
I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:
if( p && p->next )
Yes, I was going to have at least branch on register 0 (false) / 1 (true),
as there is encoding room to support it. It does add more cases in the
branch eval, but is probably well worth it.
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.
Following the same philosophy. Expecting only some use for 128-bit
floats. Integers can only be 8, 16, 32, or 64 bits.
With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.
Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:
CVTfd Rt,Rf
FMUL Rt,Rt,#1.425D0
CVTdf Rd,Rt
Which is subject to double rounding once at the FMUL and again at the
down conversion. I thought about the problem and it seems fairly easy
to gate the 24-bit fraction into the multiplier tree along with the
53-bit fraction of the constant, and then normalize and round the
result dropping out of the tree--avoiding the double rounding case.
Now, the compiler emits:
FMULf Rd,Rf,#1.425D0
saving 2 instructions along with the higher precision.
Improves the accuracy(?) of algorithms, but seems a bit specific to me.
Are there other instruction sequences where double-rounding would be good
to avoid?
Seems like HW could detect the sequence and fuse the instructions.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
At this point, the discussion is academic, as Robert has said he has 6
bit register specifiers in the instructions.
He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.
But my issue had nothing
to do with SIMD registers, as he said he supported 128 bit arithmetic
and the "high" registers were used for that.
As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.
Intel designed SSE with scalar instructions that use only 32 bits out
of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
(and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
register, and various AVX-512 variants with 32-bit and 64-bit scalars,
and 128-bit and 256-bit operations in addition to the 512-bit ones.
They are obviously not worried about waste.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
At this point, the discussion is academic, as Robert has said he has 6
bit register specifiers in the instructions.
He could still make these registers have 128 bits rather than pairing
registers for 128-bit operation.
But my issue had nothing
to do with SIMD registers, as he said he supported 128 bit arithmetic
and the "high" registers were used for that.
As far as waste etc. is concerned, it does not matter if the 128-bit
operation is a SIMD operation or a scalar 128-bit operation.
Intel designed SSE with scalar instructions that use only 32 bits out
of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
(and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
register, and various AVX-512 variants with 32-bit and 64-bit scalars,
and 128-bit and 256-bit operations in addition to the 512-bit ones.
They are obviously not worried about waste.
Which only goes to prove that x86 is not RISC.
Thomas Koenig <tkoenig@netcologne.de> writes:
Robert Finch <robfi680@gmail.com> schrieb:
On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
I understand the temptation to go for more bits :-) What is your
instruction alignment? Bytewise so 40 bits fit, or do you have some
alignment that the first instruction of a cache line is always aligned?
The 40-bit instructions are byte aligned. This does add more shifting in
the align stage. Once shifted though instructions are easily peeled off
from fixed positions. One consequence is jump targets must be byte
aligned OR routines could be required to be 32-bit aligned for instance.
That raises an interesting question. If you want to align a branch
target on a 32-bit boundary, or even a cache line, how do you fill
up the rest? If all instructions are 40 bits, you cannot have a
NOP that is not 40 bits, so there would need to be a jump before
a gap that does not fit 40 bits.
iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
instead of 64).
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Intel designed SSE with scalar instructions that use only 32 bits out
of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
(and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
register, and various AVX-512 variants with 32-bit and 64-bit scalars,
and 128-bit and 256-bit operations in addition to the 512-bit ones.
They are obviously not worried about waste.
Which only goes to prove that x86 is not RISC.
I don't see that following at all, but it inspired a closer look at
the usage/waste of register bits in RISCs:
Every 64-bit RISC, starting with MIPS-IV and Alpha, wastes a lot of
precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
64-bit registers rather than following the idea of Intel and Robert
Finch of splitting the 64-bit register into the double number of 32-bit
registers; this idea can be extended to eliminate waste by having the
quadruple number of 16-bit registers that can be joined into 32-bit
and 64-bit registers when needed, or, even better, the octuple number
of 8-bit registers that can be joined into 16-bit, 32-bit, and 64-bit
registers. We can even resurrect the character-oriented or
digit-oriented architectures of the 1950s.
Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
SI, DI, BP, and SP.
In the 32-bit extension, they did not add ways to
access the third and fourth byte, or the second wyde (16-bit value).
In the 64-bit extension, AMD added ways to access the low byte of
every register (in addition to AH-DH), but no way to access the second
byte of other registers than RAX-RDX, nor ways to access higher wydes,
or 32-bit units. Apparently they were not concerned about this kind
of waste. For the 8086 the explanation is not trying to avoid waste,
but an easy automatic mapping from 8080 code to 8086 code.
Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
32-bit register alone, which one can consider to be useful for storing
data in those bits (and in case of AL, AH actually provides a
convenient way to access some of the bits, and vice versa), but leads
to partial-register stalls. The hardware contains fast paths for some
common cases of partial-register writes, but AFAIK AH-DH do not get
fast paths in most CPUs.
By contrast, RISCs waste the other 24 or 56 bits on a byte load by
zero-extending or sign-extending the byte.
Alpha avoids wasting register bits for some idioms by keeping up to 8
bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
the individual bytes of a register.
IIRC the original HPPA has 32 or so 64-bit FP registers, which they
then split into 58? 32-bit FP registers. I don't know how they
further evolved that feature.
- anton
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
<snip>
That raises an interesting question. If you want to align a branch
target on a 32-bit boundary, or even a cache line, how do you fill
up the rest? If all instructions are 40 bits, you cannot have a
NOP that is not 40 bits, so there would need to be a jump before
a gap that does not fit 40 bits.
iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
instead of 64).
There is a cache level (L2 usually, I believe) where icache and
dcache are no longer separate. Wouldn't this cause problems
or inefficiencies?
Michael S <already5chosen@yahoo.com> writes:
According to my understanding, EV4 had no SIMD-style instructions.
My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
instructions, were already present in EV4.
The architecture
description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
not say that some implementations don't include these instructions in hardware, whereas for the Multimedia support instructions (Section
4.13), the reference does say that.
- anton
On Thu, 30 Oct 2025 22:19:18 GMT...
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
instructions, were already present in EV4.
I didn't consider these instructions as SIMD. Maybe I should have.
Looks like these instructions are intended to accelerate string
processing. That's unusual for the first wave of SIMD extensions.
Michael S <already5chosen@yahoo.com> writes:
On Thu, 30 Oct 2025 22:19:18 GMT...
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
instructions, were already present in EV4.
I didn't consider these instructions as SIMD. Maybe I should have.
They definitely are, but they were not touted as such at the time, and
they use the GPRs, unlike most SIMD extensions to instruction sets.
Looks like these instructions are intended to accelerate string
processing. That's unusual for the first wave of SIMD extensions.
Yes. This was pre-first-wave. The Alpha architects just wanted to
speed up some common operations that would otherwise have been
relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
benchmark (maybe Dhrystone), someone claimed that these string
instructions gave Alpha an unfair advantage.
- anton
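The flavor of trick CMPBGE enables is the classic zero-byte scan in
strlen-type loops; the same idea in portable C (a well-known SWAR
idiom, not the Alpha encoding):

#include <stdint.h>

/* nonzero iff some byte of x is zero: the subtraction borrows out of
   a byte exactly when that byte is 0x00 (and its high bit was clear) */
int has_zero_byte(uint64_t x) {
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}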
In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going bigger;
so higher-resolutions had typically worked to reduce the bits per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
the color);
One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.
Robert Finch <robfi680@gmail.com> posted:
Improves the accuracy? of algorithms, but seems a bit specific to me.
It is down in the 1% footprint area.
Are there other instruction sequences where double-rounding would be good
to avoid?
Back when I joined Moto (1983) there was a lot of talk about double
roundings and how they could screw up various algorithms, but mainly in
the 64-bit versus 80-bit stuff of the 68881, where you got 11 more bits
of precision and thus took a chance of 2/2^10 of a double rounding.
Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
problem is greatly ameliorated although technically still present.
On Thu, 30 Oct 2025 16:46:14 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Alpha avoids wasting register bits for some idioms by keeping up to 8
bytes in a register in SIMD style (a few years before the wave of SIMD
extensions across the industry), but still provides no direct name for
the individual bytes of a register.
According to my understanding, EV4 had no SIMD-style instructions.
They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
ahead of VIS in UltraSPARC.
MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Improves the accuracy? of algorithms, but seems a bit specific to me.
It is down in the 1% footprint area.
Are there other instruction sequences where double-rounding would be good
to avoid?
Back when I joined Moto (1983) there was a lot of talk about double
roundings and how they could screw up various algorithms, but mainly in
the 64-bit versus 80-bit stuff of the 68881, where you got 11 more bits
of precision and thus took a chance of 2/2^10 of a double rounding.
Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
problem is greatly ameliorated although technically still present.
Actually, for the five required basic operations, you can always do the
op in the next higher precision, then round again down to the target,
and get exactly the same result.
This is because the mantissa lengths (including the hidden bit) increase
to at least 2n+2:
f16 1:5:10 (1+10=11, 11*2+2 = 24)
f32 1:8:23 (1+23=24, 24*2+2 = 50)
f64 1:11:52 (1+52=53, 53*2+2 = 108)
f128 1:15:112 (1+112=113)
You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that would require a triple sized mantissa.
The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
was set to 64-bit precision.
Terje
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
SI, DI, BP, and SP.
{ABCD}X registers were data.
{SDBS} registers were pointer registers.
Oh and BTW: using x86-history as justification for an architectural
feature is "bad style".
But gains the property that the whole register contains 1 proper value
{range-limited to the container size whence it came}. This in turn makes
tracking values easy--in fact placing several different sized values
in a single register makes it essentially impossible to perform value
analysis in the compiler.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the
op in the next higher precision, then round again down to the target,
and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
On 10/31/2025 1:21 PM, BGB wrote:
...
In a lot of the cases, I was using an 8-bit indexed color or color-
cell mode. For indexed color, one needs to send each image through a
palette conversion (to the OS color palette); or run a color-cell
encoder. Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going
bigger; so higher-resolutions had typically worked to reduce the bits
per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
the color);
One possibility also being to use an indexed color pair for every 8x8,
allowing for a 1.25 bpp color cell mode.
Expanding on this:
Idea 1, original:
Each group of 2x2 pixels understood as:
G R
B G
With each pixel alternating color.
But, slightly better for quality is to operate on blocks of 4x4 pixels,
with the pixel bits encoding color indirectly for the whole 4x4 block:
G R G B
B G R G
G R G B
B G R G
If >= 4 G bits are set, G is High.
If >= 2 R bits are set, R is High.
If >= 2 B bits are set, B is High.
If > 8 bits are set, I is High.
The non-set pixels usually assume either 0000 (Black) or 1000 (Dark
Grey) depending on the I bit. Or, a low-intensity version of the main color
if over 75% of a given bit are set in a given way (say, for mostly flat
color blocks).
Still kinda sucks, but allows a crude approximation of 16 color graphics
at 1 bpp...
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the
op in the next higher precision, then round again down to the target,
and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).
Rounding to odd is basically the same as rounding to sticky, i.e. if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed, there you do in fact need to check
all the bits.
Terje
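A C sketch of rounding to odd as described, where kept holds the
truncated mantissa and discarded everything shifted out:

#include <stdint.h>

/* OR the "any discarded bit set" fact into the ulp; a value rounded this
   way can later be rounded to a narrower format without the
   double-rounding anomaly */
uint64_t round_to_odd(uint64_t kept, uint64_t discarded) {
    return kept | (discarded != 0);
}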
On 10/31/2025 2:32 PM, BGB wrote:
On 10/31/2025 1:21 PM, BGB wrote:
...
In a lot of the cases, I was using an 8-bit indexed color or color-
cell mode. For indexed color, one needs to send each image through a
palette conversion (to the OS color palette); or run a color-cell
encoder. Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going
bigger; so higher-resolutions had typically worked to reduce the bits
per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also
encodes the color);
One possibility also being to use an indexed color pair for every
8x8, allowing for a 1.25 bpp color cell mode.
Expanding on this:
Idea 1, original:
Each group of 2x2 pixels understood as:
G R
B G
With each pixel alternating color.
But, slightly better for quality is to operate on blocks of 4x4
pixels, with the pixel bits encoding color indirectly for the whole
4x4 block:
G R G B
B G R G
G R G B
B G R G
So, if >= 4 G bits are set, G is High.
So, if >= 2 R bits are set, R is High.
So, if >= 2 B bits are set, B is High.
If > 8 bits are set, I is high.
The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
Grey) depending on I bit. Or, a low intensity version of the main
color if over 75% of a given bit are set in a given way (say, for
mostly flat color blocks).
Still kinda sucks, but allows a crude approximation of 16 color
graphics at 1 bpp...
Well, anyways, here is me testing with another variation of the idea
(after thinking about it again).
Using a joke image as a test case here...
https://x.com/cr88192/status/1984694932666261839
This variation uses:
Y R
B G
In this case tiling as:
Y R Y R ...
B G B G ...
Y R Y R ...
B G B G ...
...
Where, Y is a pure luma value.
May or may not use this, or:
Y R B G Y R B G
B G Y R B G Y R
...
But, prior pattern is simpler to deal with.
Note that having every line follow the same pattern (with no
alternation) would lead to obvious vertical lines in the output.
It uses a different (slightly more complicated) color recovery
algorithm, and operates on 8x8 pixel blocks.
With 4x4, there is effectively 4 bits per channel, which is enough to recover 1 bit of color per channel.
With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
are normalized here).
Having both a Y and G channel slightly helps with the color-recovery process; and allows a way to signal a monochrome block (if Y==G, the
block is assumed to be monochrome, and the R/B bits can be used more
freely for expressing luma).
Where:
Chroma accuracy comes at the expense of luma accuracy;
An increased colorspace comes at the cost of spatial resolution of chroma; ...
Dealing with chroma does have the effect of making the dithering process more complicated. As noted, reliable recovery of the color vector is
itself a bit fiddly (and is very sensitive to the encoder side dither process).
The former image was itself an example of an artifact caused by the dithering process, which in this case was over-boosting the green
channel (and rotating the dither matrix would result in drastic color shifts). The later image was mostly after I realized the issue with the dither pattern, and modified how it was being handled (replacing the use
of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
the matrix for each channel).
Image quality isn't great, but then again, not sure how to do that much better with a naive 1 bit/pixel encoding.
I guess, an open question here is whether the color-recovery algorithm
would be practical for hardware / FPGA.
One possibility could be:
Use LUT4 to map 4b -> 2b (as a count)
Then, map 2x2b -> 3b (adder)
Then, map 2x3b -> 4b (adder), then discard LSB.
Then, select max of R/G/B/Y;
This is used as an inverse normalization scale.
Feed each value and scale through a LUT (for R/G/B)
Getting a 5-bit scaled RGB;
Roughly: (Val<<5)/Max
Compose a 5-bit RGB555 value used for each pixel that is set.
Actual pixel decoding process works the same as with 8x8 blocks of 1 bit monochrome, selecting minimum or maximum color based on each bit.
Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic complexity.
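A C sketch of that scaling step; (Val<<5)/Max reaches 32 when Val == Max,
so this assumes the result is meant to be clamped into 5 bits:

#include <stdint.h>

uint8_t scale5(unsigned val, unsigned max) {
    unsigned s = (val << 5) / (max ? max : 1);  /* roughly (Val<<5)/Max */
    return (uint8_t)(s > 31 ? 31 : s);          /* clamp into 5 bits */
}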
Pros/Cons:
+: Looks better than per-pixel Bayer-RGB
+: Looks better than 4x4 RGBI
-: Would require more complex decoder logic;
-: Requires specialized dither logic to not look like broken crap.
-: Doesn't give passable results if handed naive grayscale dithering.
Per-Pixel RGB still holds up OK with naive grayscale dither.
But, this approach is a lot more particular.
The RGBI approach seems intermediate, more likely to decode grayscale patterns as gray.
I guess a more open question is if such a thing could be useful (it is pretty far down the image-quality scale). But, OTOH, with simpler (non- randomized) dither patterns; it can LZ compress OK (depending on image,
can get 0.1 to 0.8 bpp; which is generally JPEG territory).
If combined with delta encoding or similar; could almost be adapted into
a very crappy video codec.
Well, or LZ4, where (at 320x200) one could potentially hold several
frames of video in a 64K sliding window.
But, image quality might be unacceptably poor. Also if decoded in
software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
image quality).
More just interesting that I was able to get things "almost half-way passable" from 1 bpp monochrome.
...
On Sun, 2 Nov 2025 11:36:36 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Rounding to odd is basically the same as rounding to sticky, i.e if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed, there you do in fact need to check
all the bits.
Terje
People use names like guard and sticky bits and sometimes also rounding
bit (e.g. in Wikipedia article) without explanation, as if everybody
had agreed about what they mean. But I don't think that everybody
really agrees.
Michael S wrote:
On Sun, 2 Nov 2025 11:36:36 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its
128-bit floating point arithmetic, for that very reason (I
assume).
Rounding to odd is basically the same as rounding to sticky, i.e if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed, there you do in fact need to
check all the bits.
Terje
People use names like guard and sticky bits and sometimes also
rounding bit (e.g. in Wikipedia article) without explanation, as if everybody had agreed about what they mean. But I don't think that
everybody really agrees.
Within the 754 working group the definition is totally clear:
Guard is the first bit after the normal mantissa.
Sticky is the bit following the guard bit; it is generated by OR'ing
together all subsequent bits in the exact/infinitely precise result.
I.e if an exact result is exactly halfway between two representable
numbers, the Guard bit will be set and Sticky unset.
Ulp (Unit in Last Place) is the final mantissa bit.
Sign is of course the sign in the Sign-Magnitude format used for all
fp numbers.
This means that those four bits in combination suffices to separate
between rounding directions:
Default rounding is nearest-even (in this case Sign does not
matter):
Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
Terje
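As a cross-check of the table: for nearest-even the whole decision
collapses to "round up iff Guard & (Sticky | Ulp)". A C sketch
(mantissa overflow/renormalization not shown):

#include <stdint.h>

/* mant holds the kept mantissa bits; guard/sticky as defined above */
uint64_t round_nearest_even(uint64_t mant, unsigned guard, unsigned sticky) {
    unsigned ulp = (unsigned)(mant & 1);
    return mant + (guard & (sticky | ulp));
}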
On Sun, 2 Nov 2025 16:09:10 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Sun, 2 Nov 2025 11:36:36 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its
128-bit floating point arithmetic, for that very reason (I
assume).
Rounding to odd is basically the same as rounding to sticky, i.e if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed, there you do in fact need to
check all the bits.
Terje
People use names like guard and sticky bits and sometimes also
rounding bit (e.g. in Wikipedia article) without explanation, as if
everybody had agreed about what they mean. But I don't think that
everybody really agree.
Within the 754 working group the definition is totally clear:
I could believe that there is consensus about these names between
current members of 754 working group. But nothing of that sort is
mentioned in the text of the Standard. Which among other things means
that you can not rely on being understood even by new members of 754
working group.
Guard is the first bit after the normal mantissa.
Sticky is the bit following the guard bit, it is generated by OR'ing
together all subsequent bits in the exact/infinitely precise result.
I.e if an exact result is exactly halfway between two representable
numbers, the Guard bit will be set and Sticky unset.
Ulp (Unit in Last Place)) is the final mantissa bit
Sign is of course the sign in the Sign-Magnitude format used for all
fp numbers.
This means that those four bits in combination suffices to separate
between rounding directions:
Default rounding is nearest or even: (In this case Sign does not
matter.)
Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
Terje
I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
them, esp. Guard, differently.
Given the choice, [in the context of binary floating point] I'd rather
not use the term 'guard' at all. Names like 'rounding bit' or
'half-ULP' are far more self-describing.
On 2025-11-02 3:21 a.m., BGB wrote:
<snip>
I think your support for graphics is interesting; something to keep in
mind.
...
In a lot of the cases, I was using an 8-bit indexed color or color-
cell mode. For indexed color, one needs to send each image through a
palette conversion (to the OS color palette); or run a color-cell
encoder. Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going
bigger; so higher-resolutions had typically worked to reduce the
bits per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also
encodes the color);
One possibility also being to use an indexed color pair for every
8x8, allowing for a 1.25 bpp color cell mode.
Expanding on this:
Idea 1, original:
Each group of 2x2 pixels understood as:
G R
B G
With each pixel alternating color.
But, slightly better for quality is to operate on blocks of 4x4
pixels, with the pixel bits encoding color indirectly for the whole
4x4 block:
G R G B
B G R G
G R G B
B G R G
So, if >= 4 of the 8 G bits are set, G is High;
if >= 2 of the 4 R bits are set, R is High;
if >= 2 of the 4 B bits are set, B is High;
and if > 8 of the 16 bits are set, I is High.
The non-set pixels usually assume either 0000 (Black) or 1000 (Dark
Grey), depending on the I bit. Or, a low-intensity version of the main
color if over 75% of a given bit are set a given way (say, for
mostly-flat color blocks).
Still kinda sucks, but allows a crude approximation of 16 color
graphics at 1 bpp...
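In C, that recovery rule is just a few masked popcounts (my sketch; the
mask values assume the 4x4 "G R G B / B G R G" tiling with bit 0 at the
top-left, row-major, and __builtin_popcount is the GCC builtin):

    #include <stdint.h>

    #define G_MASK 0xA5A5u   /* the 8 green positions */
    #define R_MASK 0x4242u   /* the 4 red positions   */
    #define B_MASK 0x1818u   /* the 4 blue positions  */

    /* Recover a 4-bit IRGB color for one 4x4 block of 1-bpp pixels. */
    static uint8_t recover_irgb(uint16_t block)
    {
        uint8_t g = __builtin_popcount(block & G_MASK) >= 4;
        uint8_t r = __builtin_popcount(block & R_MASK) >= 2;
        uint8_t b = __builtin_popcount(block & B_MASK) >= 2;
        uint8_t i = __builtin_popcount(block) > 8;
        return (uint8_t)((i << 3) | (r << 2) | (g << 1) | b);
    }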
Well, anyways, here is me testing with another variation of the idea
(after thinking about it again).
Using a joke image as a test case here...
https://x.com/cr88192/status/1984694932666261839
This variation uses:
Y R
B G
In this case tiling as:
Y R Y R ...
B G B G ...
Y R Y R ...
B G B G ...
...
Where, Y is a pure luma value.
May or may not use this, or:
Y R B G Y R B G
B G Y R B G Y R
...
But, prior pattern is simpler to deal with.
Note that having every line follow the same pattern (with no
alternation) would lead to obvious vertical lines in the output.
This was with a different (slightly more complicated) color-recovery
algorithm, operating on 8x8 pixel blocks.
With 4x4, there are effectively 4 bits per channel, which is enough to
recover 1 bit of color per channel.
With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits
per channel, allowing for roughly a RGB333 color space (though, the
vectors are normalized here).
Having both a Y and G channel slightly helps with the color-recovery
process; and allows a way to signal a monochrome block (if Y==G, the
block is assumed to be monochrome, and the R/B bits can be used more
freely for expressing luma).
Where:
Chroma accuracy comes at the expense of luma accuracy;
An increased colorspace comes at the cost of spatial resolution of
chroma;
...
Dealing with chroma does have the effect of making the dithering
process more complicated. As noted, reliable recovery of the color
vector is itself a bit fiddly (and is very sensitive to the encoder
side dither process).
The former image was itself an example of an artifact caused by the
dithering process, which in this case was over-boosting the green
channel (and rotating the dither matrix would result in drastic color
shifts). The latter image was mostly after I realized the issue with
the dither pattern, and modified how it was being handled (replacing
the use of an 8x8 ordered dither with a 4x4 ordered dither, and then
rotating the matrix for each channel).
Image quality isn't great, but then again, not sure how to do that
much better with a naive 1 bit/pixel encoding.
I guess, an open question here is whether the color-recovery algorithm
would be practical for hardware / FPGA.
One possibility could be:
Use LUT4 to map 4b -> 2b (as a count)
Then, map 2x2b -> 3b (adder)
Then, map 2x3b -> 4b (adder), then discard LSB.
Then, select the max of R/G/B/Y;
This is used as an inverse normalization scale.
Feed each value and scale through a LUT (for R/G/B)
Getting a 5-bit scaled RGB;
Roughly: (Val<<5)/Max
Compose a 5-bit RGB555 value used for each pixel that is set.
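A software model of that pipeline (my sketch; it keeps the exact 0..16
counts rather than the truncated 4-bit hardware counts, and clamps the
(Val<<5)/Max scaling):

    #include <stdint.h>

    /* 4-bit field -> count of set bits (the LUT4 stage). */
    static const uint8_t lut4[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

    static unsigned count16(uint16_t x)   /* LUT4 stage, then adder tree */
    {
        return (lut4[x & 15] + lut4[(x >> 4) & 15])
             + (lut4[(x >> 8) & 15] + lut4[(x >> 12) & 15]);
    }

    /* Per-channel bit masks in, one RGB555 color out. */
    static uint16_t block_rgb555(uint16_t rbits, uint16_t gbits,
                                 uint16_t bbits, uint16_t ybits)
    {
        unsigned r = count16(rbits), g = count16(gbits);
        unsigned b = count16(bbits), y = count16(ybits);
        unsigned max = r;                  /* inverse normalization scale */
        if (g > max) max = g;
        if (b > max) max = b;
        if (y > max) max = y;
        if (max == 0) return 0;
        unsigned r5 = (r << 5) / max; if (r5 > 31) r5 = 31;
        unsigned g5 = (g << 5) / max; if (g5 > 31) g5 = 31;
        unsigned b5 = (b << 5) / max; if (b5 > 31) b5 = 31;
        return (uint16_t)((r5 << 10) | (g5 << 5) | b5);
    }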
The actual pixel decoding process works the same as with 8x8 blocks of
1-bit monochrome, selecting minimum or maximum color based on each bit.
Possibly, Y could also be used to select "relative" minimum and
maximum values, vs full intensity and black, but this would add more
logic complexity.
Pros/Cons:
+: Looks better than per-pixel Bayer-RGB
+: Looks better than 4x4 RGBI
-: Would require more complex decoder logic;
-: Requires specialized dither logic to not look like broken crap.
-: Doesn't give passable results if handed naive grayscale dithering.
Per-Pixel RGB still holds up OK with naive grayscale dither.
But, this approach is a lot more particular.
The RGBI approach seems intermediate, more likely to decode grayscale
patterns as gray.
I guess a more open question is if such a thing could be useful (it is
pretty far down the image-quality scale). But, OTOH, with simpler
(non-randomized) dither patterns; it can LZ compress OK (depending on
image, can get 0.1 to 0.8 bpp; which is generally JPEG territory).
If combined with delta encoding or similar; could almost be adapted
into a very crappy video codec.
Well, or LZ4, where (at 320x200) one could potentially hold several
frames of video in a 64K sliding window.
But, image quality might be unacceptably poor. Also if decoded in
software, the color-reconstruction is likely to be more
computationally expensive than just using a CRAM style codec (while
also giving worse image quality).
More just interesting that I was able to get things "almost half-way
passable" from 1 bpp monochrome.
...
I think your support for graphics is interesting; something to keep in
mind for displays with limited RAM.
I use a high-speed DDR memory interface and video fifo (line cache).
Colors are broken into components, with the number of bits per
component (up to 10) specified in CRs. Colors are passed around as
32-bit values
for video processing. Using the colors directly is much easier than
dealing with dithered colors.
The graphics accelerator just spits out colors to the frame buffer
without needing to go through a dithering stage.
No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
that generally fall below this (Eg: 4 bit/channel on the Nexys A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so RGB222+H/V Sync;
or needing to use 2 PMOD connections for the VGA). The usual workaround
was also to perform dithering while driving the VGA output (with ordered dither in the Verilog).
Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
cells, so 32K of VRAM used (for 80x25 cells).
In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
for graphics (32K). Early on, this was used as the graphics mode for
Doom and similar, before I later expanded VRAM to 128K and switched to 320x200 Hicolor.
The bitmap modes are non-raster, generally with pixels packed into 8x8
or 4x4 blocks.
4x4:
16bpp: pixels in raster order.
8bpp: raster order, 32-bits per row
4bpp: Raster order, 16-bits per row
And, 8x8:
4bpp: Takes 16bpp layout, splits each pixel into 2x2.
2bpp: Takes 8bpp layout, splits each pixel into 2x2.
1bpp: Raster order, 1bpp, but same order as text glyphs.
With MSB in upper left, LSB in lower right.
On 2025-11-02 3:58 p.m., BGB wrote:
<snip>
No real need to go much beyond RGB555, as the FPGA boards have VGA
DACs that generally fall below this (Eg: 4 bit/channel on the Nexys
A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so
RGB222+H/V Sync; or needing to use 2 PMOD connections for the VGA).
The usual workaround was also to perform dithering while driving the
VGA output (with ordered dither in the Verilog).
I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
I tried to get a display channel interface working but no luck. VGA is
so old.
Have you tried dithering based on the frame (temporal dithering vs
spatial dithering)? First frame is one set of colors, the next frame is
a second set of colors. I think it may work if the refresh rate is high
enough (120 Hz). IIRC I tried this a while ago and was not happy with
the results. I also tried rotating the dithering pattern around each frame.
<snip>
Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
cells, so 32K of VRAM used (for 80x25 cells).
For the text mode, an 800x600 mode is used on my system, with 12x18
cells so that I can read the display at a distance (64x32 characters).
The font then has 64 block-graphic characters of 2x3 blocks. Low-res
graphics can be done in text mode with the appropriate font size and
block-graphics characters. Color selection is limited though.
In this case, a 40x25 color-cell mode (with 256-bit cells) could be
used for graphics (32K). Early on, this was used as the graphics mode
for Doom and similar, before I later expanded VRAM to 128K and
switched to 320x200 Hicolor.
The bitmap modes are non-raster, generally with pixels packed into 8x8
or 4x4 blocks.
4x4:
16bpp: pixels in raster order.
8bpp: raster order, 32-bits per row
4bpp: Raster order, 16-bits per row
And, 8x8:
4bpp: Takes 16bpp layout, splits each pixel into 2x2.
2bpp: Takes 8bpp layout, splits each pixel into 2x2.
1bpp: Raster order, 1bpp, but same order as text glyphs.
With MSB in upper left, LSB in lower right.
<snip>
Michael S wrote:
I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
them, esp. Guard, differently.
Given the choice, [in the context of binary floating point] I'd rather
not use the term 'guard' at all. Names like 'rounding bit' or
'half-ULP' are far more self-describing.
Guard also works for decimal FP, where you need a single Sticky bit if
the Guard digit is equal to 5.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Michael S wrote:
I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
them, esp. Guard, differently.
Given the choice, [in the context of binary floating point] I'd rather
not use the term 'guard' at all. Names like 'rounding bit' or
'half-ULP' are far more self-describing.
Guard also works for decimal FP, where you need a single Sticky bit if
the Guard digit is equal to 5.
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.
I think this has about the same code density as having a branch to a displacement from the IP.
Using a fused compare-and-branch instruction for Qupls4, there is not
enough room in the instruction for a large branch displacement (10
bits). So, my thought is to branch to a register value instead.
There is already an add-to-instruction-pointer instruction that can be
used to generate relative addresses.
By moving the register load outside of a loop, the dynamic instruction
count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
set for future support, but I think it is best to start with just a
single format.
Code:
AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3 ; branch to R3 if R1 < R2
Versus:
CMP R3,R1,R2
BLT R3,displacement
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, its information density is insufficient to
supply required semantics in given size of container.
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, its information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
Scott Lurndal wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which
is a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, its information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
It needs to be comparable to binary FP:
A 64-bit double provides 53 mantissa bits, which corresponds to almost
16 decimal digits, while fp128 gives us 113 bits, or a smidgen over 34
digits.
The corresponding 128-bit DFP format also provides 34 decimal digits,
with an exponent range which covers 10^-6143 to 10^6144, while the 15
exponent bits in binary128 cover 2^-16k to 2^16k, corresponding to
5.9e(+/-)4931.
I.e. the DFP format has the same precision and a larger range than
BFP.
Terje
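Those digit and range figures are easy to sanity-check (a throwaway
program of mine):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        printf("binary64 : %5.2f decimal digits\n",  53 * log10(2.0));
        printf("binary128: %5.2f decimal digits\n", 113 * log10(2.0));
        double e = 16383 * log10(2.0);   /* largest binary128 exponent */
        printf("binary128 max ~ %.1fe%d\n",
               pow(10.0, e - floor(e)), (int)floor(e));
        return 0;
    }

which prints about 15.95, 34.02, and 5.9e4931, matching the figures
above.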
On Tue, 04 Nov 2025 15:19:08 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, its information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
What is not clear about 'in given size of container'?
Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
to be contained within 111 bits.
With BCD encoding one would need 133 bits.
On Tue, 4 Nov 2025 16:52:18 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Scott Lurndal wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which
is a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, its information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
It needs to be comparable to binary FP:
A 64-bit double provides 53 mantissa bits, which corresponds to almost
16 decimal digits, while fp128 gives us 113 bits, or a smidgen over 34
digits.
The corresponding 128-bit DFP format also provides 34 decimal digits,
with an exponent range which covers 10^-6143 to 10^6144, while the 15
exponent bits in binary128 cover 2^-16k to 2^16k, corresponding to
5.9e(+/-)4931.
I.e. the DFP format has the same precision and a larger range than
BFP.
Terje
Nitpick:
In the best case, i.e. where the mantissa of BFP is close to 2 and the
MS digit of DFP = 9, the [relative] precision is indeed almost identical.
But in the worst case, i.e. where the mantissa of BFP is close to 1
and the MS digit of DFP = 1, the [relative] precision of BFP is 5 times
better.
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction. The
prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
But in practice, it turned out that Intel and AMD processors had much
better performance on indirect-branch intensive workloads in the early
2000s without this architectural feature. What happened?
The IA-32 and AMD64 microarchitects implemented indirect-branch
prediction; in the early 2000s it was based on the BTB, which these
CPUs need for fast direct branching anyway. They were not content
with that, and have implemented history-based indirect branch
predictors in the meantime, which improve the performance even more.
By contrast, Power and IA-64 implementations apparently rely on
getting the target-address early enough, and typically predict that
the indirect branch will go to the current contents of the
branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
instruction stream just before the take-branch instruction, it takes
several cycles until the prepare-to-branch actually can move the
target to the branch-target register. In case of an OoO
implementation, the number of cycles tends to be longer. It's
essentially a similar latency as in a branch misprediction.
That all would not be so bad, if the compilers would move the prepare-to-branch instructions sufficiently far away from the
take-branch instruction. But gcc certainly has not done so whenever I
looked at code it generated for PowerPC or IA-64.
Here is some data for code that focusses on indirect-branch
performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:
Numbers are cycles per indirect branch, smaller is faster, the years
are the release dates of the CPUs:
First, machines from the early 2000s:
sub- in- repl.
routine direct direct switch call switch CPU year
9.6 8.0 9.5 23.1 38.6 Alpha 21264B 800MHz ~2000
4.7 8.1 9.5 19.0 21.3 Pentium III 1000MHz 2000
18.4 8.5 10.3 24.5 29.0 Athlon 1200MHz 2000
8.6 14.2 15.3 23.4 30.2 Pentium 4 2.26 2002
13.3 10.3 12.3 15.7 18.7 Itanium 2 (McKinley) 900MHz 2002
5.7 9.2 12.3 16.3 17.9 PPC 7447A 1066MHz 2004
7.8 12.8 12.9 30.2 39.0 PPC 970 2000MHz 2002
Ignore the first column (it uses call and return), the others all need
an indirect branch or indirect call ("call" column) per dispatch, with
varying amounts of other instructions; "direct" needs the fewest
instructions.
And here are results with some newer machines:
sub- in- repl.
routine direct direct switch call switch CPU year
4.9 5.6 4.3 5.1 7.64 Pentium M 755 2000MHz 2004
4.4 2.2 2.0 20.3 18.6 3.3 Xeon E3-1220 3100MHz 2011
4.0 2.3 2.3 4.0 5.1 3.5 Core i7-4790K 4400MHz 2013
4.2 2.1 2.0 4.9 5.2 2.7 Core i5-6600K 4000MHz 2015
5.7 3.2 3.9 7.0 8.6 3.7 Cortex-A73 1800MHz 2016
4.2 3.3 3.2 17.9 23.1 4.2 Ryzen 5 1600X 3600MHz 2017
6.9 24.5 27.3 37.1 33.5 36.6 Power9 3800MHz 2017
3.8 1.0 1.1 3.8 6.2 2.2 Core i5-1135G7 4200MHz 2020
The age of the Pentium M would suggest putting it into the earlier
table, but given its clear performance-per-clock advantage over the
other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
to have a history-based indirect-branch predictor.
It seems that, while the AMD64 microarchitectures improved not just in
clock rate, but also in performance per clock for this microbenchmark
(thanks to history-based indirect-branch predictors), the Power 9
still relies on its split-branch architectural feature, resulting in slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
other CPUs.
Particularly notable is the Core i5-1135G7, which takes one indirect
branch per cycle.
I have to take additional measurements with other Power and AMD64
processors.
Couldn't the Power and IA-64 CPUs use history-based branch prediction,
too? Of course, but then it would be even more obvious that the
split-branch architecture provides no benefit.
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
tell the CPU
what the target is (like VEC in My66000)
I have no idea what VEC does, but all indirect-branch architectures
are about telling the CPU what the target is.
just use a general
purpose register with a general-purpose instruction.
That turns out to be the winner.
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
If you want to be able to perform one taken branch per cycle (or
more), you always need prediction.
If you use a link register or a special instruction, the CPU could
do that.
It turns out that this does not work well in practice.
- anton
Michael S <already5chosen@yahoo.com> writes:
On Tue, 04 Nov 2025 15:19:08 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, its information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
What is not clear about 'in given size of container'?
Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
to be contained within 111 bits.
With BCD encoding one would need 133 bits.
I guess it wasn't clear that my question was regarding
the necessity of providing 'hidden' bits for BCD floating
point.
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
Or "Never bet against branch prediction".
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction. The
I first heard about this in 1982 from Burton Smith.
prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC
acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is
certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program in 1974.
MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
Or "Never bet against branch prediction".
I have probably mentioned this before, once or twice, but I'm actually
quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:
I had (accidentally) written the first public mention of the FDIV bug
(on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)
Due to this Intel invited me to receive an early engineering prototype
of the PentiumPro, together with an NDA-covered briefing about its architecture.
Before the start of that briefing I suggested that I should start off on
the blackboard by showing what I had been able to figure out on my own,
then I proceeded to pretty much exactly cover every single feature on
the cpu, with one glaring exception:
Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
ways of one or two layers of branches, discarding the non-taken paths as
the branch direction info became available.
That's the point when they got to brag about how having a much, much
better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
any eager execution would have the resources for.
As you said: "Never bet against branch prediction".
Terje
On 11/4/2025 11:15 AM, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction. The
I first heard about this in 1982 from Burton Smith.
prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC
acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is
certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program in 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together.
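In software the 1024-entry decode table can be generated in a page of C
(my reconstruction of the DPD declet rules from IEEE 754-2008; treat it
as a sketch and check it against the standard before relying on it).
The inverse 4096-entry BCD-to-declet table can be built by inverting
this one:

    #include <stdint.h>

    static uint16_t dpd2bcd[1024];   /* 10-bit declet -> 3 BCD digits */

    /* Decode one declet, bits p..y from MSB to LSB. */
    static uint16_t decode_declet(unsigned d)
    {
        unsigned p = (d>>9)&1, q = (d>>8)&1, r = (d>>7)&1;
        unsigned s = (d>>6)&1, t = (d>>5)&1, u = (d>>4)&1;
        unsigned v = (d>>3)&1, w = (d>>2)&1, x = (d>>1)&1, y = d&1;
        unsigned d2, d1, d0;

        if (!v) {                      /* three small digits (0..7) */
            d2 = p*4+q*2+r; d1 = s*4+t*2+u; d0 = w*4+x*2+y;
        } else if (!w && !x) {         /* only d0 is 8 or 9 */
            d2 = p*4+q*2+r; d1 = s*4+t*2+u; d0 = 8+y;
        } else if (!w &&  x) {         /* only d1 large */
            d2 = p*4+q*2+r; d1 = 8+u;       d0 = s*4+t*2+y;
        } else if ( w && !x) {         /* only d2 large */
            d2 = 8+r;       d1 = s*4+t*2+u; d0 = p*4+q*2+y;
        } else if (!s && !t) {         /* wx==11: d2 and d1 large */
            d2 = 8+r;       d1 = 8+u;       d0 = p*4+q*2+y;
        } else if (!s &&  t) {         /* d2 and d0 large */
            d2 = 8+r;       d1 = p*4+q*2+u; d0 = 8+y;
        } else if ( s && !t) {         /* d1 and d0 large */
            d2 = p*4+q*2+r; d1 = 8+u;       d0 = 8+y;
        } else {                       /* all three large */
            d2 = 8+r;       d1 = 8+u;       d0 = 8+y;
        }
        return (uint16_t)((d2 << 8) | (d1 << 4) | d0);
    }

    static void init_dpd_table(void)
    {
        for (unsigned d = 0; d < 1024; d++)
            dpd2bcd[d] = decode_declet(d);
    }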
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the >>>> op in the next higher precision, then round again down to the target,
and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Likely. My 66000 also has RNO; Round Nearest Random is defined but not
yet available, and Round Away from Zero is also defined and available.
Robert Finch <robfi680@gmail.com> schrieb:
Contemplating having conditional branch instructions branch to a target
value in a register instead of using a displacement.
I think this has about the same code density as having a branch to a
displacement from the IP.
Should be possible. A question is if you want to have a special
register for that (like POWER's link register), tell the CPU
what the target is (like VEC in My66000) or just use a general-purpose
register with a general-purpose instruction.
Using a fused compare-and-branch instruction for Qupls4
Is that the name of your architecture, or an instruction? (That
may have been mentioned upthread, in that case I don't remember).
there is not
enough room in the instruction for a large branch displacement (10
bits). So, my thought is to branch to a register value instead.
There is already an add-to-instruction-pointer instruction that can be
used to generate relative addresses.
That makes sense.
By moving the register load outside of a loop, the dynamic instruction
count can be reduced. I think this solution is a bit better than having
compare and branch as two separate instructions, or having an extended
constant added to the branch instruction.
Are you talking about a normal loop condition or a jump out of
a loop?
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
If you use a link register or a special instruction, the CPU could
do that.
The 10-bit displacement format could also be supported, but it is yet
another branch instruction format. I may leave holes in the instruction
set for future support, but I think it is best to start with just a
single format.
Code:
AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3 ; branch to R3 if R1 < R2
Versus:
CMP R3,R1,R2
BLT R3,displacement
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it >><https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
That is the problem with deleted features - compiler writers have
to support them forever, and interaction with other features can
lead to problems.
On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Contemplating having conditional branch instructions branch to a target
value in a register instead of using a displacement.
I think this has about the same code density as having a branch to a
displacement from the IP.
Should be possible. A question is if you want to have a special
register for that (like POWER's link register), tell the CPU
what the target is (like VEC in My66000) or just use a general-purpose
register with a general-purpose instruction.
Using a fused compare-and-branch instruction for Qupls4
Is that the name of your architecture, or an instruction? (That
may have been mentioned upthread, in that case I don't remember).
That was the name of the architecture, but I am being fickle and
scrapping it, restarting with the Qupls2024 architecture, evolved into
Qupls2026.
there is not
enough room in the instruction for a large branch displacement (10
bits). So, my thought is to branch to a register value instead.
There is already an add-to-instruction-pointer instruction that can be
used to generate relative addresses.
That makes sense.
Using 48-bit instructions now, so there is enough room for an 18-bit
displacement. Still having branch to register as well.
By moving the register load outside of a loop, the dynamic instruction
count can be reduced. I think this solution is a bit better than having
compare and branch as two separate instructions, or having an extended
constant added to the branch instruction.
Are you talking about a normal loop condition or a jump out of
a loop?
Any loop condition that needs a displacement constant. The constant
being loaded into a register.
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
If you use a link register or a special instruction, the CPU could
do that.
The 10-bit displacement format could also be supported, but it is yet
another branch instruction format. I may leave holes in the instruction
set for future support, but I think it is best to start with just a
single format.
Code:
AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3 ; branch to R3 if R1 < R2
Versus:
CMP R3,R1,R2
BLT R3,displacement
On 11/4/2025 3:44 PM, Terje Mathisen wrote:
MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
Or "Never bet against branch prediction".
I have probably mentioned this before, once or twice, but I'm actually
quite proud of the meeting I had with Intel Santa Clara in the spring
of 1995:
I had (accidentally) written the first public mention of the FDIV bug
(on comp.sys.intel) in Oct 1994, then together with Cleve Moler of
MathWorks/MatLab fame led the effort to develop a minimum cost sw
workaround for the bug. (My code became part of all/most x86 compiler
runtimes for the next few years.)
Due to this Intel invited me to receive an early engineering prototype
of the PentiumPro, together with an NDA-covered briefing about its
architecture.
Before the start of that briefing I suggested that I should start off
on the blackboard by showing what I had been able to figure out on my
own, then I proceeded to pretty much exactly cover every single
feature on the cpu, with one glaring exception:
Based on the useful but not great branch predictor on the Pentium I
told them that I expected the P6 to employ eager execution, i.e
execute both ways of one or two layers of branches, discarding the
non-taken paths as the branch direction info became available.
That's the point when they got to brag about how having a much, much
better branch predictor was better both from a performance and a power
viewpoint, since out of order execution could predict much deeper than
any eager execution would have the resources for.
As you said: "Never bet against branch prediction".
Branch prediction is fun.
When I looked around online before, a lot of stuff about branch
prediction talked about fairly large and convoluted schemes for the
branch predictors.
But, then always at the end of it using 2-bit saturating counters:
weakly taken, weakly not-taken, strongly taken, strongly not taken.
But, in my fiddling, there was seemingly a simple but moderately
effective strategy:
Keep a local history of taken/not-taken;
XOR this with the low-order-bits of PC for the table index;
Use a 5/6-bit finite-state-machine or similar.
Can model repeating patterns up to ~ 4 bits.
Where, the idea was that the state-machine is updated with the current
state and branch direction, giving the next state and next predicted
branch direction (for this state).
Could model slightly more complex patterns than the 2-bit saturating
counters, but it is sort of a partial mystery why (for mainstream
processors) more complex lookup schemes and 2-bit state were preferable
to a simpler lookup scheme and 5-bit state.
Well, apart from the relative "dark arts" needed to cram 4-bit patterns
into a 5-bit FSM (it is a bit easier if limiting the patterns to 3 bits).
Then again, I had noted before that LLMs are seemingly also not really
able to figure out how to make a 5-bit FSM to model a full set of 4-bit
patterns.
Then again, I wouldn't expect it to be all that difficult of a problem
for someone who is "actually smart"; so presumably chip designers could
have done something similar.
Well, unless maybe the argument is that 5 or 6 bits of storage would
cost more than 2 bits, but then presumably needing to have significantly
larger tables (to compensate for the relative predictive weakness of
2-bit state) would have cost more than the cost of smaller tables of
6-bit state?...
Say, for example, 2b:
00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
10_0 => 10_0 //strongly not taken, dir=0
10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
11_0 => 01_1 //strongly taken, dir=0
11_1 => 11_1 //strongly taken, dir=1 (goes weak)
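That table, plus the history^PC indexing described earlier, is only a
few lines of C (a sketch of mine; table sizes are arbitrary, and a
5/6-bit FSM would just widen the state and transition tables):

    #include <stdint.h>

    #define TBL_BITS 10
    #define TBL_MASK ((1u << TBL_BITS) - 1)

    static uint8_t  state[1 << TBL_BITS];  /* 2-bit states as tabled above */
    static uint16_t hist [1 << TBL_BITS];  /* per-branch local history     */

    /* next_state[state][taken], transcribed from the 2b table above. */
    static const uint8_t next_state[4][2] = {
        {2, 1},   /* 00 weak NT   -> strong NT / weak T   */
        {0, 3},   /* 01 weak T    -> weak NT   / strong T */
        {2, 0},   /* 10 strong NT -> strong NT / weak NT  */
        {1, 3},   /* 11 strong T  -> weak T    / strong T */
    };

    static unsigned index_of(uint32_t pc)
    {
        return (pc ^ hist[pc & TBL_MASK]) & TBL_MASK;
    }

    static int predict_taken(uint32_t pc)
    {
        return state[index_of(pc)] & 1;    /* low bit = taken states */
    }

    static void update(uint32_t pc, int taken)
    {
        unsigned i = index_of(pc);         /* same index as at predict time */
        state[i] = next_state[state[i]][!!taken];
        hist[pc & TBL_MASK] = (uint16_t)((hist[pc & TBL_MASK] << 1) | !!taken);
    }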
Can expand it to 3 bits, for 2-bit patterns: as above, plus 4 more
alternating states and slightly different transition logic.
Say (abbreviated):
000 weak, not taken
001 weak, taken
010 strong, not taken
011 strong, taken
100 weak, alternating, not-taken
101 weak, alternating, taken
110 strong, alternating, not-taken
111 strong, alternating, taken
The alternating states just flip-flop between taken and not-taken.
The weak states can move between any of the 4.
The strong states are used if the pattern is reinforced.
Going up to 3-bit patterns is more of the same (add another bit,
doubling the number of states). Seemingly something goes nasty when
getting to 4-bit patterns though (one can't fit both weak and strong
states for the longer patterns, so the 4-bit patterns effectively only
exist as weak states which partly overlap with the weak states for the
3-bit patterns).
But, yeah, not going to type out state tables for these ones.
Not proven, but I suspect that an arbitrary 5-bit pattern within a 6-bit
state might be impossible. Although there would be sufficient
state-space for the looping 5-bit patterns, there may not be sufficient
state-space to distinguish whether to move from a mismatched 4-bit
pattern to a 3- or 5-bit pattern. Whereas, at least with 4-bit, any
mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
One needs to be able to express decay both to shorter patterns and to
longer patterns, and I suspect at this point, the pattern breaks down
(but I can't easily confirm; it is either this or the pattern extends
indefinitely, I don't know...).
Could almost have this sort of thing as a "brain teaser" puzzle or something...
Then again, maybe other people would not find any particular difficulty
in these sorts of tasks.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program in 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this feature?
<https://riptutorial.com/fortran/example/11872/assigned-goto> says:
|It can be avoided in modern code by using procedures, internal
|procedures, procedure pointers and other features.
I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
and "indirect" use labels-as-values, whereas "switch", "call" and
"repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
these others, sometimes by a lot.
I also find it amusing that the backbone of modern software is
a static version of label variables -- we call them switch state-
ments.
I am not sure if it's "the" backbone. Fortran has (had?) a feature
called "computed goto" that's closer to C's switch than "assigned
goto". Ironically, the gcc people usually call their labels-as-values feature "computed goto" rather than "labels as values" or "assigned
goto".
But you can be sure COBOL got them from assembly language programmers.
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
- anton
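For reference, the labels-as-values dispatch looks like this in GNU C
(a toy sketch of mine, not Anton's code; each handler ends in its own
indirect branch, which is what the history-based predictors exploit):

    #include <stdio.h>

    static int run(void)
    {
        /* Each "instruction" is the address of its handler; every
           handler ends with its own indirect goto (one dispatch per op). */
        static void *code[] = { &&inc, &&inc, &&dbl, &&halt };
        void **ip = code;
        int acc = 0;

        goto **ip++;
    inc:  acc += 1; goto **ip++;
    dbl:  acc *= 2; goto **ip++;
    halt: return acc;
    }

    int main(void)
    {
        printf("%d\n", run());   /* (0+1+1)*2 = 4 */
        return 0;
    }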
For the Intel binary-mantissa dfp128, normalization is the hard issue.
Michael S has figured out some really nice tricks to speed it up,
but when you have a (worst-case) temporary 220+ bit product mantissa,
scaling is not that easy.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
I played around with the formulas from the POWER manual a bit,
using Berkeley abc for logic optimization, for the conversion
of the packed modulo 1000 to three BCD digits.
Without spending too much effort, I arrived at four gate delays
(INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
for speed, or five gate delays optimizing for space.
I strongly suspect that IBM is doing something similar :-)
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction. The
I first heard about this in 1982 from Burton Smith.
prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is
certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers-playing program in 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
Alter/Goto in COBOL and Assigned GOTO in Fortran?
Probably.
I find it somewhat amusing that modern languages moved away from
label variables and into method calls -- which if you look at it
from 5,000 feet/metres -- is just a more expensive "label".
I also find it amusing that the backbone of modern software is
a static version of label variables -- we call them switch statements.
But you can be sure COBOL got them from assembly language programmers.
On Tue, 4 Nov 2025 22:52:46 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
For the Intel binary mantissa dfp128 normalization is the hard issue,
Michael S have figured out some really nice tricks to speed it up,
I remember that I played with that, but don't remember what I did
exactly. I dimly recollect that the fastest solution was relatively straightforward. It tried to minimize the length of dependency
chains rather than the total number of multiplications.
An important point here is that I played on relatively old x86-64
hardware. My solution is not necessarily optimal for newer hardware.
The differences between old and new are two-fold, and they push the
optimal solution in different directions.
1. Increase in throughput of integer multiplier
2. Decrease in latency of integer division
The first factor suggests an even more intense push toward "eager"
solutions.
The second factor suggests, possibly, much simpler code, especially in the
common case of division by 1 to 27 decimal digits (5**27 < 2**64).
As they say: sometimes a division is just a division.
Thomas Koenig <tkoenig@netcologne.de> writes:
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it
<https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
That is the problem with deleted features - compiler writers have
to support them forever, and interaction with other features can
lead to problems.
So does gfortran support assigned goto, too? What problems in
interaction with other features do you see?
- anton
On Tue, 04 Nov 2025 22:51:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job
packing that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together
Are we talking about hardware or software?
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th
edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
gcc prevent that at compile time? If not, I would expect the semantics
to be Undefined Behavior, the usual cop-out when nothing useful can be
said.
(In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)
Niklas
Thomas Koenig <tkoenig@netcologne.de> writes:
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it
<https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
That is the problem with deleted features - compiler writers have
to support them forever, and interaction with other features can
lead to problems.
So does gfortran support assigned goto, too?
What problems in
interaction with other features do you see?
On 11/5/2025 9:26 AM, Niklas Holsti wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th
edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
(In an earlier discussion on this group, some years ago, I explained
how labels-as-values could be added to Ada, using the type system to
ensure safe and defined semantics. But I don't think such an extension
would be accepted for the Ada standard.)
My guess here:
It is an "oh crap" situation and program either immediately or (maybe
not as immediately) explodes...
Otherwise, it would need to function more like a longjmp, which would
mean that it would likely be painfully slow.
On 2025-11-03 2:03 p.m., MitchAlsup wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Likely. My 66000 also has RNO;
Round Nearest Random is defined but not yet available, and
Round Away from Zero is also defined and available.
Round nearest random?
How about round externally guided (RXG) by an
input signal?
For instance, the rounding could come from a feedback
filter of some sort.
On 11/4/2025 3:44 PM, Terje Mathisen wrote:
MitchAlsup wrote:
As you said: "Never bet against branch prediction".
Branch prediction is fun.
When I looked around online before, a lot of the material on branch
prediction described fairly large and convoluted schemes for the branch predictors.
But then, always at the end of it, they use 2-bit saturating counters:
weakly taken, weakly not-taken, strongly taken, strongly not taken.
But, in my fiddling, there was seemingly a simple but moderately
effective strategy:
Keep a local history of taken/not-taken;
XOR this with the low-order-bits of PC for the table index;
Use a 5/6-bit finite-state-machine or similar.
Can model repeating patterns up to ~ 4 bits.
Where, the idea was that the state machine is updated with the current
state and branch direction, giving the next state and next predicted
branch direction (for this state).
This could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a mystery why, for mainstream processors, more complex lookup schemes with 2-bit state were preferable
to a simpler lookup scheme with 5-bit state.
Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient
state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
One needs to be able to express decay both to shorter patterns and to
longer patterns, and I suspect at this point, the pattern breaks down
(but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).
On 2025-11-05 1:47 a.m., Robert Finch wrote:
I am now modifying Qupls2024 into Qupls2026 rather than starting a completely new ISA. The big difference is Qupls2024 uses 64-bit
instructions and Qupls2026 uses 48-bit instructions making the code 25%
more compact with no real loss of operations.
Qupls2024 also used 8-bit register specs. This was a bit of overkill and
not really needed. Register specs are reduced to 6 bits. Right away, that reduced most instructions by eight bits.
I decided I liked the dual operations that some instructions supported, which need a wide instruction format.
One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
word. I may maintain this for Qupls2026, but that means that a max
constant override of 48-bits would be supported. A 64-bit constant can
still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.
I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.
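For illustration, the arithmetic of such a two-instruction build (a load-immediate of the low 48 bits, then an add-immediate-with-shift of the top 16) is just the following; the function name is invented:

#include <stdint.h>

/* value = low 48 bits, plus the top 16 bits shifted into place */
static uint64_t build64(uint64_t low48, uint64_t high16)
{
    return low48 + (high16 << 48);
}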
On 11/4/2025 11:17 PM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this feature?
<https://riptutorial.com/fortran/example/11872/assigned-goto> says:
|It can be avoided in modern code by using procedures, internal
|procedures, procedure pointers and other features.
I know of no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
and "indirect" use labels-as-values, whereas "switch", "call" and
"repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform these others, sometimes by a lot.
I usually used call threading, because:
In my testing it was one of the faster options;
At least if excluding 32-bit x86,
which often has slow function calls.
Because pretty much every function needs a stack frame, ...
It is usable in standard C.
Often "while loop and switch()" was notably slower than using unrolled
lists of indirect function calls (usually with the main dispatch loop
based on "traces", which would call each of the opcode functions and
then return the next trace to be run).
Granted, "while loop and switch" is the more traditional way of writing
an interpreter.
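A minimal C sketch of that kind of call-threaded dispatch, assuming a trace is a function that calls its opcode functions in an unrolled sequence and returns the next trace; all names are invented:

#include <stddef.h>

typedef struct VMContext { long acc; } VMContext;
typedef struct VMTrace VMTrace;
typedef const VMTrace *(*TraceFn)(VMContext *);
struct VMTrace { TraceFn run; };

static void op_inc(VMContext *c) { c->acc++; }     /* one VM opcode */

static const VMTrace loop_trace;                   /* forward (tentative) */

static const VMTrace *run_loop(VMContext *c)
{
    op_inc(c); op_inc(c); op_inc(c);               /* unrolled opcode calls */
    return c->acc < 9 ? &loop_trace : NULL;        /* return the next trace */
}
static const VMTrace loop_trace = { run_loop };

static void dispatch(VMContext *c, const VMTrace *t)
{
    while (t)
        t = t->run(c);                             /* one indirect call per trace */
}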
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41 bits due to the 48-bit instruction word size. The larger constants would require more instruction words to be available to
be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction, another couple of opcodes are used to represent constant extensions.
On 2025-11-05 18:23, BGB wrote:
On 11/5/2025 9:26 AM, Niklas Holsti wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
(In an earlier discussion on this group, some years ago, I explained
how labels-as-values could be added to Ada, using the type system to
ensure safe and defined semantics. But I don't think such an extension
would be accepted for the Ada standard.)
My guess here:
It is an "oh crap" situation and program either immediately or (maybe
not as immediately) explodes...
Or silently produces wrong results.
Otherwise, it would need to function more like a longjmp, which would
mean that it would likely be painfully slow.
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
Niklas
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
On 2025-11-05 18:23, BGB wrote:
On 11/5/2025 9:26 AM, Niklas Holsti wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
(In an earlier discussion on this group, some years ago, I explained
how labels-as-values could be added to Ada, using the type system to
ensure safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)
My guess here:
It is an "oh crap" situation and program either immediately or (maybe
not as immediately) explodes...
Or silently produces wrong results.
Otherwise, it would need to function more like a longjmp, which would
mean that it would likely be painfully slow.
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
But YOU had to pass the jumpbuf out of the setjmp() scope.
Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-05 1:47 a.m., Robert Finch wrote:
I am now modifying Qupls2024 into Qupls2026 rather than starting a
completely new ISA. The big difference is Qupls2024 uses 64-bit
instructions and Qupls2026 uses 48-bit instructions making the code 25%
more compact with no real loss of operations.
Qupls2024 also used 8-bit register specs. This was a bit of overkill and
not really needed. Register specs are reduced to 6-bits. Right-away that
reduced most instructions eight bits.
4 register specifiers: check.
I decided I liked the dual operations that some instructions supported,
which need a wide instruction format.
With 48 bits, if you can get 2 instructions 50% of the time, you average
36 bits per operation and are only 12% bigger than a 32-bit ISA.
One gotcha is that 64-bit constant overrides need to be modified. For
Qupls2024 a 64-bit constant override could be specified using only a
single additional instruction word. This is not possible with 48-bit
instruction words. Qupls2024 only allowed a single additional constant
word. I may maintain this for Qupls2026, but that means that a max
constant override of 48-bits would be supported. A 64-bit constant can
still be built up in a register using the add-immediate with shift
instruction. It is ugly and takes about three instructions.
It was that sticky problem of constants that drove most of My 66000
ISA style--variable length and how to encode access to these constants
and routing thereof.
Motto: never execute any instructions fetching or building constants.
I could reduce the 64-bit constant build to two instructions by adding a
load-immediate instruction.
May I humbly suggest this is the wrong direction.
Robert Finch <robfi680@gmail.com> posted:
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to
be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction,
another couple of opcodes are used to represent constant extensions.
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range, instruction<8:5> specify the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
Here we have access to {5, 32, 64}-bit constants; 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
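Decoding that 4-bit field is a straight table lookup; a rough C sketch, with the field and type names invented:

enum operand_kind { REG_POS, REG_NEG, IMM5, IMM5_NEG, IMM32, IMM64 };
struct routing { enum operand_kind src1, src2; };

static struct routing decode_routing(unsigned inst)
{
    static const struct routing table[16] = {
        {REG_POS, REG_POS}, {REG_POS, REG_NEG},
        {REG_NEG, REG_POS}, {REG_NEG, REG_NEG},
        {REG_POS, IMM5},    {IMM5,    REG_POS},
        {REG_NEG, IMM5_NEG},{IMM5,    REG_NEG},
        {REG_POS, IMM32},   {IMM32,   REG_POS},
        {REG_NEG, IMM32},   {IMM32,   REG_NEG},
        {REG_POS, IMM64},   {IMM64,   REG_POS},
        {REG_NEG, IMM64},   {IMM64,   REG_NEG},
    };
    return table[(inst >> 5) & 0xF];    /* instruction<8:5> */
}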
Michael S <already5chosen@yahoo.com> posted:
On Tue, 04 Nov 2025 22:51:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job
packing that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together
Are we talking about hardware or software?
A SW solution based on how it would be done in HW.
On 11/4/2025 9:17 PM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it
<https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this
feature?
Because it could, and often did, make the code "unfollowable". That is,
you are reading the code, following it to try to figure out what it is
doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
and thus what value it was set to, and people/programmers just aren't
used to being able to follow code like that.
BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self modifying
code is just bad.
As did COBOL (there called GO TO DEPENDING ON), but those features didn't suffer
the problems of assigned/alter gotos.
On Wed, 5 Nov 2025 17:26:44 +0200
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the
6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics
and not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
Yes, UB sounds like the best answer.
On 2025-11-06 11:43, Michael S wrote:
On Wed, 5 Nov 2025 17:26:44 +0200
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the
6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went
away between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of
the) label to which the value refers", which is machine-level
semantics and not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out
when nothing useful can be said.
Yes, UB sounds like the best answer.
The point is that Ritchie was not satisfied with that answer, which
is why he removed labels-as-values from his version of C. I doubt
that Stallman had any better answer for gcc, but he did not care.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/4/2025 9:17 PM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it
<https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this
feature?
Because it could, and often did, make the code "unfollowable". That is,
you are reading the code, following it to try to figure out what it is
doing and come to an assigned/alter goto, and you don't know where to go
next. The value was set some place else in the code, who knows where,
and thus what value it was set to, and people/programmers just aren't
used to being able to follow code like that.
Take an example use: A VM interpreter. With labels-as-values it looks
like this:
void engine(char *source)
{
void *insts[] = {&&add, &&load, &&store, ...};
void **ip=compile_to_vm_code(source,insts);
goto *ip++;
add:
...
goto *ip++;
load:
...
goto *ip++;
store:
...
goto *ip++;
...
}
So of course you don't know where one of the gotos goes to, because
that depends on the VM code, which depends on the source code.
Now let's see how it looks with switch:
void engine(char *source)
{
typedef enum {add, load, store,...} inst;
inst *ip=compile_to_vm_code(source);
for (;;) {
switch (*ip++) {
case add:
...
break;
case load:
...
break;
case store:
...
break;
...
}
}
}
Do you know any better which of the "..." is executed next? Of course
not, for the same reason. Likewise for call threading, but there the
VM instruction implementations can be distributed across many source
files. With the replicated switch, the problem of predictability is
the same, but there is lots of extra code, with many direct gotos.
If you implement, say, a state machine using labels-as-values, or
switch, again, the logic behind it is the same and the predictability
is the same between the two implementations.
BTW, you mentioned that it could be implemented as an indirect jump. It
could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self modifying
code is just bad.
On such architectures switch would also be implemented by modifying
the code,
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
One interesting aspect here is that the Fortran assigned goto and GNU
C's goto * (to go with labels-as-values) look more like something that
may have been inspired by a modern indirect branch than by
self-modifying code.
I only dimly remember the Cobol thing, but IIRC
this looked more like something that was intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.
As did COBOL (there called GO TO DEPENDING ON), but those features didn't suffer
the problems of assigned/alter gotos.
As demonstrated above, they do.
And if you fall back to using ifs, it--
does not get any better, either.
- anton
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
I played around with the formulas from the POWER manual a bit,
using Berkeley abc for logic optimization, for the conversion
of the packed modulo 1000 to three BCD digits.
Without spending too much effort, I arrived at four gate delays
(INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
for speed, or five gate delays optimizing for space.
Since the gates hang off flip-flops, you don't need the inv gate
at the front. Flip-flops can easily give both true and complement
outputs.
On 2025-11-05 7:17, Anton Ertl wrote:
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
gcc prevent that at compile time? If not, I would expect the semantics
to be Undefined Behavior, the usual cop-out when nothing useful can be said.
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
On 2025-11-05 23:28, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
But YOU had to pass the jumpbuf out of the setjump() scope.
Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.
That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
The discussion above shows that whether a label value is implemented as
a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
cases can be excluded at compile time.
The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.
Niklas
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
So does gfortran support assigned goto, too?
Yes.
What problems in
interaction with other features do you see?
In this case, it is more the problem of modern architectures.
On 32-bit architectures, it might have been possible to stash
the address of a jump target in an actual INTEGER variable and
GO TO there. On a 64-bit architecture, this is not possible, so
you need to have a shadow variable for the pointer.
On 2025-11-05 4:21 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction,
another couple of opcodes are used to represent constant extensions.
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range, instruction<8:5> specify the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
What happens if one tries to use an unsupported combination?
Here we have access to {5, 32, 64}-bit constants, 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
I just realized that Qupls2026 does not accommodate small constants very well, except for a few instructions, like shifts and bitfield ops, which have special formats. Sure, constants can be made to override
register specs, but they take up a whole additional word. I am not sure
how big a deal this is as there are also immediate forms of instructions with the constant encoded in the instruction, but these do not allow
operand routing. There is a dedicated subtract from immediate
instruction. A lot of other instructions are commutative, so operand
routing is not needed.
Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
are available for shifts and bitfield ops. Leaving the 130-bit constants
out for now. They may be useful for 128-bit SIMD against constant operands.
The constant routing issue could maybe be fixed, as there are 30+ free opcodes still. But there need to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
for in the compiler. One may want to permute two registers and a constant,
or two constants and a register, and then three or four different sizes.
Qupls strives to be the low-cost processor.
On 11/5/2025 1:21 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction,
another couple of opcodes are used to represent constant extensions.
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range, instruction<8:5> specify the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
Here we have access to {5, 32, 64}-bit constants, 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
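For concreteness, the idea amounts to little more than the following; the ROM contents here are purely illustrative, and a real design would pick them from constant-frequency statistics such as the ones Thomas Koenig posts later in the thread:

static const double fp_const_rom[32] = {
    0.0, 1.0, -1.0, 0.5, 2.0, -2.0, 10.0, 0.25,
    3.141592653589793,     /* pi      */
    6.283185307179586,     /* 2*pi    */
    1.4142135623730951,    /* sqrt(2) */
    0.6931471805599453,    /* ln 2    */
    2.718281828459045,     /* e       */
    /* remaining entries left zero in this sketch */
};

static double fp_imm5(unsigned imm5)
{
    return fp_const_rom[imm5 & 31];    /* 5-bit immediate -> constant */
}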
On Wed, 05 Nov 2025 21:06:16 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Tue, 04 Nov 2025 22:51:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job
packing that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together
Are we talking about hardware or software?
A SW solution based on how it would be done in HW.
Then I suspect that you didn't understand Thomas Koenig's objection.
1. Format of interest is Decimal128. https://en.wikipedia.org/wiki/Decimal128_floating-point_format
2. According to my understanding, Thomas didn't suggest that *slow*
software implementation of DPD-encoded DFP, i.e. implementation that
only cares about correctness, is hard.
3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
software implementation, one comparable in speed (say, within a
factor of 1.5-2) to a competent implementation of the same DFP operations
in BID format, is not easy. If at all possible.
4. All said above assumes an absence of HW assists.
BTW, at least for multiplication, I would probably not do my
arithmetic in the BCD domain.
Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
additions).
Then I'd do multiplication and normalization and rounding in Base_1e18.
Then I'd convert from Base_1e18 to Base_1000. The ideas of such
conversion are similar to the fast binary-to-BCD conversion that I
demonstrated here a decade or so ago. AVX2 could be quite helpful at that
stage.
Then I'd have to convert the result from Base_1000 to DPD. Here, again,
11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
Maybe, at that stage, SIMD gather can be of help, but I have my doubts.
So far, every time I tried gather I was disappointed with performance.
Overall, even with a seemingly decent plan like the one sketched above, I'd expect
DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
in the past my early performance estimates were wrong quite often.
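For concreteness, here is a minimal C sketch of the table-driven declet
conversion described above, mapping straight to binary 0..999 as the
Base_1000/Base_1e18 plan wants. The table names are hypothetical, and
their contents (which come from the DPD definition in IEEE 754-2008 and
the POWER ISA books) are omitted:

#include <stdint.h>

/* Hypothetical precomputed tables; contents per the DPD spec, omitted. */
extern const uint16_t dpd_to_bin[1024];  /* 10-bit declet -> 0..999 */
extern const uint16_t bin_to_dpd[1000];  /* 0..999 -> canonical declet */

/* Convert three consecutive declets (30 bits) to 0..999999999:
   isolate each 10-bit field, load the converted value, combine. */
static uint32_t declets3_to_bin(uint32_t bits30)
{
    uint32_t d0 = dpd_to_bin[ bits30        & 0x3FF];
    uint32_t d1 = dpd_to_bin[(bits30 >> 10) & 0x3FF];
    uint32_t d2 = dpd_to_bin[(bits30 >> 20) & 0x3FF];
    return d0 + 1000u * d1 + 1000000u * d2;
}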
Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
Some time ago, we discussed using the 5 bit immediates in floating point
instructions as an index to an internal ROM with frequently used
constants. The idea is that it would save some space in the instruction
stream. Are you implementing that, and if not, why not?
I did some statistics on which floating point constants occurred how
often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
constants you are not likely to find often in a random sample of
other packages :-) Perl has very little floating point. gnuplot
is also special in its own way, of course.
A few constants occur quite often, but there are a lot of
differences between the floating point constants for different
programs, to nobody's surprise (presumably).
Here is the head of an output of a little script I wrote to count
all floating-point constants from My66000 assembler. Note that
the compiler is for the version that does not yet do 0.5 etc as
floating point. The first number is the number of occurrences,
the second one is the constant itself.
5-bit constants: 886
32-bit constants: 566
64-bit constants: 597
303 0
290 1
96 0.5
81 6
58 -1
58 1e-14
49 2
46 -2
45 -8.98846567431158e+307
44 10
44 255
37 8.98846567431158e+307
29 -0.5
28 3
27 90
27 360
26 -1e-05
21 0.0174532925199433
20 0.9
18 -3
17 180
17 0.1
17 0.01
[...]
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
So does gfortran support assigned goto, too?
Yes.
Cool.
What problems in
interaction with other features do you see?
In this case, it is more the problem of modern architectures.
On 32-bit architectures, it might have been possible to stash
the address of a jump target in an actual INTEGER variable and
GO TO there. On a 64-bit architecture, this is not possible, so
you need to have a shadow variable for the pointer
Implementation options that come to my mind are:
1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
variable is sufficient. AFAIK on some 64-bit architectures the
default memory model puts the code in the bottom 4GB or 2GB.
2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
variable. 32 bits should be enough for that.
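A minimal GNU C sketch of option 2, using the documented labels-as-values
extension and label-difference arithmetic (the function and label names
here are made up):

#include <stdint.h>

int dispatch(unsigned i)
{
    /* offsets from a base label fit in 32 bits even on a 64-bit machine */
    static const int32_t offs[] = { &&l1 - &&base, &&l2 - &&base };
    goto *(&&base + offs[i & 1]);
base:
l1: return 1;
l2: return 2;
}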
Of course, if Fortran
assigns labels between shared libraries and the main program, that
approach probably does not work, but does anybody really do that?
How does ifort deal with this problem?
- anton
EricP <ThatWouldBeTelling@thevillage.com> writes:
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
Does the assigned goto support that? What about regular goto and
computed goto?
- anton
That is not the issue. The question is if the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
So, label-variables are hard to define, but function-variables are not ?!?
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
Does the assigned goto support that? What about regular goto and
computed goto?
- anton
I didn't mean to imply that it did.
As far as I remember, Fortran 77 does not allow it.
I never used later Fortrans.
I hadn't given the dynamic branch topic any thought until you raised it
and this was just me working through the things a compiler might have
to deal with.
I have written jump dispatch table code myself where the destinations
came from symbols external to the routine, but I had to switch to
inline assembler for this as MS C does not support goto variables,
and it was up to me to make sure the registers were all handled correctly.
That is not the issue. The question is if the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
The discussion above shows that whether a label value is implemented as
a bare code address, or as a jumpbuf, some cases will have Undefined
Behavior semantics. So I think Ritchie was right, unless the undefined
cases can be excluded at compile time.
The undefined cases could be excluded at compile-time, even in C, by
requiring all label-valued variables to be local to some function and
forbidding passing such values as parameters or function results. In
addition, the use of an uninitialized label-valued variable should be
prevented or detected.
EricP <ThatWouldBeTelling@thevillage.com> posted:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
Does the assigned goto support that? What about regular goto and
computed goto?
- anton
I didn't mean to imply that it did.
As far as I remember, Fortran 77 does not allow it.
I never used later Fortrans.
I hadn't given the dynamic branch topic any thought until you raised it
and this was just me working through the things a compiler might have
to deal with.
I have written jump dispatch table code myself where the destinations
came from symbols external to the routine, but I had to switch to
inline assembler for this as MS C does not support goto variables,
Oh sure it does--it is called Return-Oriented-Programming.
You take the return address off the stack and insert your
go-to label on the stack and then just return.
Or you could do some "foul play" on a jumpbuf and longjump.
{{Be careful not to shoot yourself in the foot.}}
and it was up to me to make sure the registers were all handled correctly.
After 4 years of looking, we are still waiting for a single function
that needs more than a scaled 16-bit displacement from current IP
{±17-bits} to reach all labels within the function.
On 2025-11-06 11:43, Michael S wrote:...
On Wed, 5 Nov 2025 17:26:44 +0200
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
Why does standard C not have it? C had it up to and including the
6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Yes, UB sounds like the best answer.
The point is that Ritchie was not satisfied with that answer, which is
why he removed labels-as-values from his version of C.
On 2025-11-06 10:46, Anton Ertl wrote:
[Fortran's assigned goto]
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Because it could, and often did, make the code "unfollowable". That is,
you are reading the code, following it to try to figure out what it is
doing and come to an assigned/alter goto, and you don't know where to go
next. The value was set some place else in the code, who knows where,
and thus what value it was set to, and people/programmers just aren't
used to being able to follow code like that.
Take an example use: A VM interpreter. With labels-as-values it looks
like this:
void engine(char *source)
{
void *insts[] = {&&add, &&load, &&ip, ...};
void **ip=compile_to_vm_code(source,insts);
goto *ip++;
add:
...
goto *ip++;
load:
...
goto *ip++;
store:
...
goto *ip++;
...
}
So of course you don't know where one of the gotos goes to, because
that depends on the VM code, which depends on the source code.
I'm not sure if you are trolling or serious, but I will assume the latter.
The point is that without a deep analysis of the program you cannot be
sure that these goto's actually go to one of the labels in the engine()
function, and not to some other location in the code, perhaps in some
other function. That analysis would have to discover that the
compile_to_vm_code() function returns a pointer to a vector of addresses
picked from the insts[] vector. That could need an analysis of many
functions called from compile_to_vm_code(), the history of the whole
program execution, and so on. NOT easy.
On 11/6/2025 12:46 AM, Anton Ertl wrote:
If you implement, say, a state machine using labels-as-values, or
switch, again, the logic behind it is the same and the predictability
is the same between the two implementations.
Nick responded better than I could to this argument, demonstrating how
it isn't true. As I said, in the hands of a good programmer, you might
assume that the goto goes to one of those labels, but you can't be sure
of it.
BTW, you mentioned that it could be implemented as an indirect jump. It
could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self modifying
code is just bad.
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone
architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
And, by and large, they have.
One interesting aspect here is that the Fortran assigned goto and GNU
C's goto * (to go with labels-as-values) look more like something that
may have been inspired by a modern indirect branch than by
self-modifying code.
Well, the Fortran feature was designed in what, the late 1950s? Back
then, self modifying code wasn't considered as bad as it now is.
An extra feature: When using GOTO with a variable, you can also supply a
list of labels that it may jump to; if the jump target is not
in the list, the GOTO is illegal.
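Something similar can be sketched in GNU C with labels-as-values
(hypothetical code): validate the stored target against the permitted
list before jumping, which is roughly the check the Fortran label list
makes possible:

#include <stdlib.h>

int f(unsigned sel)
{
    static void *const allowed[] = { &&l10, &&l20, &&l30 };
    void *target = allowed[sel % 3];   /* the ASSIGN-like step */

    for (int i = 0; i < 3; i++)        /* the label-list check */
        if (target == allowed[i])
            goto *target;
    abort();                           /* target not in the list: illegal */

l10: return 10;
l20: return 20;
l30: return 30;
}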
In languages with nested scopes, label gotos
can jump to an outer scope so they have to unwind some frames. Back when
people used such things, a common use was on an error to jump out to some
recovery code.
Function pointers have a sort of similar problem in that they need to carry
along pointers to all of the enclosing frames the function can see. That is
reasonably well solved by displays, give or take the infamous Knuth man or boy
program, 13 lines of Algol60 horror for which Knuth himself got the results wrong.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/6/2025 12:46 AM, Anton Ertl wrote:
If you implement, say, a state machine using labels-as-values, or
switch, again, the logic behind it is the same and the predictability
is the same between the two implementations.
Nick responded better than I could to this argument, demonstrating how
it isn't true. As I said, in the hands of a good programmer, you might
assume that the goto goes to one of those labels, but you can't be sure
of it.
In <1762311070-5857@newsgrouper.org> you
mentioned method calls as
'just a more expensive "label"', there you know that the method call
calls one of the implementations of the method with the name, like
with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
large number of switch targets is good enough for you, whereas Niklas Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?
BTW, you mentioned that it could be implemented as an indirect jump. It
could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self modifying
code is just bad.
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
What does "index into a jump table" mean in one of those architectures
that did not have indirect jumps and used self-modifying code instead?
I bet that it ends up in self-modifying code, too, because these architectures usually don't have indirect jumps through jump tables,
either.
If they had, the easy way to implement indirect branches
without self-modifying code would be to have a one-entry jump table,
store the target in that entry, and then perform an indirect jump
through that jump table.
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone
architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
And, by and large, they have.
We have gotten rid of indirect calls, e.g., in higher-order functions
in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs?
Thinking about the things that self-modifying code has been used for
on some architecture, IIRC that also includes array indexing. So have
we gotten rid of array indexing in programming languages?
One interesting aspect here is that the Fortran assigned goto and GNU
C's goto * (to go with labels-as-values) look more like something that
may have been inspired by a modern indirect branch than by
self-modifying code.
Well, the Fortran feature was designed in what, the late 1950s? Back
then, self modifying code wasn't considered as bad as it now is.
Did you read what you are replying to?
Does the IBM 704 (for which FORTRAN has been designed originally)
support indirect branches, or was it necessary to implement the
assigned goto (and computed goto) with self-modifying code on that architecture?
On 11/6/2025 11:38 AM, Thomas Koenig wrote:
Here is the head of an output of a little script I wrote to count
all floating-point constants from My66000 assembler. Note that
the compiler is for the version that does not yet do 0.5 etc as
floating point. The first number is the number of occurrences,
the second one is the constant itself.
5-bit constants: 886
32-bit constants: 566
64-bit constants: 597
303 0
290 1
96 0.5
81 6
58 -1
58 1e-14
49 2
46 -2
45 -8.98846567431158e+307
44 10
44 255
37 8.98846567431158e+307
29 -0.5
28 3
27 90
27 360
26 -1e-05
21 0.0174532925199433
20 0.9
18 -3
17 180
17 0.1
17 0.01
[...]
Interesting! No values related to pi? And what are the ...e+307 used for?
On 11/7/2025 2:09 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/6/2025 12:46 AM, Anton Ertl wrote:
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
What does "index into a jump table" mean in one of those architectures
that did not have indirect jumps and used self-modifying code instead?
For example, the following Fortran code
goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc
would be compiled to something like (add any required "bounds checking"
for I)
load R1,I
Jump $,R1
Jump 10
Jump 20
Jump 30
Jump 40
No code modification nor indirection required.
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone >>>> architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
And, by and large, they have.
We have gotten rid of indirect calls, e.g., in higher-order functions
in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs.
No, and I defer to you, or others here, on how these features are
implemented, specifically whether code modification is required. I was
referring to features such as assigned goto in Fortran, and Alter goto
in Cobol.
Thinking about the things that self-modifying code has been used for
on some architecture, IIRC that also includes array indexing. So have
we gotten rid of array indexing in programming languages?
Of course not. But I suspect that we have "gotten rid of" any
architecture that *requires* code modification for array indexing.
John Levine <johnl@taugh.com> writes:
In languages with nested scopes, label gotos
can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.
Pascal has that feature. Concerning error handling, jumping to an
error handler in a statically enclosing scope has fallen out of
favour, but throwing an exception to the next dynamically enclosing
exception handler is supported in a number of languages.
Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy
program, 13 lines of Algol60 horror that Knuth himself got the results wrong.
Displays and static link chains are among the techniques that can be
used to implement static scoping correctly, i.e., where the man-or-boy
test produces the correct result. Knuth initially got the result
wrong, because he only had boy compilers, and the computation is too
involved to do it by hand.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/7/2025 2:09 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/6/2025 12:46 AM, Anton Ertl wrote:
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
What does "index into a jump table" mean in one of those architectures
that did not have indirect jumps and used self-modifying code instead?
For example, the following Fortran code
goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc
would be compiled to something like (add any required "bounds checking"
for I)
load R1,I
Jump $,R1
Jump 10
Jump 20
Jump 30
Jump 40
Which architecture is that?
No code modification nor indirection required .
The "Jump $,R1" is an indirect jump.
With that the assigned goto can
be implemented as (for "GOTO X")
load R1,X
Jump 0,R1
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone >>>>> architectures using self-modifying code are bad by association, then >>>>> we have to get rid of all of these language features ASAP.
And, by and large, they have.
We have gotten rid of indirect calls, e.g., in higher-order functions
in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs.
No, and I defer to you, or others here, on how these features are
implemented, specifically whether code modification is required. I was
referring to features such as assigned goto in Fortran, and Alter goto
in Cobol.
On modern architectures higher-order functions are implemented with
indirect branches or indirect calls (depending on whether it's a
tail-call or not); likewise for method dispatch.
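As a reminder of how ordinary that is, a plain C higher-order function;
the calls through f are indirect calls, and no code is modified:

#include <stdio.h>

static int inc(int x) { return x + 1; }

/* applies f twice through a pointer: two indirect calls */
static int twice(int (*f)(int), int x) { return f(f(x)); }

int main(void)
{
    printf("%d\n", twice(inc, 3));   /* prints 5 */
    return 0;
}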
I do not know how Lisp, FORTRAN, Algol 60 and other early languages
with higher-order functions were implemented on architectures that do
not have indirect branches; but if the assigned goto was implemented
with self-modifying code, the call to a function in a variable was
probably implemented like that, too.
Thinking about the things that self-modifying code has been used for
on some architecture, IIRC that also includes array indexing. So have
we gotten rid of array indexing in programming languages?
Of course not. But I suspect that we have "gotten rid of" any
architecture that *requires* code modification for array indexing.
We have also gotten rid of any architecture that requires
self-modifying code for implementing the assigned goto.
On 11/6/2025 3:24 AM, Michael S wrote:
[...]
Overall, even with a seemingly decent plan like the one sketched above, I'd expect
DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
in the past my early performance estimates were wrong quite often.
I decided to start working on a mockup (quickly thrown together).
I don't expect to have much use for it, but meh.
It works by packing/unpacking the values into an internal format along vaguely similar lines to the .NET format, just bigger to accommodate
more digits:
4x 32-bit values each holding 9 digits
Except the top one generally holding 7 digits.
16-bit exponent, sign byte.
Then wrote a few pack/unpack scenarios:
X30: Directly packing 20/30 bit chunks, non-standard;
DPD: Use the DPD format;
BID: Use the BID format.
For the pack/unpack step (taken in isolation):
X30 is around 10x faster than either DPD or BID;
Both DPD and BID need a similar amount of time.
BID needs a bunch of 128-bit arithmetic handlers.
DPD needs a bunch of merge/split and table lookups.
Seems to mostly balance out in this case.
For DPD, merge is effectively:
Do the table lookups;
v=v0+(v1*1000)+(v2*1000000);
With a split step like:
v0=v;
v1=v/1000;
v0-=v1*1000;
v2=v1/1000;
v1-=v2*1000;
Then, use table lookups to go back to DPD.
Did look into possible faster ways of doing the splitting, but have
not yet found a faster way that gives correct results
(where one can assume the compiler already knows how to turn divide by
constant into multiply by reciprocal).
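For reference, a sketch of that split with the division by 1000 written
out as the multiply-by-reciprocal a compiler would generate; 274877907 =
ceil(2^38/1000), and the floor comes out exact for every 32-bit input:

#include <stdint.h>

static inline uint32_t div1000(uint32_t v)
{
    /* v/1000 via multiply-by-reciprocal; exact for all 32-bit v */
    return (uint32_t)(((uint64_t)v * 274877907u) >> 38);
}

/* split 0..999999999 into three base-1000 digits, as in the text */
static void split1000(uint32_t v, uint32_t d[3])
{
    uint32_t q1 = div1000(v);    /* v / 1000 */
    uint32_t q2 = div1000(q1);   /* v / 1000000 */
    d[0] = v  - q1 * 1000u;
    d[1] = q1 - q2 * 1000u;
    d[2] = q2;
}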
At first it seemed like a strong reason to favor X30 over either DPD or
BID. Except that the cost of the ADD and MUL operations effectively
dwarfs that of the pack/unpack operations, so the relative cost
difference between X30 and DPD may not matter much.
As is, it seems MUL and ADD cost roughly 6x more than the
DPD pack/unpack steps.
So, it seems, while DPD pack/unpack isn't free, it is not something that would lead to X30 being a decisive win either in terms of performance.
It might make more sense, if supporting BID, to just do it as its own
thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
case ends up needing a lot more clutter, mostly again because C lacks
native support for 128-bit arithmetic.
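Most of that 128-bit clutter is one widening-multiply pattern. With the
GCC/Clang unsigned __int128 extension, the two-limb Base_1e18 product
Michael S sketched earlier is short (names hypothetical):

#include <stdint.h>

typedef unsigned __int128 u128;

#define B 1000000000000000000ull   /* 1e18 */

/* (hi1*B + lo1) * (hi2*B + lo2), all limbs < 1e18; exact 4-limb result */
static void mul_1e18(uint64_t hi1, uint64_t lo1,
                     uint64_t hi2, uint64_t lo2, uint64_t out[4])
{
    u128 p0 = (u128)lo1 * lo2;                    /* limb position 0 */
    u128 p1 = (u128)hi1 * lo2 + (u128)lo1 * hi2;  /* limb position 1 */
    u128 p2 = (u128)hi1 * hi2;                    /* limb position 2 */

    p1 += (uint64_t)(p0 / B);  out[0] = (uint64_t)(p0 % B);
    p2 += (uint64_t)(p1 / B);  out[1] = (uint64_t)(p1 % B);
    out[2] = (uint64_t)(p2 % B);
    out[3] = (uint64_t)(p2 / B);
}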
If working based on digit chunks, likely better to stick with DPD due to less clutter, etc. Though, this part would be less bad if C had had widespread support for 128-bit integers.
Though, in this case, the ADD and MUL operations currently work by internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.
Though, still not complete nor confirmed to produce correct results.
But, yeah, might be more worthwhile to look into digit chunking:
12x 3 digits (16b chunk)
4x 9 digits (32b chunk)
2x 18 digits (64b chunk)
3x 12 digits (64b chunk)
Likely I think:
3 digits, likely slower because of needing significantly more operations;
9 digits, seemed sensible, option I went with, internal operations fully
fit within the limits of 64 bit arithmetic;
18 digits, possible, but runs into many cases internally that would
require using 128-bit arithmetic.
12 digits, fits more easily into 64-bit arithmetic, but would still sometimes exceed it; and isn't that much more than 9 digits (but would reduce the number of chunks needed from 4 to 3).
While 18 digits conceptually needs fewer abstract operations than 9
digits, it would suffer the drawback of many of these operations being notably slower.
However, if running on RV64G with the standard ABI, it is likely the
9-digit case would also take a performance hit due to sign-extended
unsigned int (and needing to spend 2 shifts whenever zero-extending a
value).
With 3x 12 digits, while not exactly the densest scheme, there is a little
more "working space", which would reduce cases that exceed the limits of
64-bit arithmetic. Well, except multiply, where 24 > 18 ...
The main merit of 9 digit chunking here being that it fully stays within
the limits of 64-bit arithmetic (where multiply temporarily widens to working with 18 digits, but then narrows back to 9 digit chunks).
Also 9 digit chunking may be preferable when one has a faster 32*32=>64
bit multiplier, but 64*64=>128 is slower.
One other possibility could be to use BCD rather than chunking, but I
expect BCD emulation to be painfully slow in the absence of ISA level helpers.
DIV uses Newton-Raphson.
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less accurate.
BGB <cr88192@gmail.com> posted:
--------------snip---------------
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less
accurate.
Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
an 8-bit accurate first iteration result.
Other than DFP not being normalized, once you find the HoD, you should
be able to use something like a 10-bit in 13-bit out table to get the
first 2 decimal digits correct, and N-R from there.
That 10-bits in could be the packed DFP representation (its denser and
has smaller tables). This way, table lookup overlaps unpacking.
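For reference, the refinement both posts lean on is the standard
Newton-Raphson reciprocal iteration; each pass roughly doubles the
number of correct bits, so an 8-bit-accurate table seed reaches double
precision in about three steps. A toy sketch (a real FDIV works on the
significand with guard bits):

/* x' = x*(2 - d*x): quadratic convergence toward 1/d, provided the
   seed is close enough that |1 - d*seed| < 1 */
static double nr_recip(double d, double seed)
{
    double x = seed;              /* e.g., from a small lookup table */
    for (int i = 0; i < 3; i++)
        x = x * (2.0 - d * x);
    return x;                     /* approximates 1/d */
}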
On 11/6/2025 1:11 PM, BGB wrote:
[...]
I don't know yet if my implementation of DPD is actually correct.
Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.
Here is an example value:
2DFFCC1AEB53B3FB_B4E262D0DAB5E680
Which, in theory, should resemble PI.
Annoyingly, it seems like pretty much everyone else either went with
BID, or with other non-standard Decimal encodings.
Can't seem to find:
Any examples of hard-coded numbers in this format on the internet;
Any obvious way to generate them involving "stuff I already have".
As in, not going and using some proprietary IBM library or similar.
Also Grok wasn't much help here, just keeps trying to use Python's
"decimal", which, it quickly becomes obvious, is not using Decimal128
(much less DPD), but seemingly some other 256-bit format.
And, Grok fails to notice that what it is saying is nowhere close to
correct in this case.
Neither DeepSeek nor QWen being much help either... Both just sort of go down a rabbit hole, and eventually fall back to "Here is how you might
go about trying to decode this format...".
Not helpful; I would just want some way to confirm whether or not I
got the format correct.
Which is easier if one has some example numbers or something that they
can decode and verify the value, or something that is able to decode
these numbers (which isn't just trying to stupidly shove it into
Python's Decimal class...).
Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
and Boost C++, but in these cases, less helpful because they went with BID.
...
Checking, after things are a little more complete, in MHz (millions of
operations per second), on my desktop PC:
DPD Pack/Unpack: 63.7 MHz (58 cycles)
X30 Pack/Unpack: 567 MHz ( 7 cycles) ?...
FMUL (unwrap) : 21.0 MHz (176 cycles)
FADD (unwrap) : 11.9 MHz (311 cycles)
FDIV : 0.4 MHz (very slow; Newton Raphson)
FMUL (DPD) : 11.2 MHz (330 cycles)
FADD (DPD) : 8.6 MHz (430 cycles)
FMUL (X30) : 12.4 MHz (298 cycles)
FADD (X30) : 9.8 MHz (378 cycles)
The relative performance impact of the wrap/unwrap step is somewhat
larger than expected (vs the unwrapped case).
Though, there seems to only be a small difference here between DPD and
X30 (so, likely whatever is affecting performance here is not directly
related to the cost of the pack/unpack process).
The wrapped cases basically just add a wrapper function that unpacks the input values to the internal format, and then re-packs the result.
For using the wrapped functions to estimate pack/unpack cost:
DPD cost: 51 cycles.
X30 cost: 41 cycles.
Not really a good way to make X30 much faster. It does pay for the cost
of dealing with the combination field.
Not sure why they would be so close:
DPD case does a whole lot of stuff;
X30 case is mostly some shifts and similar.
Though, in this case, it does use these functions by passing/returning structs by value. It is possible a by-reference design might be faster
in this case.
This could possibly be cheapened slightly by going to, say:
S.E13.M114
In effect trading off some exponent range for cheaper handling of the exponent.
Can note:
MUL and ADD use double-width internal mantissa, so should be accurate;
Current test doesn't implement rounding modes though, could do so.
Currently hard-wired at Round-Nearest-Even.
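The round-nearest-even decision itself is small; a hedged sketch of the
narrowing step, dropping k >= 1 decimal digits (names hypothetical):

#include <stdint.h>

/* divide by 10^k, rounding the remainder to nearest, ties to even */
static uint64_t rne_drop(uint64_t value, uint64_t pow10k)
{
    uint64_t q    = value / pow10k;   /* kept digits */
    uint64_t rem  = value % pow10k;
    uint64_t half = pow10k / 2;       /* exact, since 10^k is even */
    if (rem > half || (rem == half && (q & 1)))
        q++;                          /* round up; ties go to even */
    return q;
}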
DIV uses Newton-Raphson.
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less accurate.
So, it first uses a loop with hard-coded checks and scales to get it in
the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
Precondition step is usually simpler with Binary-FP as the initial guess
is usually within the correct range. So, one can use a single modified
N-R step (that undershoots) followed by letting N-R take over.
More of an issue though when the initial guess is "maybe within a factor
of 10" because the usual reciprocal-approximation strategy used for Binary-FP isn't quite as effective.
...
Still don't have a use-case, mostly just messing around with this...
Here is an example value:
2DFFCC1AEB53B3FB_B4E262D0DAB5E680
I multiplied PI by 10^31 and ran it through the int to decimal-float converter. It should give the same sequence of digits although the
exponent may be off.
2e078c2aeb53b3fbb4e262d0dab5e680
The sequence of digits is the same, except it begins C2 instead of C1.
The constant ROM[specifier] seems to be the easiest way of taking
5-bits and converting it into a FP number. It was only a few weeks
ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
this covers <slightly> more fp constant uses.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
void engine(char *source)
{
void *insts[] = {&&add, &&load, &&ip, ...};
void **ip=compile_to_vm_code(source,insts);
goto *ip++;
add:
...
goto *ip++;
One problem with assigned GOTO is data flow analysis for a compiler.
Compilers typically break down structured control flow into GOTO
and then perform analysis. A label whose address is assigned
anywhere in the program unit to a variable must be considered to
be reachable by any GOTO to said variable, so any variable in that
piece of code must be in a known place (i.e. memory). If it
is kept in a register in some places that could jump to that
particular label, the contents of that register must be stored
to memory before the jump is executed. Alternatively, memory
allocation must make sure that the same register is always used.
This was probably less of a problem when assigned goto was invented
(I assume this was for FORTRAN 66)
when few variables were kept in
registers, and register allocation was in its infancy. Now, this is
a much bigger impediment to optimization.
In other words, assigned goto confuses both programmers and
compilers.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
The constant ROM[specifier] seems to be the easiest way of taking
5-bits and converting it into a FP number. It was only a few weeks
ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
this covers <slightly> more fp constant uses.
These days, I would assume that software would choose between a
ROM and random logic with a specification. I gave this a spin,
again using espresso, followed by Berkeley ABC.
5-bit FP constants in My 66000 are effectively sign + magnitude,
which makes the logic quite simple; the sign can be just passed
through. The equations (e7 down to e0 are exponent bits, m22 down
to m0 are mantissa bits) for converting are
e7 = (i4) | (i3) | (i2);
e6 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
e5 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
e4 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
e3 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
e2 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
e1 = (!i3&!i2&i1) | (!i3&!i2&i0) | (i4);
e0 = (!i4&!i2&i1) | (!i4&i3);
m22 = (!i4&!i3&i1&i0) | (!i4&i2&i1) | (i4&i3) | (i3&i2);
m21 = (!i4&i3&i1) | (i4&i2) | (!i3&i2&i0);
m20 = (!i4&i3&i0) | (i4&i1);
m19 = (i4&i0);
Sign is separate and not shown, all other mantissa bits are
always zero. ABC, optimizing for area, turns into (in BLIF format,
which is halfway readable)
.model i2f
.inputs i4 i3 i2 i1 i0
.outputs e7 e6 e5 e4 e3 e2 e1 e0 m22 m21 m20 m19
.gate NOR2_X1 A1=i4 A2=i2 ZN=new_n18
.gate INV_X1 A=i3 ZN=new_n19
.gate NAND2_X1 A1=new_n18 A2=new_n19 ZN=e7
.gate INV_X1 A=i1 ZN=new_n21
.gate INV_X1 A=i0 ZN=new_n22
.gate AOI21_X1 A=e7 B1=new_n21 B2=new_n22 ZN=e6
.gate BUF_X1 A=e6 Z=e5
.gate BUF_X1 A=e6 Z=e4
.gate BUF_X1 A=e6 Z=e3
.gate BUF_X1 A=e6 Z=e2
.gate OR2_X1 A1=e6 A2=i4 ZN=e1
.gate INV_X1 A=i4 ZN=new_n29
.gate NAND2_X1 A1=new_n29 A2=i3 ZN=new_n30
.gate INV_X1 A=new_n18 ZN=new_n31
.gate OAI21_X1 A=new_n30 B1=new_n31 B2=new_n21 ZN=e0
.gate AOI21_X1 A=i2 B1=new_n19 B2=i0 ZN=new_n33
.gate NAND2_X1 A1=new_n29 A2=i1 ZN=new_n34
.gate OAI22_X1 A1=new_n33 A2=new_n34 B1=new_n19 B2=new_n18 ZN=m22
.gate AOI21_X1 A=i4 B1=new_n19 B2=i0 ZN=new_n36
.gate INV_X1 A=i2 ZN=new_n37
.gate OAI22_X1 A1=new_n36 A2=new_n37 B1=new_n30 B2=new_n21 ZN=m21
.gate OAI22_X1 A1=new_n30 A2=new_n22 B1=new_n29 B2=new_n21 ZN=m20
.gate NOR2_X1 A1=new_n29 A2=new_n22 ZN=m19
.end
The inverter gates on the input bits are not needed when they come
from flip-flops, and I am also not sure the buffers are needed.
If both are taken out, 14 gates are left, which is not a lot
(I assume that this is smaller than a small ROM, but I don't know).
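The equations are also easy to sanity-check in software; a direct C
transcription, packing the bits as the exponent and mantissa fields of
an IEEE single (which is what e7..e0 and m22..m0 suggest), sign handled
separately:

#include <stdint.h>

static uint32_t i2f_bits(unsigned i)
{
    unsigned i4 = (i >> 4) & 1, i3 = (i >> 3) & 1, i2 = (i >> 2) & 1,
             i1 = (i >> 1) & 1, i0 = i & 1;

    unsigned e7 = i4 | i3 | i2;
    unsigned e6 = (!i4 & !i3 & !i2 & i1) | (!i4 & !i3 & !i2 & i0);
    unsigned e5 = e6, e4 = e6, e3 = e6, e2 = e6;   /* identical terms */
    unsigned e1 = (!i3 & !i2 & i1) | (!i3 & !i2 & i0) | i4;
    unsigned e0 = (!i4 & !i2 & i1) | (!i4 & i3);

    unsigned m22 = (!i4 & !i3 & i1 & i0) | (!i4 & i2 & i1)
                 | (i4 & i3) | (i3 & i2);
    unsigned m21 = (!i4 & i3 & i1) | (i4 & i2) | (!i3 & i2 & i0);
    unsigned m20 = (!i4 & i3 & i0) | (i4 & i1);
    unsigned m19 = i4 & i0;

    uint32_t e = (uint32_t)((e7 << 7) | (e6 << 6) | (e5 << 5) | (e4 << 4)
                          | (e3 << 3) | (e2 << 2) | (e1 << 1) | e0);
    uint32_t m = ((uint32_t)m22 << 22) | ((uint32_t)m21 << 21)
               | ((uint32_t)m20 << 20) | ((uint32_t)m19 << 19);
    return (e << 23) | m;   /* all lower mantissa bits are zero */
}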
Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
No, and I defer to you, or others here, on how these features are
implemented, specifically whether code modification is required. I was
referring to features such as assigned goto in Fortran, and Alter goto
in Cobol.
On modern architectures higher-order functions are implemented with indirect branches or indirect calls (depending on whether it's a
tail-call or not); likewise for method dispatch.
I do not know how Lisp, FORTRAN, Algol 60 and other early languages
with higher-order functions were implemented on architectures that do
not have indirect branches; but if the assigned goto was implemented
with self-modifying code, the call to a function in a variable was
probably implemented like that, too.
What architecture cannot do an indirect branch, which I assume
means a branch/jump to a variable location in a register?
And how would the operating system on such a machine get programs running?
Even if an ISA did not have a JMP reg instruction one can create it
using CALL to copy the IP to the stack where you modify it and
RET to pop the new IP value.
EricP <ThatWouldBeTelling@thevillage.com> writes:
What architecture cannot do an indirect branch, which I assume
means a branch/jump to a variable location in a register?
Or, in case of the 6502, in memory.
I don't know of any architecture (except maybe some one-instruction proof-of-concepts) that does not have indirect branches in one form or another, but I am not that familiar with architectures from the 1950s
or some of the extremely deprived embedded-control processors.
Maybe the thing about self-modifying code was thrown in to taint the
assigned goto through guilt-by-association.
Even if an ISA did not have a JMP reg instruction one can create it
using CALL to copy the IP to the stack where you modify it and
RET to pop the new IP value.
In most cases that is possible (even if the return address is stored
in a register and not on the stack), but the return addresses might
live on a separate stack (IIRC the Intel 8008 or the 8080 has such a
stack), and the call might be the only thing that pushes on that
stack. But yes, in most cases, it's a good argument that even very
deprived processors usually have some form of indirect branching.
- anton
Thomas Koenig <tkoenig@netcologne.de> posted:
This was probably less of a problem when assigned goto was invented
(I assume this was for FORTRAN 66)
I think FORTRAN 66 inherited it from FORTRAN II or even FORTRAN (1);
it was available in WATFOR and WATFIV.
EricP <ThatWouldBeTelling@thevillage.com> posted:
Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
No, and I defer to you, or others here, on how these features
are implemented, specifically whether code modification is
required. I was referring to features such as assigned goto in
Fortran, and Alter goto in Cobol.
On modern architectures higher-order functions are implemented
with indirect branches or indirect calls (depending on whether
it's a tail-call or not); likewise for method dispatch.
I do not know how Lisp, FORTRAN, Algol 60 and other early
languages with higher-order functions were implemented on
architectures that do not have indirect branches; but if the
assigned goto was implemented with self-modifying code, the call
to a function in a variable was probably implemented like that,
too.
What architecture cannot do an indirect branch, which I assume
means a branch/jump to a variable location in a register?
PDP-8,
4004,
IBM 650,
... And any machine without "registers".
I don't know of any architecture (except maybe some one-instruction
proof-of-concepts) that does not have indirect branches in one form or
another, but I am not that familiar with architectures from the 1950s
or some of the extremely deprived embedded-control processors.
Maybe the thing about self-modifying code was thrown in to taint the
assigned goto through guilt-by-association.
stack. But yes, in most cases, it's a good argument that even very
deprived processors usually have some form of indirect branching.
I would imagine that in old times the return instruction was less common
than indirect addressing itself.
According to Michael S <already5chosen@yahoo.com>:
I would imagine that in old times the return instruction was less common
than indirect addressing itself.
On several of the machines I used a subroutine call stored the return
address in the first word of the routine and branched to that address+1.
The return was just an indirect jump.
Stacks? What's a stack? We barely had registers.
PDP-8, 4004, IBM 650, ... And any machine without "registers".
To be fair, addresses 10 through 17 in the PDP-8 were effectively
auto-increment registers and indirect branches were their
primary function. ....
On several of the machines I used a subroutine call stored the return
address in the first word of the routine and branched to that address+1.
The return was just an indirect jump.
Stacks? What's a stack? We barely had registers.
Yes, I saw the PDP-8 did that for JMS Jump Subroutine.
I've never used one but it looks like by playing with the
Indirect and Page-zero memory addressing options you could
treat page-zero a bit like a register bank,
but also store some short but critical routines in page-zero
to manually move the return PC to/from a stack.
And use indirect addressing to access its full sumptuous 4kW address space.
According to Michael S <already5chosen@yahoo.com>:
I would imagine that in old times a return instruction was less common
than indirect addressing itself.
On several of the machines I used a subroutine call stored the return
address in the first word of the routine and branched to that address+1.
The return was just an indirect jump.
Stacks? What's a stack? We barely had registers.
According to Scott Lurndal <slp53@pacbell.net>:
PDP-8, 4004, IBM 650, ... And any machine without "registers".
To be fair, addresses 10 through 17 in the PDP-8 were effectively
auto-increment registers and indirect branches were their
primary function. ....
I did a fair amount of PDP-8 programming and I don't ever recall using
the auto-index locations for branches. They were used to step
through a table of data, e.g. to add up a list of numbers:
10, 1007 ; list starts at 1010
100, -50 ; list is 50 (octal) words long
CLA
LOOP,
TAD I 10
ISZ 100
JMP LOOP
; sum is in the accumulator
I suppose you could use them for threaded code, but I didn't run into
any PDP-8 programs that used that.
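For illustration, here is roughly what that PDP-8 fragment computes,
written as C; the pre-increment mirrors the auto-index behavior (the
function name and types are mine, not from the original):

/* Sum n words, the way the PDP-8 loop above does it. */
int sum_list(const int *list, int n)
{
    int acc = 0;            /* CLA: clear the accumulator */
    int idx = -1;           /* loc 10 = 1007: one word before the list */
    int count = -n;         /* loc 100 = -50: negative element count */
    do {
        idx += 1;           /* auto-index location pre-increments */
        acc += list[idx];   /* TAD I 10: add indirect through loc 10 */
        count += 1;         /* ISZ 100: skips the JMP when it hits 0 */
    } while (count != 0);   /* JMP LOOP */
    return acc;
}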
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 11/6/2025 11:38 AM, Thomas Koenig wrote:
[...]
Here is the head of an output of a little script I wrote to count
all floating-point constants from My66000 assembler. Note that
the compiler is for the version that does not yet do 0.5 etc as
floating point. The first number is the number of occurrences,
the second one is the constant itself.
5-bit constants: 886
32-bit constants: 566
64-bit constants: 597
303 0
290 1
96 0.5
81 6
58 -1
58 1e-14
49 2
46 -2
45 -8.98846567431158e+307
44 10
44 255
37 8.98846567431158e+307
29 -0.5
28 3
27 90
27 360
26 -1e-05
21 0.0174532925199433
20 0.9
18 -3
17 180
17 0.1
17 0.01
[...]
Interesting! No values related to pi? And what are the ...e+307 used for?
If you look closely, you'll see pi/180 in that list. But pi is
also there (I cut it off the list), it occurs 11 times. And the
large numbers are +/- DBL_MAX*0.5, I don't know what they are
used for.
By comparison, here are the values which are most frequently
contained in GSL:
5-bit constants: 5148
32-bit constants: 3769
64-bit constants: 3140
2678 1
1518 0
687 -1
424 2
329 0.5
298 -2
291 2.22044604925031e-16
275 4.44089209850063e-16
273 3
132 -3
131 -0.5
131 3.14159265358979
88 4
86 1.34078079299426e+154
77 6
70 0.25
70 5
68 2.2250738585072e-308
66 10
64 -4
50 -6
46 0.1
45 5.87747175411144e-39
43 0.333333333333333
42 1e+50
38 6.28318530717959
35 9
31 0.2
30 7
30 -0.25
[...]
So, having values between -15.5 and +15.5 is a choice that will
cover quite a few floating point constants.
For different packages,
FP constant distributions probably vary too much to create something
that is much more useful.
BGB <cr88192@gmail.com> schrieb:
I don't know yet if my implementation of DPD is actually correct.
The POWER ISA has a pretty good description, see the OpenPower
foundation.
On 11/7/2025 9:29 AM, Thomas Koenig wrote:
----------snip-------------
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 11/6/2025 11:38 AM, Thomas Koenig wrote:
I think there is some gain in object code size to be had for things like this, but it is probably modest.
One related question, and it is really a compiler question. Say I am writing a program and I know I will need the value of pi say 10 times in
the source code. I decide to make my coding easier, and the source code more compact by creating a constant, called PI, with a value of
3.14159..., then write the word PI instead of the numerical constant 10 times in the source code. Will/should the compiler generate inline immediates for the ten references or will it generate a load of the
actual constant variable? Tradeoffs either way.
On 11/8/2025 5:28 AM, Thomas Koenig wrote:
BGB <cr88192@gmail.com> schrieb:
I don't know yet if my implementation of DPD is actually correct.
The POWER ISA has a pretty good description, see the OpenPower
foundation.
Luckily, I have since figured it out and confirmed it.
Otherwise, fiddled with the division algorithm some more, and it is now "slightly less awful", and converges a bit faster...
Relatedly, also added Square-Root...
My previous strategies for square-root didn't really work as effectively
in this case, so just sorta fiddled with stuff until I got something
that worked...
Algorithm I came up with (to find sqrt(S)):
Make an initial guess of the square root, calling it C;
Make an initial guess for the reciprocal of C, calling it H;
Take a few passes (threading the needle, *1):
C[n+1] = C + (S - C*C)*(H*0.375)
Redo approximate reciprocal of C, as H (*2);
Refine H: H=H*(2-C*H)
Enter main iteration pass:
C[n+1] = C + (S - C*C)*(H*0.5)
H[n+1] = H*(2 - C*H)  //(*3)
*1: Usual "try to keep stuff from flying off into space" step, using a
scale of 0.375 to undershoot convergence and increase stability (lower
means more stability but slower convergence; closer to 0.5 means faster,
but more likely to "fly off into space" depending on the accuracy of the initial guesses).
*2: Seemed better to start over from a slightly better guess of C, than
to directly iterate from the initial (much less accurate) guess.
*3: Noting that if H is also converged, the convergence rate for C is significantly improved (the gains from faster C convergence are enough
to offset the added cost of also converging H).
Seems to be effective, though still slower than divide (which is still
23x slower than an ADD or MUL).
In this case, the more complex algorithm is (ironically) partly
justified by the comparatively higher relative cost per operation (and
the issue that I can't resort to tricks like handling the
floating-point values as integers; that doesn't work so hot with
Decimal128).
Felt curious, tried asking Grok about this, it identified this approach
as the Goldschmidt Algorithm, OK. If so, kinda weird that I arrived at a well known (?) algorithm mostly by fiddling with it.
Looking on Wikipedia, though, this doesn't look like the same algorithm.
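For concreteness, a minimal C sketch of the coupled iteration described
above, in binary double purely for illustration (the real code operates
on Decimal128 soft-float values; the initial guess and iteration counts
here are placeholders):

#include <math.h>

/* sqrt via coupled Newton steps: C tracks sqrt(S), H tracks 1/C. */
static double sqrt_coupled(double S)
{
    int e, i;
    (void)frexp(S, &e);             /* just the exponent: S = m * 2^e */
    double C = ldexp(0.75, e / 2);  /* rough guess: exponent halved */
    double H = 1.0 / C;             /* stand-in for the approx reciprocal */

    /* "Thread the needle": damped 0.375 steps for stability. */
    for (i = 0; i < 3; i++) {
        C = C + (S - C*C) * (H * 0.375);
        H = 1.0 / C;                /* redo the reciprocal guess of C */
        H = H * (2.0 - C*H);        /* refine H */
    }
    /* Main passes: full 0.5 step for C, Newton step for H. */
    for (i = 0; i < 6; i++) {
        C = C + (S - C*C) * (H * 0.5);
        H = H * (2.0 - C*H);
    }
    return C;
}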
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/7/2025 9:29 AM, Thomas Koenig wrote:
----------snip-------------
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 11/6/2025 11:38 AM, Thomas Koenig wrote:
I think there is some gain in object code size to be had for things like
this, but it is probably modest.
The gain in instruction count is constant (sic) since one can represent
any FP constant as an operand with 1 instruction--what we are striving
for is code footprint.
One related question, and it is really a compiler question. Say I am
writing a program and I know I will need the value of pi say 10 times in
the source code. I decide to make my coding easier, and the source code
more compact by creating a constant, called PI, with a value of
3.14159..., then write the word PI instead of the numerical constant 10
times in the source code. Will/should the compiler generate inline
immediates for the ten references or will it generate a load of the
actual constant variable? Tradeoffs either way.
The number of instructions executed will be exactly the same,
the size of
the code footprint will be lower if/when the compiler can figure out
when to allocate PI into a register for some duration.
Currently, a) if there are free registers, and b) the constant is used
3 times, you gain 1 word of code footprint.
but (BUT), c) if there are no free registers, and d) the constant is
used more than 6 times, you gain your first word of code footprint.
So, it is a bit tricky trading off instruction count for instruction footprint.
On 11/8/2025 5:28 AM, Thomas Koenig wrote:
BGB <cr88192@gmail.com> schrieb:
I don't know yet if my implementation of DPD is actually correct.
The POWER ISA has a pretty good description, see the OpenPower
foundation.
Luckily, I have since figured it out and confirmed it.
DIV uses Newton-Raphson.
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less accurate.
So, it first uses a loop with hard-coded checks and scales to get it in
the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
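As a sketch of that bootstrap (binary double purely for illustration;
the thresholds and step sizes are stand-ins for the hard-coded checks
described above; assumes d > 0 and finite):

/* Walk a reciprocal guess x toward 1/d until N-R can take over. */
static double recip_bootstrap(double d)
{
    double x = 1.0;                    /* deliberately poor guess */
    for (;;) {
        double p = d * x;              /* p -> 1.0 as x -> 1/d */
        if (p < 0.5 || p > 2.0)        /* exponent is wrong */
            x *= (p > 1.0) ? 0.5 : 2.0;
        else if (p < 2.0/3.0 || p > 1.5)   /* off by more than 50% */
            x *= (p > 1.0) ? 0.75 : 1.25;
        else if (p < 0.8 || p > 1.25)  /* off by more than 25% */
            x *= (p > 1.0) ? 0.875 : 1.125;
        else
            break;                     /* good enough: let N-R take over */
    }
    for (int i = 0; i < 5; i++)        /* normal Newton-Raphson */
        x = x * (2.0 - d * x);
    return x;
}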
EricP <ThatWouldBeTelling@thevillage.com> posted:
Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
No, and I defer to you, or others here, on how these features are
implemented, specifically whether code modification is required. I was
referring to features such as assigned goto in Fortran, and Alter goto
in Cobol.
On modern architectures higher-order functions are implemented with
indirect branches or indirect calls (depending on whether it's a
tail-call or not); likewise for method dispatch.
I do not know how Lisp, FORTRAN, Algol 60 and other early languages
with higher-order functions were implemented on architectures that do
not have indirect branches; but if the assigned goto was implemented
with self-modifying code, the call to a function in a variable was
probably implemented like that, too.
What architecture cannot do an indirect branch, which I assume
means a branch/jump to a variable location in a register?
PDP-8, 4004, IBM 650, ... And any machine without "registers".
And how would the operating system on such a machine get programs running?
Load them at a known location and branch to the known location.
Even if an ISA did not have a JMP reg instruction one can create it
using CALL to copy the IP to the stack where you modify it and
RET to pop the new IP value.
Pure stack machines did a lot of this.
BGB <cr88192@gmail.com> posted:
On 11/8/2025 5:28 AM, Thomas Koenig wrote:
BGB <cr88192@gmail.com> schrieb:
I don't know yet if my implementation of DPD is actually correct.
The POWER ISA has a pretty good description, see the OpenPower
foundation.
Luckily, I have since figured it out and confirmed it.
Otherwise, fiddled with the division algorithm some more, and it is now
"slightly less awful", and converges a bit faster...
Relatedly, also added Square-Root...
My previous strategies for square-root didn't really work as effectively
in this case, so just sorta fiddled with stuff until I got something
that worked...
Algorithm I came up with (to find sqrt(S)):
Make an initial guess of the square root, calling it C;
Make an initial guess for the reciprocal of C, calling it H;
Take a few passes (threading the needle, *1):
C[n+1] = C + (S - C*C)*(H*0.375)
Redo approximate reciprocal of C, as H (*2);
Refine H: H=H*(2-C*H)
Enter main iteration pass:
C[n+1] = C + (S - C*C)*(H*0.5)
H[n+1] = H*(2 - C*H)  //(*3)
*1: Usual "try to keep stuff from flying off into space" step, using a
scale of 0.375 to undershoot convergence and increase stability (lower
means more stability but slower convergence; closer to 0.5 means faster,
but more likely to "fly off into space" depending on the accuracy of the
initial guesses).
*2: Seemed better to start over from a slightly better guess of C, than
to directly iterate from the initial (much less accurate) guess.
*3: Noting that if H is also converged, the convergence rate for C is
significantly improved (the gains from faster C convergence are enough
to offset the added cost of also converging H).
Seems to be effective, though still slower than divide (which is still
23x slower than an ADD or MUL).
SQRT should be 20%-30% slower than DIV.
In this case, the more complex algorithm is (ironically) partly
justified by the comparatively higher relative cost per operation (and
the issue that I can't resort to tricks like handling the
floating-point values as integers; that doesn't work so hot with
Decimal128).
If you have binary SQRT and a quick way from DFP128 to BFP32, take SQRT
in binary, convert back and do 2 iterations. Should be faster. {{I need
to remind some folks that {float; float; FDIV; fix} was faster than
IDIV on many 2nd-generation RISC machines.}}
Felt curious, tried asking Grok about this, it identified this approach
as the Goldschmidt Algorithm, OK. If so, kinda weird that I arrived at a
well known (?) algorithm mostly by fiddling with it.
Feels like it is 1965--does it not ?!?
Looking on Wikipedia, though, this doesn't look like the same
algorithm.
Goldschmidt is just N-R where the arithmetic has been rearranged so
that the multiplies are not data-dependent (unlike N-R). And for this
independence, GS lacks the automatic error correction N-R has.
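A bare-bones sketch of the contrast (binary double; the caller
pre-scales the denominator, and the iteration count is illustrative):

/* Goldschmidt: scale n and d by the same factor f each step, so the
   two multiplies per step are independent; d -> 1, hence n -> n/d.
   Assumes d already scaled into [0.5, 1). No self-correction: any
   rounding error in early steps persists, unlike N-R. */
static double goldschmidt_div(double n, double d)
{
    for (int i = 0; i < 6; i++) {
        double f = 2.0 - d;    /* correction factor */
        n *= f;                /* independent multiply #1 */
        d *= f;                /* independent multiply #2 */
    }
    return n;
}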
According to Scott Lurndal <slp53@pacbell.net>:
PDP-8, 4004, IBM 650, ... And any machine without "registers".
To be fair, addresses 10 through 17 in the PDP-8 were effectively
auto-increment registers and indirect branches were their
primary function. ....
I did a fair amount of PDP-8 programming and I don't ever recall using
the auto-index locations for branches. They were used to step
through a table of data, e.g. to add up a list of numbers:
10, 1007 ; list starts at 1010
100, -50 ; list is 50 (octal) words long
CLA
LOOP,
TAD I 10
ISZ 100
JMP LOOP
; sum is in the accumulator
I suppose you could use them for threaded code, but I didn't run into
any PDP-8 programs that used that.
Yes, mainly for data. I do have a vague recollection of
hand-disassembling the BASIC interpreter and finding some unexpected
indirect branches through 010-017.
BGB wrote:
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less
accurate.
So, it first uses a loop with hard-coded checks and scales to get it
in the general area, before then letting N-R take over. If the value
isn't close enough (seemingly +/- 25% or so), N-R flies off into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
My possibly naive idea would extract the top 9-15 digits from divisor
and dividend, convert both to binary FP, do the division and convert back.
That would reduce the N-R step to two or three iterations, right?
On 11/10/2025 1:16 AM, Terje Mathisen wrote:
BGB wrote:
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary
FP. Partly as the strategy for generating the initial guess is far
less accurate.
So, it first uses a loop with hard-coded checks and scales to get
it in the general area, before then letting N-R take over. If the
value isn't close enough (seemingly +/- 25% or so), N-R flies off
into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
My possibly naive idea would extract the top 9-15 digits from
divisor and dividend, convert both to binary FP, do the division
and convert back.
That would reduce the NR step to two or three iterations, right?
After adding code to convert to/from 'double', and using this
for initial reciprocal and square-root:
DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);
On Mon, 10 Nov 2025 13:54:23 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 1:16 AM, Terje Mathisen wrote:
BGB wrote:
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary
FP. Partly as the strategy for generating the initial guess is far
less accurate.
So, it first uses a loop with hard-coded checks and scales to get
it in the general area, before then letting N-R take over. If the
value isn't close enough (seemingly +/- 25% or so), N-R flies off
into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
My possibly naive idea would extract the top 9-15 digits from
divisor and dividend, convert both to binary FP, do the division
and convert back.
That would reduce the NR step to two or three iterations, right?
After adding code to convert to/from 'double', and using this
for initial reciprocal and square-root:
DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);
That is your timing for Decimal128 on modern desktop PC?
Dependent divisions or independent?
Even for dependent, it sounds slow.
Did you try to compare against brute force calculation using GMP? https://gmplib.org/
I.e. assuming that num < den < 10*num, use GMP to calculate 40 decimal
digits of the intermediate result y as follows:
Numx = num * 1e40;
y = Numx/den;
Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP; figure out
why).
If Yf != 5e5 then you are finished. Only in the extremely rare case
(1 in a million) of Yf == 5e5 will you have to calculate the remainder
of Numx/den to find the correct rounding.
Somehow, I suspect that on a modern PC even a non-optimized method like
the above will be faster than 670 usec.
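A rough GMP rendering of that recipe (names follow the description
above; this is a sketch of the idea, not a tuned implementation):

#include <gmp.h>

/* y = floor(num * 10^40 / den): a 40-digit intermediate quotient. */
static void div40(mpz_t y, const mpz_t num, const mpz_t den)
{
    mpz_t numx;
    mpz_init(numx);
    mpz_ui_pow_ui(numx, 10, 40);       /* numx = 10^40 */
    mpz_mul(numx, numx, num);          /* Numx = num * 1e40 */
    mpz_tdiv_q(y, numx, den);          /* y = Numx / den */

    /* Yf = y % 1e6 fits in plain 64-bit arithmetic, since only the
       low guard digits of y are needed. */
    unsigned long yf = mpz_tdiv_ui(y, 1000000);
    if (yf == 500000) {
        /* 1-in-a-million halfway case: compute the remainder of
           Numx/den to decide the rounding. */
    }
    mpz_clear(numx);
}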
On 11/10/2025 4:08 PM, Michael S wrote:
On Mon, 10 Nov 2025 13:54:23 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 1:16 AM, Terje Mathisen wrote:
BGB wrote:
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary
FP. Partly as the strategy for generating the initial guess is
far less accurate.
So, it first uses a loop with hard-coded checks and scales to get
it in the general area, before then letting N-R take over. If the
value isn't close enough (seemingly +/- 25% or so), N-R flies off
into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
My possibly naive idea would extract the top 9-15 digits from
divisor and dividend, convert both to binary FP, do the division
and convert back.
That would reduce the NR step to two or three iterations, right?
After adding code to convert to/from 'double', and using
this for initial reciprocal and square-root:
DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
MUL);
That is your timing for Decimal128 on modern desktop PC?
Dependent divisions or independent?
Even for dependent, it sounds slow.
Modern-ish...
I am running a CPU type that was originally released 7 years ago,
with slower RAM than it was designed to work with.
Did you try to compare against brute force calculation using GMP? https://gmplib.org/
I.e. assuming that num < den < 10*num, use GMP to calculate 40
decimal digits of the intermediate result y as follows:
Numx = num * 1e40;
y = Numx/den;
Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP; figure
out why).
If Yf != 5e5 then you are finished. Only in the extremely rare case
(1 in a million) of Yf == 5e5 will you have to calculate the
remainder of Numx/den to find the correct rounding.
Somehow, I suspect that on a modern PC even a non-optimized method
like the above will be faster than 670 usec.
Well, first step is building with GCC rather than MSVC...
It would appear that it gets roughly 79% faster when built with GCC.
So, around 2 million divides per second.
As for GMP, dividing two 40 digit numbers:
22 million per second.
If I do both a divide and a remainder:
16 million.
I don't really get what you are wanting me to measure exactly
though...
If I compare against the IBM decNumber library:
Multiply: 14 million.
Divide: 7 million
The decNumber library doesn't appear to have a square-root function...
Granted, there are possibly faster ways to do divide, versus using Newton-Raphson in this case...
It was not the point that I could pull the fastest possible
implementation out of thin air. But, it does appear I am beating
decNumber, at least for multiply performance and similar.
On Mon, 10 Nov 2025 21:25:47 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 4:08 PM, Michael S wrote:
On Mon, 10 Nov 2025 13:54:23 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 1:16 AM, Terje Mathisen wrote:
BGB wrote:
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary
FP. Partly as the strategy for generating the initial guess is
far less accurate.
So, it first uses a loop with hard-coded checks and scales to get
it in the general area, before then letting N-R take over. If the
value isn't close enough (seemingly +/- 25% or so), N-R flies off
into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
My possibly naive idea would extract the top 9-15 digits from
divisor and dividend, convert both to binary FP, do the division
and convert back.
That would reduce the NR step to two or three iterations, right?
After adding code to convert to/from 'double', and using
this for initial reciprocal and square-root:
DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
MUL);
That is your timing for Decimal128 on modern desktop PC?
Dependent divisions or independent?
Even for dependent, it sounds slow.
Modern-ish...
Zen2 ?
I consider it the last of the non-modern; Zen3 and Ice Lake are the
first of the modern. 128-by-64-bit integer division on Zen2 is still
quite slow, and the overall uArch is even less advanced than the
10-year-old Intel Skylake. In the majority of real-world workloads
that is partially compensated by Zen2's bigger L3 cache. In our case
a big cache does not help.
But even the last non-modern CPU should be able to divide faster than
your numbers suggest.
I am running a CPU type that was originally released 7 years ago,
with slower RAM than it was designed to work with.
Did you try to compare against brute force calculation using GMP?
https://gmplib.org/
I.e. assuming that num < den < 10*num, use GMP to calculate 40
decimal digits of the intermediate result y as follows:
Numx = num * 1e40;
y = Numx/den;
Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP; figure
out why).
If Yf != 5e5 then you are finished. Only in the extremely rare case
(1 in a million) of Yf == 5e5 will you have to calculate the
remainder of Numx/den to find the correct rounding.
Somehow, I suspect that on a modern PC even a non-optimized method
like the above will be faster than 670 usec.
Well, first step is building with GCC rather than MSVC...
It would appear that it gets roughly 79% faster when built with GCC.
So, around 2 million divides per second.
As for GMP, dividing two 40 digit numbers:
22 million per second.
If I do both a divide and a remainder:
16 million.
I don't really get what you are wanting me to measure exactly
though...
I want you to measure division of a 74-digit integer by a 34-digit
integer (the 34-digit numerator scaled by 1e40), because it is the
slowest part [of a brute-force implementation] of Decimal128
division. The rest of division is approximately the same as
multiplication.
So, [unoptimized] Decimal128 division time should be no worse than
t1+t2, where t1 is the duration of a Decimal128 multiplication and
t2 is the duration of the above-mentioned integer division. The
estimate is pessimistic, because post-division normalization tends
to be simpler than post-multiplication normalization.
Optimized division would be faster yet.
If I compare against the IBM decNumber library:
Multiply: 14 million.
Divide: 7 million
The decNumber library doesn't appear to have a square-root function...
Granted, there are possibly faster ways to do divide, versus using
Newton-Raphson in this case...
It was not the point that I could pull the fastest possible
implementation out of thin air. But, it does appear I am beating
decNumber, at least for multiply performance and similar.
On 11/11/2025 4:02 AM, Michael S wrote:
On Mon, 10 Nov 2025 21:25:47 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 4:08 PM, Michael S wrote:
On Mon, 10 Nov 2025 13:54:23 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 1:16 AM, Terje Mathisen wrote:
BGB wrote:
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary
FP. Partly as the strategy for generating the initial guess is
far less accurate.
So, it first uses a loop with hard-coded checks and scales to
get it in the general area, before then letting N-R take over.
If the value isn't close enough (seemingly +/- 25% or so), N-R
flies off into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
My possibly naive idea would extract the top 9-15 digits from
divisor and dividend, convert both to binary FP, do the division
and convert back.
That would reduce the NR step to two or three iterations, right?
After adding code to convert to/from 'double', and using
this for initial reciprocal and square-root:
DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
MUL);
That is your timing for Decimal128 on modern desktop PC?
Dependent divisions or independent?
Even for dependent, it sounds slow.
Modern-ish...
Zen2 ?
I consider it the last of the non-modern; Zen3 and Ice Lake are the
first of the modern. 128-by-64-bit integer division on Zen2 is still
quite slow, and the overall uArch is even less advanced than the
10-year-old Intel Skylake. In the majority of real-world workloads
that is partially compensated by Zen2's bigger L3 cache. In our case
a big cache does not help.
But even the last non-modern CPU should be able to divide faster than
your numbers suggest.
Zen+
Or, a slightly tweaked version of Zen1.
It is entirely possible to do a big-integer divide faster than this,
such as via shift-and-add.
But, as for decimal, this makes it harder.
I could do long division, but this is a much more complicated
algorithm (versus using Newton-Raphson).
But, N-R is slow, as it is basically a bunch of operations which are,
granted, themselves each kinda slow.
I am running a CPU type that was originally released 7 years ago,
with slower RAM than it was designed to work with.
Did you try to compare against brute force calculation using GMP?
https://gmplib.org/
I.e. assuming that num < den < 10*num, use GMP to calculate 40
decimal digits of the intermediate result y as follows:
Numx = num * 1e40;
y = Numx/den;
Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP; figure
out why).
If Yf != 5e5 then you are finished. Only in the extremely rare case
(1 in a million) of Yf == 5e5 will you have to calculate the
remainder of Numx/den to find the correct rounding.
Somehow, I suspect that on a modern PC even a non-optimized method
like the above will be faster than 670 usec.
Well, first step is building with GCC rather than MSVC...
It would appear that it gets roughly 79% faster when built with
GCC. So, around 2 million divides per second.
As for GMP, dividing two 40 digit numbers:
22 million per second.
If I do both a divide and a remainder:
16 million.
I don't really get what you are wanting me to measure exactly
though...
I want you to measure division of a 74-digit integer by a 34-digit
integer (the 34-digit numerator scaled by 1e40), because it is the
slowest part [of a brute-force implementation] of Decimal128
division. The rest of division is approximately the same as
multiplication.
So, [unoptimized] Decimal128 division time should be no worse than
t1+t2, where t1 is the duration of a Decimal128 multiplication and
t2 is the duration of the above-mentioned integer division. The
estimate is pessimistic, because post-division normalization tends
to be simpler than post-multiplication normalization.
Optimized division would be faster yet.
If it is a big-integer divide, this is not quite the same thing.
And, if I were to use big-integer divide (probably not via GMP
though,
this would be too big of a dependency), there is still the
issue of efficiently converting between big-integer and the "groups
of 9 digits in 32-bits" format.
This is partly why I removed the BID code:
At first, it seemed like the DPD and BID converters were similar
speed; but it turns out I was still testing the DPD converter, and
in fact the BID converter was significantly slower.
And, if I were going to do BID, would make more sense to do it as its
own thing, and build it mostly around 128-bit integer math.
But, in this case, I had decided to experiment with DPD.
Most likely, in this case if I wanted faster divide, that also played
well with the existing format, I would need to do long division or
similar.
If I compare against the IBM decNumber library:
Multiply: 14 million.
Divide: 7 million
The decNumber library doesn't appear to have a square-root
function...
Granted, there are possibly faster ways to do divide, versus using
Newton-Raphson in this case...
It was not the point that I could pull the fastest possible
implementation out of thin air. But, it does appear I am beating
decNumber, at least for multiply performance and similar.
I can note that while decNumber exists, at the moment it is over 10x
more code...
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
On 2025-11-05 23:28, MitchAlsup wrote:
----------------
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
But YOU had to pass the jumpbuf out of the setjump() scope.
Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.
That is not the issue. The question is if the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
So, label-variables are hard to define, but function-variables are not ?!?
According to Michael S <already5chosen@yahoo.com>:
I would imagine that in old times a return instruction was less common
than indirect addressing itself.
On several of the machines I used a subroutine call stored the return
address in the first word of the routine and branched to that address+1.
The return was just an indirect jump.
Stacks? What's a stack? We barely had registers.
On 2025-11-08 23:08, John Levine wrote:
According to Michael S <already5chosen@yahoo.com>:
I would imagine that in old times a return instruction was less common
than indirect addressing itself.
On several of the machines I used a subroutine call stored the return
address in the first word of the routine and branched to that address+1.
The return was just an indirect jump.
One such machine was the HP 2100; I used some of those.
Stacks? What's a stack? We barely had registers.
And indeed the Algol 60 compiler for the HP 2100 did not support
recursion. My programs did real-time control, so I wrote a small
non-preemptive but priority-driven multi-threading kernel. Thread
switch was easy as there were very few registers and no stack. But you
had to be careful because no subroutines were re-entrant.
Speaking of indirect addressing, the HP 2100 had a special feature: it
had a 64 KB address space, but with word addressing of 16-bit words, so
addresses were only 15 bits, leaving the MSbit in each word free.
When using indirect addressing there was an "indirect" bit in the
instruction which, in the usual way, made the machine use the 16-bit
content of the (directly) addressed word as the actual target address,
but only if the MSbit of that content was zero. If the MSbit was one,
it caused a further level of indirection, using the 15 other bits as
the address of another word that again would contain the actual target
address, if the MSbit of /that/ content was zero, and so on.
So an indirect instruction could cause a chain of indirections which
ended when an address-word had a zero in its MSbit. And the machine
could get stuck in an eternal indirection loop, which IIRC happened to
me once :-)
On 2025-11-06 20:28, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
On 2025-11-05 23:28, MitchAlsup wrote:
----------------
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
But YOU had to pass the jumpbuf out of the setjump() scope.
Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.
That is not the issue. The question is if the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
So, label-variables are hard to define, but function-variables are not
?!?
Depends on the level at which you want to define it.
At the machine level, where semantics are (usually) defined for each instruction separately, a jump to a dynamic address (using a "label-variable") is not much different from a call to a dynamic address (using a "function-variable"), and the effect of the single instruction
on the machine state is much the same as for the static address case.
The higher-level effect on the further execution of the program is out
of scope, whatever the actual value of the target address in the instruction.
It is only if your machine has some semantics for instruction
combinations, such as your VEC-LOOP pair, that you have to define what happens if a jump or call to some address leads to later executing only
some of those instructions or executing them in the wrong order, such as trying to execute a LOOP without having executed a preceding VEC.
At the higher programming-language level, the label case can be much
harder to define and less useful than the function case, depending on
the programming language and its abstract model of execution, and also depending on what compile-time checks you assume.
Consider an imperative language such as C with no functions nested
within other functions or other blocks (where by "block" I mean some syntactical construct that sets up its local context with local
variables etc.). If you have a function-variable (that is, a pointer to
a function) that actually refers to a function with the same parameter profile, it is easy to define the semantics of a call via this function variable: it is the same as for a call that names the referenced
function statically, and such a call is always legal. Problems arise
only if the function-variable has some invalid value such as NULL, or
the address of a function with a different profile, or some code address that does not refer to (the start of) a function. Such invalid values
can be prevented at compile time, except (usually) for NULL.
In the same language setting, the semantics of a jump using a
label-variable are easy to define only if the label-variable refers to a label in the same block as the jump. A jump from one block into another would mess up the context, omitting the set-up of the target block's
context and/or omitting the tear-down of the source block's context. The further results of program execution are machine-dependent and so
undefined behavior.
A compiler could enforce the label-in-same-block rule, but it seems that
GNU C does not do so.
In a programming language that allows nested functions the same kind of context-crossing problems arise for function-variables. Traditional languages solve them by allowing, at compile-time, calls via function-variables only if it is certain that the containing context of
the callee still exists (if the callee is nested), or by (expensively) preserving that context as a dynamically constructed closure. In either case, the caller's context never needs to be torn down to execute the
call, differing from the jump case.
In summary, jumps via label-variables are useful only for control
transfers within one function, and do not help to build up a computation
by combining several functions -- the main method of program design at present. In contrast, calls via function-variables are a useful
extension to static calls, actually helping to combine several functions
in a computation, as shown by the general adoption of
class/object/method coding styles.
Niklas
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value.
This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision. If it were
indicated by the NaN, software might be able to fix the result.
I also preserve the sign bit of the number in the NaN box.
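A hypothetical sketch of that tagging idea in C (field positions and
tag values are mine, purely illustrative of the scheme described
above):

#include <stdint.h>
#include <string.h>

#define BOX_PREC_HALF   1u      /* made-up tag values */
#define BOX_PREC_SINGLE 2u

/* Box a 32-bit float in a 64-bit quiet NaN: sign preserved in bit 63,
   a precision tag placed in the bit 32..51 payload region, and the
   float's remaining bits kept as the low payload. */
static uint64_t box_float(float f)
{
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    uint64_t sign = (uint64_t)(b >> 31) << 63;
    uint64_t qnan = 0x7FF8000000000000ull;          /* quiet-NaN bits */
    uint64_t tag  = (uint64_t)BOX_PREC_SINGLE << 32;
    return sign | qnan | tag | (b & 0x7FFFFFFFu);
}

/* Recover the precision tag; 0 means "not a boxed value". */
static unsigned boxed_precision(uint64_t v)
{
    if ((v & 0x7FF0000000000000ull) != 0x7FF0000000000000ull)
        return 0;            /* exponent not all-ones: a real double */
    return (unsigned)((v >> 32) & 0xFu);
}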
Test1 allocates a dynamic sized buffer and has a static goto Loop
for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
the stack allocation inside the {} block.
Test2 is the same but does a goto *dest and GCC does not generate
code to recover the inner {} block allocation. It just loops over
the sub rsp, rbx so the stack space just grows.
extern long Sub (long len, char *buf); /* assumed callee: nonzero loops */

void Test2 (long len)
{
long ok;
void *dest;
dest = &&Loop;
Loop:
{
char buf[len];
ok = Sub (len, buf);
if (ok)
goto *dest;
}
}
Test2(long):
push rbp
mov rbp, rsp
push r12
mov r12, rdi
push rbx
lea rbx, [rdi+15]
shr rbx, 4
sal rbx, 4
.L8:
sub rsp, rbx
mov rdi, r12
mov rsi, rsp
call Sub(long, char*)
test rax, rax
jne .L8
lea rsp, [rbp-16]
pop rbx
pop r12
pop rbp
ret
jmp .L2...
.L6:
mov rsp, rbx
.L2:
jne .L6
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
It is only if your machine has some semantics for instruction
combinations, such as your VEC-LOOP pair, that you have to define what
happens if a jump or call to some address leads to later executing only
some of those instructions or executing them in the wrong order, such as
trying to execute a LOOP without having executed a preceding VEC.
BTW, encountering a LOOP without encountering a VEC is a natural
occurrence when returning from exception or interrupt. The VEC
register points at the VEC+1 instruction, which makes it easy to
return to the VEC instruction.
According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
Speaking of indirect addressing, the HP 2100 had a special feature: it
had a 64 KB address space, but with word addressing of 16-bit words, so
addresses were only 15 bits, leaving the MSbit in each word free.
[multi-level indirect chains]
That was quite common back in the day.
The Data General Nova and Varian 620i (both popular for OEM
applications) did exactly the same thing, 15 bit addresses with the
high bit saying indirect.
The PDP-6/10 was a 36 bit machine with 18 bit addresses and a rather
overimplemented addressing scheme -- each instruction had an address,
an indirect bit, and an index register, so it added the address to the
index register (if the register number wasn't zero), then if the
indirect bit was set, fetched the addressed word and interpreted its
address, indirect bit, and index register the same way, ad infinitum.
An interesting question is what happened if a computer got into an indirect loop.
On 11/11/2025 11:46 AM, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
snip
It is only if your machine has some semantics for instruction
combinations, such as your VEC-LOOP pair, that you have to define what
happens if a jump or call to some address leads to later executing only
some of those instructions or executing them in the wrong order, such
as trying to execute a LOOP without having executed a preceding VEC.
BTW, encountering a LOOP without encountering a VEC is a natural
occurrence when returning from exception or interrupt. The VEC
register points at the VEC+1 instruction, which makes it easy to
return to the VEC instruction.
OK, but what if, say through an errant pointer, the code, totally
unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/11/2025 11:46 AM, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
snip
It is only if your machine has some semantics for instruction
combinations, such as your VEC-LOOP pair, that you have to define what
happens if a jump or call to some address leads to later executing only
some of those instructions or executing them in the wrong order, such
as trying to execute a LOOP without having executed a preceding VEC.
BTW, encountering a LOOP without encountering a VEC is a natural
occurrence when returning from exception or interrupt. The VEC
register points at the VEC+1 instruction, which makes it easy to
return to the VEC instruction.
OK, but what if, say through an errant pointer, the code, totally
unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
All taken branches clear the V-bit associated with vectorization.
So encountering the LOOP instruction would raise an exception.
Flow control WITHIN a VEC-LOOP pair is by predication-only.
Exception Control Transfer is special in this regard.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Test1 allocates a dynamic sized buffer and has a static goto Loop
for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
the stack allocation inside the {} block.
Test2 is the same but does a goto *dest and GCC does not generate
code to recover the inner {} block allocation. It just loops over
the sub rsp, rbx so the stack space just grows.
Interestingly, gcc optimizes the indirect branch with a constant
target into a direct branch, but then does not continue with the same
code as you get with a plain goto.
void Test2 (long len)
{
long ok;
void *dest;
dest = &&Loop;
Loop:
{
char buf[len];
ok = Sub (len, buf);
if (ok)
goto *dest;
}
}
Test2(long):
push rbp
mov rbp, rsp
push r12
mov r12, rdi
push rbx
lea rbx, [rdi+15]
shr rbx, 4
sal rbx, 4
.L8:
sub rsp, rbx
mov rdi, r12
mov rsi, rsp
call Sub(long, char*)
test rax, rax
jne .L8
lea rsp, [rbp-16]
pop rbx
pop r12
pop rbp
ret
Interesting that this bug has not been fixed in the >33 years that labels-as-values have been in gcc; I don't know how long these
dynamically sized arrays have been in gcc, but IIRC alloca(), a
similar feature, has been available at least as long as
labels-as-values. The bug has apparently been avoided or worked
around by the users of labels-as-values (e.g., Gforth does not use
alloca or dynamically-sized arrays in the function that contains all
the taken labels and all the "goto *"s).
As long as all taken labels have the same stack depth, the bugfix does
not look particularly hard: just put code before each goto * that
adjusts the stack depth to the depth of these labels.
Things become more interesting if there are labels with different
stack depths, because labels are stored in "void *" variables, and
there is not enough room for a target and a stack depth. One can use
the same approach as is used in Test1, however: have the stack depth
for a specific target in some location, and have a copy from that
location to the stack pointer right behind the label.
....
jmp .L2....
.L6:
mov rsp, rbx
.L2:
jne .L6
All the code that works now would not need these extra copy
instructions, so the bugfix should special-case the case where all the
targets have the same depth.
- anton
Robert Finch <robfi680@gmail.com> posted:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
Any FP value representable in lower precision can be exactly represented
in higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value.
When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
but I thought it was best to point at the causing-instruction and an
encoded "why" the NaN was generated. The cause is a 3-bit index to the
7 defined IEEE exceptions.
There are rules for when more than one NaN is an operand to an
instruction, designed to leave the more important NaN as the result.
{Where more important is generally the first to be generated.}
This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision. If it were
indicated by the NaN, software might be able to fix the result.
I think it is better to fix the SW that thinks a (half) is a (float).
I also preserve the sign bit of the number in the NaN box.
Robert Finch <robfi680@gmail.com> schrieb:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.
Do you mean a type mismatch, a conversion, or digits lost due to cancellation?
If it were
indicated by the NaN software might be able to fix the result.
Fixing a result after an NaN has occurred is too late, I think.
I also
preserve the sign bit of the number in the NaN box.
On Tue, 11 Nov 2025 04:44:48 -0600
BGB <cr88192@gmail.com> wrote:
On 11/11/2025 4:02 AM, Michael S wrote:
On Mon, 10 Nov 2025 21:25:47 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 4:08 PM, Michael S wrote:
On Mon, 10 Nov 2025 13:54:23 -0600
BGB <cr88192@gmail.com> wrote:
On 11/10/2025 1:16 AM, Terje Mathisen wrote:
BGB wrote:
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary
FP. Partly as the strategy for generating the initial guess is
far less accurate.
So, it first uses a loop with hard-coded checks and scales to
get it in the general area, before then letting N-R take over.
If the value isn't close enough (seemingly +/- 25% or so), N-R
flies off into space.
Namely:
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
My possibly naive idea would extract the top 9-15 digits from
divisor and dividend, convert both to binary FP, do the division
and convert back.
That would reduce the N-R step to two or three iterations, right?
After adding code to convert to/from 'double', and using
this for initial reciprocal and square-root:
DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
MUL);
That is your timing for Decimal128 on modern desktop PC?
Dependent divisions or independent?
Even for dependent, it sounds slow.
Modern-ish...
Zen2 ?
I consider it the last of the non-modern; Zen3 and Ice Lake are the
first of the modern. 128-by-64-bit integer division on Zen2 is still
quite slow, and the overall uArch is even less advanced than the
10-year-old Intel Skylake. In the majority of real-world workloads
that is partially compensated by Zen2's bigger L3 cache. In our case
a big cache does not help.
But even the last non-modern CPU should be able to divide faster than
your numbers suggest.
Zen+
Or, a slightly tweaked version of Zen1.
It is very well possible to do big integer divide faster than this.
Such as via shift-and-add.
But, as for decimal, this makes it harder.
I could do long division, but this is a much more complicated
algorithm (versus using Newton-Raphson).
But, N-R is slow as it is basically a bunch of operations, which are
granted themselves, each kinda slow.
I am running a CPU type that was originally released 7 years ago,
with slower RAM than it was designed to work with.
Did you try to compare against brute force calculation using GMP?
https://gmplib.org/
I.e. assuming that num < den < 10*num, use GMP to calculate 40
decimal digits of the intermediate result y as follows:
Numx = num * 1e40;
y = Numx/den;
Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP; figure
out why).
If Yf != 5e5 then you are finished. Only in the extremely rare case
(1 in a million) of Yf == 5e5 will you have to calculate the
remainder of Numx/den to find the correct rounding.
Somehow, I suspect that on a modern PC even a non-optimized method
like the above will be faster than 670 usec.
Well, first step is building with GCC rather than MSVC...
It would appear that it gets roughly 79% faster when built with
GCC. So, around 2 million divides per second.
As for GMP, dividing two 40 digit numbers:
22 million per second.
If I do both a divide and a remainder:
16 million.
I don't really get what you are wanting me to measure exactly
though...
I want you to measure division of a 74-digit integer by a 34-digit
integer (the 34-digit numerator scaled by 1e40), because it is the
slowest part [of a brute-force implementation] of Decimal128
division. The rest of division is approximately the same as
multiplication.
So, [unoptimized] Decimal128 division time should be no worse than
t1+t2, where t1 is the duration of a Decimal128 multiplication and
t2 is the duration of the above-mentioned integer division. The
estimate is pessimistic, because post-division normalization tends
to be simpler than post-multiplication normalization.
Optimized division would be faster yet.
If it is a big-integer divide, this is not quite the same thing.
And, if I were to use big-integer divide (probably not via GMP
though,
Certainly not via GMP in the final product. But doing the 1st version
via GMP makes perfect sense.
this would be too big of a dependency), there is still the
issue of efficiently converting between big-integer and the "groups
of 9 digits in 32-bits" format.
No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
binary 'digits' per 64-bit word.
This is partly why I removed the BID code:
At first, it seemed like the DPD and BID converters were similar
speed; But, turns out I was still testing the DPD converter, and
in-fact the BID converter was significantly slower.
DPD-specific code and algorithms make sense for multiplication.
They likely make sense for addition/subtraction as well; I didn't try
to think deeply about it.
But for division I wouldn't bother with DPD-specific things. Just
convert the mantissa from DPD to binary, then divide, normalize,
round, then convert back.
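For the conversion step, a small sketch (the 128-bit type is a
GCC/Clang extension; the four-limb layout matches the "groups of 9
digits per 32-bit word" form mentioned earlier, so 36 digits cover the
34-digit mantissa):

#include <stdint.h>

/* Base-10^9 limbs, most-significant group first, to plain binary. */
static unsigned __int128 groups_to_bin(const uint32_t g[4])
{
    unsigned __int128 v = 0;
    for (int i = 0; i < 4; i++)
        v = v * 1000000000u + g[i];   /* each limb is < 10^9 */
    return v;
}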
And, if I were going to do BID, would make more sense to do it as its
own thing, and build it mostly around 128-bit integer math.
But, in this case, I had decided to experiment with DPD.
Most likely, in this case if I wanted faster divide, that also played
well with the existing format, I would need to do long division or
similar.
If I compare against the IBM decNumber library:
Multiply: 14 million.
Divide: 7 million
The decNumber library doesn't appear to have a square-root
function...
Granted, there are possibly faster ways to do divide, versus using
Newton-Raphson in this case...
It was not the point that I could pull the fastest possible
implementation out of thin air. But, it does appear I am beating
decNumber, at least for multiply performance and similar.
I can note that while decNumber exists, at the moment it is over 10x
more code...
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range, instruction<8:5> specifies the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +Imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
Here we have access to {5, 32, 64}-bit constants, 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and
logical, range {-15.5..15.5} for floating point.
On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.
Do you mean a type mismatch, a conversion, or digits lost due to
cancellation?
It would be an input type mismatch. If it were
indicated by the NaN software might be able to fix the result.
Fixing a result after an NaN has occurred is too late, I think.
I suppose the float package could always just automatically upgrade the
precision from lower to higher when it goes to do the calculation. But
maybe with a trace warning. It would be able to if the precision were
indicated in the NaN.
On 11/11/2025 6:03 AM, Michael S wrote:
On Tue, 11 Nov 2025 04:44:48 -0600
BGB <cr88192@gmail.com> wrote:
Certainly not via GMP in final product. But doing 1st version via
GMP makes perfect sense.
GMP is only really an option for targets where GMP exists;
Needed to jump over to GCC in WSL just to test GMP here.
If avoidable, you don't want to use anything beyond the C standard
library, and ideally limit things to a C95 style dialect for maximum portability.
Granted, it does appear like the GMP divider is faster than expected.
Like, possibly something faster than "ye olde shift-and-subtract".
Though, can note a curious property:
This code is around 79% faster when built with GCC vs MSVC;
In GCC, the relative speed of MUL and ADD trade places:
In MSVC, MUL is faster;
In GCC, ADD is faster.
Though, the code in question tends to frequently use struct members
directly, rather than caching multiply-accessed struct members in
local variables. MSVC tends not to fully optimize away this sort of
thing, whereas GCC tends to act as if the struct members had in fact
been cached in local variables.
this would be too big of a dependency), there is still the
issue of efficiently converting between big-integer and the "groups
of 9 digits in 32-bits" format.
No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
binary 'digits' per 64-bit word.
Alas, the code was written mostly to use 9-digit groupings, and going between 9-digit groupings and 128-bit integers is a bigger chunk of
code than I want to have for this.
This would mean an additional ~ 500 LOC, plus probably whatever code
I need to do a semi-fast 256 by 128 bit integer divider.
This is partly why I removed the BID code:
At first, it seemed like the DPD and BID converters were similar
speed; But, turns out I was still testing the DPD converter, and
in fact the BID converter was significantly slower.
DPD-specific code and algorithms make sense for multiplication.
They likely make sense for addition/subtraction as well, I didn't
try to think deeply about it.
But for division I wouldn't bother with DPD-specific things. Just
convert mantissa from DPD to binary, then divide, normalize, round
then convert back.
It is the 9-digit-decimal <-> Large Binary Integer converter step
that is the main issue here.
Going to/from 128-bit integer adds a few "there be dragons here"
issues regarding performance.
At the moment, I don't have a fast (and correct) converter between
these two representations (that also does not rely on any external
libraries or similar; or nothing outside of the C standard library).
Like, if you need to crack 128 bits into 9-digit chunks using 128-bit divide, and if the 128-bit divider in question is a
shift-and-subtract loop, this sucks.
There are faster ways to do multiply by powers of 10, but divide by powers-of-10 is still a harder problem at the moment.
Well, and also there is the annoyance that it is difficult to write
an efficient 128-bit integer multiply if staying within the limits of portable C95.
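For reference, the usual portable workaround is the schoolbook split
into 32-bit halves (a sketch; it needs only 64-bit unsigned arithmetic,
which a C95-style dialect typically gets via the long-long extension):

  void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
  {
      /* four 32x32->64 partial products, then accumulate */
      uint64_t al = a & 0xFFFFFFFFu, ah = a >> 32;
      uint64_t bl = b & 0xFFFFFFFFu, bh = b >> 32;
      uint64_t p0 = al * bl, p1 = ah * bl, p2 = al * bh, p3 = ah * bh;
      uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);
      *lo = (p0 & 0xFFFFFFFFu) | (mid << 32);
      *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
  }

Four multiplies and a handful of adds per 64x64 step is exactly the
inefficiency being complained about.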
...
Goes off and tries a few things:
128-bit integer divider;
Various attempts at decimal long divide;
...
Thus far, things have either not worked correctly, or have ended up
slower than the existing Newton-Raphson divider.
The most promising option would be Radix-10e9 long division, but I
couldn't get this working thus far.
Did also try Radix-10 long division (working on 72 digit sequences),
but this was slower than the existing N-R divider.
One possibility could be to try doing divide with Radix-10 in an
unpacked BCD variant (likely using bytes from 0..9). Here, compare
and subtract would be slower, but shifting could be faster, and allows
a faster way (lookup tables) to find "A goes into B, N times".
I still don't have much confidence in it though.
Radix-10e9 has a higher chance of OK performance, if I could get the
long-division algo to work correctly with it. Thus far, I was having
difficulty getting it to give the correct answer. Integer divide was
tending to overshoot the "A goes into B N times" logic, and trying to
fudge it (e.g., by adding 1 to the initial divisor) wasn't really
working; I kinda need an accurate answer here, and a reliable way to
scale and add the divisor, ...
Granted, one possibility could be to expand out each group of 9
digits to 64 bits, so effectively it has an intermediate 10 decimal
digits of headroom (or two 10e9 "digits").
But, yeah, long-division is a lot more of a PITA than N-R or shift-and-subtract.
In article <1762377694-5857@newsgrouper.org>,
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range; instruction<8:5> specify the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
Here we have access to {5, 32, 64}-bit constants, 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and
logical, range {-15.5..15.5} for floating point.
For FP, Arm32 has an 8-bit immediate turned into an FP number as follows:
sign = imm8<7>;
exp = NOT(imm8<6>):Replicate(imm8<6>,E-3):imm8<5:4>;
frac = imm8<3:0>:Zeros(F-4);
result = sign : exp : frac;
For Float, exp[7:0] can be 0x80-0x83 or 0x7c-0x7f, which is 2^1 through 2^4 and 2^-3 through 2^0. And the mantissa upper 4 bits are from the immediate field. Note that 0.0 is not encodeable, and I'm going to assume you
don't need it either.
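Spelled out for single precision (E = 8, F = 23), the expansion above
is (a sketch; the function name is illustrative):

  uint32_t vfp_expand_imm8(uint8_t imm8)
  {
      uint32_t sign = (imm8 >> 7) & 1;
      uint32_t b6   = (imm8 >> 6) & 1;
      /* exp = NOT(imm8<6>) : Replicate(imm8<6>, 5) : imm8<5:4> */
      uint32_t exp  = ((b6 ^ 1u) << 7) | ((b6 ? 0x1Fu : 0u) << 2)
                    | ((imm8 >> 4) & 3u);
      uint32_t frac = (uint32_t)(imm8 & 0xFu) << 19; /* imm8<3:0> : Zeros(19) */
      return (sign << 31) | (exp << 23) | frac;      /* IEEE 754 binary32 bits */
  }

E.g., imm8 = 0x70 yields exponent 0x7F and a zero fraction, i.e. 1.0.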
For your FP, the sign comes from elsewhere, so you have 5 bits for the
FP number. I suggest you use the Arm32 encoding for the exponent (using
3 bits), and then set the upper 2 bits of the mantissa from the remaining
two immediate bits.
This encodes integers from 1.0 through 8.0, and can also encode 10.0, 12.0, 14.0, 16.0, 20.0, 24.0, and 28.0. And it can do 0.5, 1.5, 2.5, 3.5.
And it can encode 0.125 and 0.25.
This encoding makes a lot of sense from ease of decode. However, it
would be nice to be able to encode 100.0, 1000.0 and .1, .01 and .001, each of which is likely to be more useful than 12.0 or 3.5.
From a compiler standpoint, having arbitrary constants is perfectly fine,
it can just look up if it's available.
So you can make 1000.0 and .001
and PI and lg2(e) and ln(2), and whatever available, if you want.
GCC looks up Arm64 integer 13-bit immediates in a hashtable--the encoding
is almost a one-way function, so it's just faster to look it up rather
than try to figure out if 0xaaaaaaaa is encodeable by inspecting the value.
So something similar could be done for FP constants. Since the values will be fixed, a perfect hash can be created ensuring it's a fast lookup.
Kent
On 11/11/2025 4:31 PM, MitchAlsup wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/11/2025 11:46 AM, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
snip
It is only if your machine has some semantics for instruction
combinations, such as your VEC-LOOP pair, that you have to define what
happens if a jump or call to some address leads to later executing only
some of those instructions or executing them in the wrong order, such as
trying to execute a LOOP without having executed a preceding VEC.
BTW, encountering a LOOP without encountering a VEC is a natural
occurrence when returning from exception or interrupt. The VEC
register points at the VEC+1 instruction, from which it is easy to
return to the VEC instruction.
OK, but what if, say through an errant pointer, the code, totally
unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
All taken branches clear the V-bit associated with vectorization.
So encountering the LOOP instruction would raise an exception.
Seems like the right thing to do. I believe this resolves Niklas's issue.
On 2025-11-12 3:18, Stephen Fuld wrote:
On 11/11/2025 4:31 PM, MitchAlsup wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/11/2025 11:46 AM, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
snip
It is only if your machine has some semantics for instruction
combinations, such as your VEC-LOOP pair, that you have to define what
happens if a jump or call to some address leads to later executing only
some of those instructions or executing them in the wrong order, such as
trying to execute a LOOP without having executed a preceding VEC.
BTW, encountering a LOOP without encountering a VEC is a natural
occurrence when returning from exception or interrupt. The VEC
register points at the VEC+1 instruction, from which it is easy to
return to the VEC instruction.
OK, but what if, say through an errant pointer, the code, totally
unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
All taken branches clear the V-bit associated with vectorization.
So encountering the LOOP instruction would raise an exception.
Seems like the right thing to do. I believe this resolves Niklas's issue.
Yes, in the sense that this example supports my statement (above) that
in a machine that has instruction combinations (like VEC-LOOP) that must
be executed in a certain order, it is necessary to address what happens
if a jump or call breaks that order, complicating the semantics
definition. I agree that an exception seems the right thing to do here,
and I expected it.
Connecting this to the labels-as-values discussion, this means that a C compiler that compiles a C loop into a VEC-LOOP machine loop, and allows
a "goto" to a label within that loop, from outside the loop, would
result in execution that fails due to this exception, whether the label
is statically named or referenced by a label-valued variable. So I would wish that the compiler would prevent that at compile time, to avoid
possible UB.
Robert Finch <robfi680@gmail.com> schrieb:
On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.
Do you mean a type mismatch, a conversion, or digits lost due to
cancellation?
It would be an input type mismatch.
I think this can only happen when software is buggy; compilers should
deal with it, unless the user intentionally accesses data with
the wrong type.
If it were indicated by the NaN software might be able to fix the result.
Fixing a result after an NaN has occurred is too late, I think.
I suppose the float package could always just automatically upgrade the
precision from lower to higher when it goes to do the calculation. But
maybe with a trace warning. It would be able to if the precision were
indicated in the NaN.
I have implemented a few warnings about conversions in gfortran.
For example, -Wconversion-extra gives you, for the program
program main
print *,0.3333333333
end program main
the warning
2 | print *,0.3333333333
| 1
Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]
But my favorite is
3 | print *,a**(3/5)
| 1
Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]
which (presumably) has caught that particular idiom in a few codes.
Niklas Holsti wrote:
On 2025-11-06 20:28, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
On 2025-11-05 23:28, MitchAlsup wrote:
----------------
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
But YOU had to pass the jumpbuf out of the setjump() scope.
Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.
That is not the issue. The question is if the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
So, label-variables are hard to define, but function-variables are not
?!?
Depends on the level at which you want to define it.
At the machine level, where semantics are (usually) defined for each
instruction separately, a jump to a dynamic address (using a
"label-variable") is not much different from a call to a dynamic address
(using a "function-variable"), and the effect of the single instruction
on the machine state is much the same as for the static address case.
The higher-level effect on the further execution of the program is out
of scope, whatever the actual value of the target address in the
instruction.
It is only if your machine has some semantics for instruction
combinations, such as your VEC-LOOP pair, that you have to define what
happens if a jump or call to some address leads to later executing only
some of those instructions or executing them in the wrong order, such as
trying to execute a LOOP without having executed a preceding VEC.
At the higher programming-language level, the label case can be much
harder to define and less useful than the function case, depending on
the programming language and its abstract model of execution, and also
depending on what compile-time checks you assume.
Consider an imperative language such as C with no functions nested
within other functions or other blocks (where by "block" I mean some
syntactical construct that sets up its local context with local
variables etc.). If you have a function-variable (that is, a pointer to
a function) that actually refers to a function with the same parameter
profile, it is easy to define the semantics of a call via this function
variable: it is the same as for a call that names the referenced
function statically, and such a call is always legal. Problems arise
only if the function-variable has some invalid value such as NULL, or
the address of a function with a different profile, or some code address
that does not refer to (the start of) a function. Such invalid values
can be prevented at compile time, except (usually) for NULL.
In the same language setting, the semantics of a jump using a
label-variable are easy to define only if the label-variable refers to a
label in the same block as the jump. A jump from one block into another
would mess up the context, omitting the set-up of the target block's
context and/or omitting the tear-down of the source block's context. The
further results of program execution are machine-dependent and so
undefined behavior.
A compiler could enforce the label-in-same-block rule, but it seems that
GNU C does not do so.
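For reference, the well-defined use keeps the label values and the
computed goto inside one function, as in this minimal GNU C
threaded-dispatch sketch (names are illustrative):

  int run(const unsigned char *prog)
  {
      /* label addresses are taken and jumped to only in this function */
      void *dispatch[] = { &&op_halt, &&op_inc, &&op_dec };
      int acc = 0;
      goto *dispatch[*prog];
  op_inc:  acc++; goto *dispatch[*++prog];
  op_dec:  acc--; goto *dispatch[*++prog];
  op_halt: return acc;
  }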
In a programming language that allows nested functions the same kind of
context-crossing problems arise for function-variables. Traditional
languages solve them by allowing, at compile-time, calls via
function-variables only if it is certain that the containing context of
the callee still exists (if the callee is nested), or by (expensively)
preserving that context as a dynamically constructed closure. In either
case, the caller's context never needs to be torn down to execute the
call, differing from the jump case.
In summary, jumps via label-variables are useful only for control
transfers within one function, and do not help to build up a computation
by combining several functions -- the main method of program design at
present. In contrast, calls via function-variables are a useful
extension to static calls, actually helping to combine several functions
in a computation, as shown by the general adoption of
class/object/method coding styles.
Niklas
I was curious about the interaction between dynamic stack allocations
and goto variables to see if it handled the block scoping correctly.
Ada should have the same issues as C.
It appears GCC x86-64 15.2 with -O3 does not properly recover
stack space with dynamic gotos.
Test1 allocates a dynamic sized buffer and has a static goto Loop
for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
the stack allocation inside the {} block.
Test2 is the same but does a goto *dest and GCC does not generate
code to recover the inner {} block allocation. It just loops over
the sub rsp, rbx so the stack space just grows.
long Sub (long len, char buf[]);
void Test1 (long len)
{
long ok;
Loop:
{
char buf[len];
ok = Sub (len, buf);
if (ok)
goto Loop;
}
}
Thomas Koenig <tkoenig@netcologne.de> posted:
Robert Finch <robfi680@gmail.com> schrieb:
On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.
Do you mean a type mismatch, a conversion, or digits lost due to
cancellation?
It would be an input type mismatch.
I think this can only happen when software is buggy; compilers should
deal with it, unless the user intentionally accesses data with
the wrong type.
If it were indicated by the NaN software might be able to fix the result.
Fixing a result after an NaN has occurred is too late, I think.
I suppose the float package could always just automatically upgrade the
precision from lower to higher when it goes to do the calculation. But
maybe with a trace warning. It would be able to if the precision were
indicated in the NaN.
I have implemented a few warnings about conversions in gfortran.
For example, -Wconversion-extra gives you, for the program
program main
print *,0.3333333333
end program main
the warning
2 | print *,0.3333333333
| 1
Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]
But my favorite is
3 | print *,a**(3/5)
BTW, this works in eXcel where 3/5 = 0.6
AND, in My 66000, a**0.6 is a single instruction. ...
| 1
Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]
which (presumably) has caught that particular idiom in a few codes.
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
void Test2 (long len)
{
long ok;
void *dest;
dest = &&Loop;
Loop:
{
char buf[len];
ok = Sub (len, buf);
if (ok)
goto *dest;
}
}
Test2(long):
push rbp
mov rbp, rsp
push r12
mov r12, rdi
push rbx
lea rbx, [rdi+15]
shr rbx, 4
sal rbx, 4
.L8:
sub rsp, rbx
mov rdi, r12
mov rsi, rsp
call Sub(long, char*)
test rax, rax
jne .L8
lea rsp, [rbp-16]
pop rbx
pop r12
pop rbp
ret
Interesting that this bug has not been fixed in the >33 years that
labels-as-values have been in gcc; I don't know how long these
dynamically sized arrays have been in gcc, but IIRC alloca(), a
similar feature, has been available at least as long as
labels-as-values. The bug has apparently been avoided or worked
around by the users of labels-as-values (e.g., Gforth does not use
alloca or dynamically-sized arrays in the function that contains all
the taken labels and all the "goto *"s).
alloca is not required to recover storage at the {} block level.
But when they added dynamic allocation to C as a first class feature
I figured it should recover storage at the end of a {} block,
and I wondered if the superficially non-deterministic nature of
goto variables would be a problem.
This all relates to Niklas's comments as to why the label variables must
all be within the current context, so it knows when to recover storage.
If the language had destructors the goto variable would have to call
them, which alloca also does not deal with.
long Sub (long len, char buf[]);
void Test3 (long len)
{
long ok, dest;
dest = 0;
Loop:
{
char buf[len];
ok = Sub (len, buf);
if (ok)
dest = 1;
switch (dest)
{
case 0:
goto Loop;
case 1:
goto Out;
}
Out:
;
}
}
I almost agree, except for C95.
Also, I wouldn't consider such project without few extensions of
standard language. As a minimum:
- ability to get upper 64 bit of 64b*64b product
- convenient way to exploit 64-bit add with carry
IIRC there is a clear statement in the C standard that you are not
allowed to jump into a scope after a dynamic declaration. This
restriction exists because otherwise the compiler would need some twisty
logic to run allocation code.
With label variables that obviously
generalizes to jumps outside of the scope of dynamic allocation:
So the natural restriction is: when jumping to a label variable,
dynamic locals may be released only at function exit.
Michael S <already5chosen@yahoo.com> writes:
I almost agree, except for C95.
What is C95? I only know of C89/90, C99, C11, C23.
Also, I wouldn't consider such project without few extensions of
standard language. As a minimum:
- ability to get upper 64 bit of 64b*64b product
- convenient way to exploit 64-bit add with carry
I have explored these topics recently in "Multi-precision integer arithmetics" <http://www.complang.tuwien.ac.at/anton/tmp/carry2.pdf>.
Actually, with uint128_t you get pretty far, and _BitInt(bits) has
been added in C23, which has good potential, but is not quite there.
Builtins for add-with-carry and intrinsics are somewhat disappointing.
- anton
antispam@fricas.org (Waldek Hebisch) writes:
IIRC there is clear statement in the C standard that you are not
allowed to jump into a scope after a dynamic declaration. This
restriction is because otherwise compiler would need some twisty
logic to run allocation code.
Not just that. If the dynamic definition is not executed, it's
unclear how much should be allocated. Consider:
n=-5;
goto L;
n = m; // dead code
{
int x[n]; // dead code
n=0; // dead code
L:
... x[3] ...
...
}
With label variables that obviously
generalizes to jumps outside of the scope of dynamic allocation:
This is a use of "obviously" that wants the reader to skip thinking
about the issue (and maybe the writer has not thought about it,
either). But actually, the cases are completely different.
If control flow passed through the dynamic definition on the way to
the goto, the stack depth in its scope is known, and can be restored
when performing the goto, as I showed in <2025Nov13.094235@mips.complang.tuwien.ac.at>.
So the natural restriction is: when jumping to a label variable,
dynamic locals may be released only at function exit.
A compiler bug is not a natural restriction. Of course, the gcc
people might decide not to fix the bug (after all, no production code
is affected by this bug), and declare it undefined behaviour to, say,
perform a goto * inside a scope with a dynamic array that jumps
outside the scope, but if they do something like this, it's a human
decision based on a cost-benefit analysis, not something natural.
GNU C has no destructors.
It has, in limited form, via __attribute__((__cleanup__(...))).
On Thu, 13 Nov 2025 09:24:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Actually, with uint128_t you get pretty far, and _BitInt(bits) has
been added in C23, which has good potential, but is not quite there.
Yes, that's what I wrote above.
As far as BGB is concerned, the big disadvantage is absence of support
by MSVC.
Builtins for add-with-carry and intrinsics are somewhat disappointing.
- anton
For me the most disappointing part is that different architectures
have different spellings.
Other than that even gcc now mostly able to generate
decent code for Intel's variant. MSVC and clang were able to do it for
very long time.
Or do you have in mind new gcc intrinsic in a group "Arithmetic with
Overflow Checking" ?
On Tue, 11 Nov 2025 21:34:08 -0600
C99 is, may be, too much, but C99 sub/super set known as C11 sounds
about right.
Also, I wouldn't consider such project without few extensions of
standard language. As a minimum:
- ability to get upper 64 bit of 64b*64b product
- convenient way to exploit 64-bit add with carry
- MS _BitScanReverse64 or Gnu __builtin_ctzll or equivalent
Not really.
That is, conversions are not blazingly fast, but still much better
than any attempt to divide in any form of decimal. And helps to
preserve your sanity.
There is also a psychological factor at play - your users expect
division and square root to be slower than other primitive FP
operations, so they are not disappointed. Possibly they are even
pleasantly surprised, when they find out that the difference in
throughput between division and multiplication is smaller than factor
20-30 that they were accustomed to for 'double' on their 20 y.o. Intel
and AMD.
Michael S <already5chosen@yahoo.com> writes:
On Thu, 13 Nov 2025 09:24:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Actually, with uint128_t you get pretty far, and _BitInt(bits) has
been added in C23, which has good potential, but is not quite there.
Yes, that's what I wrote above.
As far as BGB is concerned, the big disadvantage is absence of support
by MSVC.
Why would that be a disadvantage? If MSVC does not do what he needs,
there are other C compilers to choose from.
Builtins for add-with-carry and intrinsics are somewhat disappointing.
- anton
For me the most disappointing part is that different architectures
have different spellings.
For intrinsics that's by design. They are essentially a way to write assembly language instructions in Fortran or C. And assembly language
is compiler-specific.
Other than that even gcc now mostly able to generate
decent code for Intel's variant. MSVC and clang were able to do it for
very long time.
When using the Intel intrinsic c_out = _addcarry_u64(c_in, s1, s2, &sum),
the code from both gcc and clang uses adcq, but cannot preserve the
carry in CF in a loop, and moves it into a register right after the
adcq, and back from the register to CF right before:
addb $-1, %r8b
adcq (%rdx,%rax,8), %r9
setb %r8b
If you (or compiler unrolling) have several _addcarry_u64 in a row,
with the carry-out becoming the carry-in of the next one, at least one
of these compilers manages to eliminate the overhead between these
adcqs, but of course not at the start and end of the sequence.
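For concreteness, the loop shape under discussion (a sketch; add_n and
its arguments are hypothetical names, _addcarry_u64 is the Intel
intrinsic from <immintrin.h>):

  /* r = a + b over n 64-bit words, least significant first;
     returns the final carry-out */
  unsigned char add_n(uint64_t *r, const uint64_t *a,
                      const uint64_t *b, size_t n)
  {
      unsigned char c = 0;
      for (size_t i = 0; i < n; i++)
          c = _addcarry_u64(c, a[i], b[i],
                            (unsigned long long *)&r[i]);
      return c;
  }

It is the loop-carried c above that gcc and clang shuffle between CF
and a register at the loop edges.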
Or do you have in mind new gcc intrinsic in a group "Arithmetic with
Overflow Checking"?
These are gcc builtins, not intrinsics. The difference is that they
work on all architectures. However, when I looked (three months ago),
gcc did not have a builtin with carry-in; the builtins you mention
only provide carry-out (or overflow-out).
However, clang has a builtin with carry-in and carry-out:
sum = __builtin_addcll(s1, s2, c_in, &c_out)
Unfortunately, the code produced by clang is pretty horrible for ARM
A64 and AMD64:
ARM A64: # clang 11.0.1 -Os
adds x9, x9, x10
cset w10, hs
adds x9, x9, x8
cset w8, hs
orr w8, w10, w8
AMD64: # clang 14.0.6 -march=x86-64-v4 -Os
addq (%rdx,%r8,8), %r9
setb %r10b
addq %rax, %r9
setb %al
orb %r10b, %al
movzbl %al, %eax
For RISC-V the code is a five-instruction sequence, which is the
minimum that's possible on RISC-V.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Michael S <already5chosen@yahoo.com> writes:
On Thu, 13 Nov 2025 09:24:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Actually, with uint128_t you get pretty far, and _BitInt(bits) has
been added in C23, which has good potential, but is not quite there.
Yes, that's what I wrote above.
As far as BGB is concerned, the big disadvantage is absence of support
by MSVC.
Why would that be a disadvantage? If MSVC does not do what he needs,
there are other C compilers to choose from.
Builtins for add-with-carry and intrinsics are somewhat disappointing.
- anton
For me the most disappointing part is that different architectures
have different spellings.
For intrinsics that's by design. They are essentially a way to write
assembly language instructions in Fortran or C. And assembly language
is compiler-specific.
{Pedantic mode=ON}
Assembly language is ASSEMBLER specific.
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.
Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
basically no good way to do extended precision arithmetic (essentially,
the ISA offers nothing more here than what C already gives you).
Like, you can do what is essentially:
c_lo = a_lo + b_lo;
c_hi = a_hi + b_hi;
if((c_lo<a_lo) || (c_lo<b_lo))
c_hi++;
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Michael S <already5chosen@yahoo.com> writes:
On Thu, 13 Nov 2025 09:24:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Actually, with uint128_t you get pretty far, and _BitInt(bits) has
been added in C23, which has good potential, but is not quite there.
Yes, that's what I wrote above.
As far as BGB is concerned, the big disadvantage is absence of support
by MSVC.
Why would that be a disadvantage? If MSVC does not do what he needs,
there are other C compilers to choose from.
Builtins for add-with-carry and intrinsics are somewhat disappointing.
- anton
For me the most disappointing part is that different architectures
have different spellings.
For intrinsics that's by design. They are essentially a way to write
assembly language instructions in Fortran or C. And assembly language
is compiler-specific.
{Pedantic mode=ON}
Assembly language is ASSEMBLER specific.
What I wanted to write was "And assembly language is
architecture-specific".
It's the builtin functions that are compiler-specific.
- anton
BGB <cr88192@gmail.com> writes:
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.
What makes you think so? It has certainly worked every time I tried
it. E.g., Gforth's "configure" reports:
checking size of __int128_t... 16
checking size of __uint128_t... 16
[...]
checking for a C type for double-cells... __int128_t
checking for a C type for unsigned double-cells... __uint128_t
That's with gcc 10.3.1
Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
basically no good way to do extended precision arithmetic (essentially, >the ISA offers nothing more here than what C already gives you).
Like, you can do what is essentially:
c_lo = a_lo + b_lo;
c_hi = a_hi + b_hi;
if((c_lo<a_lo) || (c_lo<b_lo))
c_hi++;
You only need to check for c_lo<a_lo (or for c_lo<b_lo), they will
either both be true or both be false.
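In C, the corrected idiom is simply:

  c_lo = a_lo + b_lo;
  c_hi = a_hi + b_hi + (c_lo < a_lo);  /* unsigned wraparound => carry */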
Here's 128-bit arithmetic on RV64GC (and very similar on MIPS and
Alpha):
add a4,a4,a5
sltu a5,a4,a5
add s8,s8,s9
add s9,a5,s8
RISC-V (and MIPS and Alpha) becomes really bad when you need add with
carry-in and carry-out (five instructions).
- anton
On 11/13/2025 3:58 PM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.
What makes you think so? It has certainly worked every time I tried
it. E.g., Gforth's "configure" reports:
checking size of __int128_t... 16
checking size of __uint128_t... 16
[...]
checking for a C type for double-cells... __int128_t
checking for a C type for unsigned double-cells... __uint128_t
That's with gcc 10.3.1
Hmm...
Seems so.
Testing again, it does appear to work; the error message I thought I
remembered seeing actually applied to trying to use the type in
MSVC. I had thought I remembered checking before and it failing, but
it seems not.
But, yeah, good to know I guess.
As for MSVC:
tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
keyword not supported on this architecture
Never got around to adding a 3R ADDC (and as-is is basically the same
idiom as carried over from SH-4).
On XG3, the latter is no longer formally allowed (partly for consistency
with RISC-V), but nothing technically prevents it (support for SR.T and
predication was demoted to optional, and currently not enabled by default).
Could maybe still make sense to add a 3R ADDC though at some point, as
it could help with 256-bit arithmetic (and 256-bit stuff is not
addressed by ALUX).
Does make me wonder if similar ideas could apply to things like software
and CPU architecture. Like, possible higher peaks that could potentially
lead to significant improvements in performance or capability, but
nothing can reach them as there is a "valley of suck" in the way.
It is possible to use an approach similar to double-dabble (feeding in
the binary number 1 bit at a time, and adding the decimal vector to
itself and incrementing for each 1 bit seen). But, alas, this is also
slow in this case (takes around 128 iterations to convert the Int128 to
4x 10e9). Though, still slightly faster than using a shift-subtract
divider to crack off 9 digit chunks by successively dividing by 1000000000.
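A sketch of that bit-at-a-time conversion (names are illustrative;
five radix-1e9 digits cover the full 128-bit range, while a 34-digit
mantissa needs only four):

  void u128_to_r1e9(uint64_t hi, uint64_t lo, uint32_t dig[5])
  {
      int i, j;
      for (j = 0; j < 5; j++) dig[j] = 0;
      for (i = 127; i >= 0; i--) {
          /* dig = dig*2 + next bit, carried across radix-1e9 digits */
          uint32_t carry = (i >= 64) ? ((hi >> (i - 64)) & 1)
                                     : ((lo >> i) & 1);
          for (j = 0; j < 5; j++) {
              uint32_t t = dig[j] * 2 + carry;
              carry = (t >= 1000000000u);
              dig[j] = carry ? t - 1000000000u : t;
          }
      }
  }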
Or, maybe make another attempt at Radix-10e9 long division and see if I
can get it to actually work and give the correct result.
Though, might be worthwhile, since if I could make the DIV operator
faster, I could claim a result of "faster than IBM's decNumber library".
Even if in practice it might still be moot, as it is still impractically slow if compared with Binary128.
BGB <cr88192@gmail.com> writes:
Never got around to adding a 3R ADDC (and as-is is basically the same
idiom as carried over from SH-4).
On XG3, the latter is no longer formally allowed (partly for consistency
with RISC-V), but nothing technically prevents it (support for SR.T and
predication was demoted to optional, and currently not enabled by default).
Could maybe still make sense to add a 3R ADDC though at some point, as
it could help with 256-bit arithmetic (and 256-bit stuff is not
addressed by ALUX).
In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
adding a carry bit and overflow bit to every GPR of an architecture.
To make it concrete how that would affect the instruction set, I
propose such an instruction set extension for RISC-V. It contains the instructions
addc rd, rs1, rs2
which adds the carry bit of rs2 to the 65-bit (i.e., including the
carry bit) data in rs1. The other instruction I proposed is
bo rs1, rs2, target
which branches if the overflow bit of rs1 or rs2 are set (why check
two registers? Because it fits in the RISC-V conditional branch
instruction scheme).
A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows
add a3,a1,a2
add b3,b1,b2
addc b3,b3,a3
add c3,c1,c2
addc c3,c3,b3
add d3,d1,d2
addc d3,d3,c3
with 4 cycles latency. addc is limited to having two source registers
(RV64G instructions all have this limit). The decoder could combine a
pair of add and addc instructions into one three-source
macro-instruction. Alternatively, one could add a three-source
instruction addc4 (VAX-inspired naming) to the instruction set, and
maybe include subc4 as well.
Does make me wonder if similar ideas could apply to things like software
and CPU architecture. Like, possible higher peaks that could potentially
lead to significant improvements in performance or capability, but
nothing can reach them as there is a "valley of suck" in the way.
Network effects favour incumbents, and network effects are strong in
computer architecture for general-purpose processors. Sometimes I
think that it's a miracle that we have seen the progress in computer architecture that we have seen:
1) We used to have a de-facto standard of 36-bit word-addressed
machines (ok, there were character-addressed and digit-addressed
machines at the time, too), and it has been superseded by a
standard of 8-bit-byte-addressed machines with word size 16 bits, 32
bits, or 64 bits. The mechanism here seems to have been that most
of the 36-bit machines had 18-bit addresses, and, as Gordon Bell
wrote, running out of address bits spells doom for an architecture.
2) At one point (late 1980s) it looked like big-endian would win
(almost all workstations at the time, with DEC stuff being the
exception that proved the rule), but eventually little-endian won,
thanks to PCs (which inherited the Datapoint 2200 byte order) and
smart phones (which inherited the 6502 byte order).
Another, less surprising development is that trapping on unaligned
accesses is dying out in general-purpose machines. In the 1980s and
1990s most architectures trapped on unaligned accesses. But that's a "feature" that almost no software relies on, so there are no network
effects in its favour. OTOH, porting software from an architecture
that performs unaligned accesses is easier to architectures that
perform unaligned accesses. So eventually all general-purpose
architectures have converted to performing unaligned accesses, or died
out. One can see this progression already in S/360->S/370.
- anton
In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
adding a carry bit and overflow bit to every GPR of an architecture.
To make it concrete how that would affect the instruction set, I
propose such an instruction set extension for RISC-V. It contains the instructions
BGB wrote:
It is possible to use an approach similar to double-dabble (feeding in
the binary number 1 bit at a time, and adding the decimal vector to
itself and incrementing for each 1 bit seen). But, alas, this is also
slow in this case (takes around 128 iterations to convert the Int128
to 4x 10e9). Though, still slightly faster than using a shift-subtract
divider to crack off 9 digit chunks by successively dividing by
1000000000.
Or, maybe make another attempt at Radix-10e9 long division and see if
I can get it to actually work and give the correct result.
I used division by 1e9 to extract groups of 9 digits from the binary
result I got when calculating pi with arbitrary precision, back then (on
a 386) I did it with the obvious edx:eax / 1e9 (in ebx) -> remainder
(edx) and result (eax) in a loop, which was fast enough for something
I only needed to do once.
Today, with 64-bit cpus, why not use a reciprocal mul to get a value
that cannot be too high, save the result, then back-multiply and subtract?
Any off-by-one error will be caught by the next iteration.
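A sketch of that in C, using the 64x64->128 high multiply (M =
floor(2^64 / 1e9), so the estimate can never be too high and the
fix-up loop runs at most a couple of times; __uint128_t is the
gcc/clang extension type):

  uint64_t div1e9(uint64_t x, uint64_t *rem)
  {
      const uint64_t M = 18446744073ull;          /* floor(2^64 / 1e9) */
      uint64_t q = (uint64_t)(((__uint128_t)x * M) >> 64);
      uint64_t r = x - q * 1000000000ull;         /* back-multiply, subtract */
      while (r >= 1000000000ull) { q++; r -= 1000000000ull; }
      *rem = r;
      return q;
  }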
Though, might be worthwhile, since if I could make the DIV operator
faster, I could claim a result of "faster than IBM's decNumber library".
:-)
Even if in practice it might still be moot, as it is still
impractically slow if compared with Binary128.
Right.
Terje
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
{Pedantic mode=ON}
Assembly language is ASSEMBLER specific.
What I wanted to write was "And assembly language is
architecture-specific".
foo_:
add DWORD PTR [rdi], 1
ret
and
foo_:
addl $1, (%rdi)
ret
are written in two different assembly languages, yet have the same
meaning when compiled.
It's the builtin functions that are compiler-specific.
Also, not really. For x86, Intel defines them, and other
compilers like gcc follow suit.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
In "Extending General-Purpose Registers with Carry and Overflow Bits"
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
adding a carry bit and overflow bit to every GPR of an architecture.
Which does nothing for MUL and DIV, while creating complications for
LD/ST if you want to maintain the 66-bit illusion of a GPR through
memory.
A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows
add a3,a1,a2
add b3,b1,b2
addc b3,b3,a3
add c3,c1,c2
addc c3,c3,b3
add d3,d1,d2
addc d3,d3,c3
with 4 cycles latency.
CARRY in My 66000 essentially provides an accumulator for a few instructions
that supply more operands to and receives another result from a calculation.
Most multiprecision calculation sequences are perfectly happy with another
register used as an accumulator.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
In "Extending General-Purpose Registers with Carry and Overflow Bits"
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
adding a carry bit and overflow bit to every GPR of an architecture.
Which does nothing for MUL and DIV, while creating complications for
LD/ST if you want to maintain the 66-bit illusion of a GPR through
memory.
I don't think the benefit is worth the cost, as do you, because you
support your CARRY functionality only in very limited sequences. So
storing stores only 64 bits, and loading only loads those bits, and
sets carry and overflow to no overflow.
A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows
add a3,a1,a2
add b3,b1,b2
addc b3,b3,a3
add c3,c1,c2
addc c3,c3,b3
add d3,d1,d2
addc d3,d3,c3
with 4 cycles latency.
CARRY in My 66000 essentially provides an accumulator for a few instructions
that supply more operands to and receives another result from a calculation.
Most multiprecision calculation sequences are perfectly happy with another
register used as an accumulator.
How does a four-input 2048-bit-addition look with your CARRY? For GPRs-with-flags it would look as follows:
L:
ld xn, (xp)
ld yn, (yp)
ld zn, (zp)
ld tn, (tp)
add rn, xn, yn
addc rn, rn, rm
add sn, zn, tn
addc sn, sn, sm
add vn, rn, sn
addc vn, vn, vm
sd vn, (vp)
.. #mov rn, sn, vn to rm, sm, vm
.. #increment xp yp zp tp vp
.. #loop control and branch back to L:
- anton
<snip>
Sidetracking a bit here.
In "Extending General-Purpose Registers with Carry and Overflow Bits"
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
adding a carry bit and overflow bit to every GPR of an architecture.
To make it concrete how that would affect the instruction set, I
propose such an instruction set extension for RISC-V. It contains the
instructions
There are 64-regs in Qupls with four flag bits,
Robert Finch <robfi680@gmail.com> writes:
<snip>
Sidetracking a bit here.
In "Extending General-Purpose Registers with Carry and Overflow Bits"
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
adding a carry bit and overflow bit to every GPR of an architecture.
To make it concrete how that would affect the instruction set, I
propose such an instruction set extension for RISC-V. It contains the
instructions
There are 64-regs in Qupls with four flag bits,
What other flags do you use?
A common set of flags is NZCV. Of these N and Z can be generated from
the 64 ordinary bits (actually N is the MSB of these bits).
You might also want NZCV for 32-bit instructions; in that case all
flags are derivable from the 64 ordinary bits of the GPR, but you may
need additional branch instructions: instructions that
check only if the bottom 32 bits are 0 (Z), if bit 31 is 1 (N), if bit
32 is 1 (C), or if bit 32 is different from bit 31 (V).
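Concretely, if 32-bit operands are widened to 64 bits before a
full-width add (zero-extended for the C view, sign-extended for the V
view), those checks on the 64-bit result r are just (a sketch):

  int z32 = ((uint32_t)r == 0);            /* Z: low 32 bits all zero */
  int n32 = (r >> 31) & 1;                 /* N: bit 31 */
  int c32 = (r >> 32) & 1;                 /* C: bit 32, zero-extended inputs */
  int v32 = ((r >> 32) ^ (r >> 31)) & 1;   /* V: bit 32 != bit 31,
                                              sign-extended inputs */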
Concerning saving the extra bits across interrupts, yes, this has to
be adapted to the actual architecture, and there are many ways to skin
this cat. I just outlined one to give an idea how this can be done.
- anton
MitchAlsup wrote:
CARRY in My 66000 essentially provides an accumulator for a few instructions
that supply more operands to and receives another result from a calculation.
Most multiprecision calculation sequences are perfectly happy with another register used as an accumulator.
I think I've said so before, but it bears repeating:
I _really_ love CARRY!
It provides a lot of "missing link" operations, while adding zero extra
bits to all the instructions that don't need it.
That said, if I had infinite resources (in this case infinity == 4
sources), I would like to have an unsigned integer MulAddAdd like this:
(hi, lo) = a*b+c+d
simply because this is the largest possible building block that cannot overflow, the result range covers the full 128 bit space.
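The bound is easy to check: with 64-bit inputs, a*b + c + d <=
(2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. A sketch using gcc's __uint128_t:

  void muladdadd(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
                 uint64_t *hi, uint64_t *lo)
  {
      __uint128_t t = (__uint128_t)a * b + c + d;  /* cannot wrap */
      *hi = (uint64_t)(t >> 64);
      *lo = (uint64_t)t;
  }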
From what you've taught us about multipliers, adding one (or in this
case two) extra inputs to the adder that aggregates all the partial multiplication products will be close to free in time, but the routing
of the extra set of inputs might require an extra cycle?
Terje
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
A common set of flags is NZCV. Of these N and Z can be generated from
the 64 ordinary bits (actually N is the MSB of these bits).
You might also want NCZV of 32-bit instructions, but in that case all
flags are derivable from the 64 ordinary bits of the GPR; but in that
case you may need additional branch instructions: Instructions that
check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
32 is 1 (C), or if bit 32 is different from bit 31 (V).
If you write an architectural rule whereby every integer result is
"proper" one set of bits {top, bottom, dispersed} covers everything.
Proper means that all the bits in the register are written but the
value written is range limited to {Sign}×{Size} of the calculation.
Concerning saving the extra bits across interrupts, yes, this has to
be adapted to the actual architecture, and there are many ways to skin
this cat. I just outlined one to give an idea how this can be done.
On the other hand, with CARRY, none of those bits are needed.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
(hi, lo) = a*b+c+d
Alas:: the best CARRY can do is:
{hi,c} = a*b+hi
simply because this is the largest possible building block that cannot
overflow, the result range covers the full 128 bit space.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:[...]
My point was that 1 bit of carry is not enough when MUL and DIV need
more than 64 bits--and that is the issue CARRY addresses. In addition,
multi-width shifts also require <essentially> a whole register of width.
How does a four-input 2048-bit-addition look with your CARRY? For
GPRs-with-flags it would look as follows:
L:
ld xn, (xp)
ld yn, (yp)
ld zn, (zp)
ld tn, (tp)
add rn, xn, yn
addc rn, rn, rm
add sn, zn, tn
addc sn, sn, sm
add vn, rn, sn
addc vn, vn, vm
sd vn, (vp)
.. #mov rn, sn, vn to rm, sm, vm
.. #increment xp yp zp tp vp
.. #loop control and branch back to L:
//pretty close to::
MOV R12,#0
VEC R7,{}
LDD R8,[Rx,Ri<<3]
LDD R9,[Ry,Ri<<3]
LDD R10,[Rz,Ri<<3]
LDD R11,[Rt,Ri<<3]
CARRY R12,{{IO}{IO}{IO}}
ADD R13,R8,R9
ADD R14,R10,R11
ADD R14,R14,R13
STD R14,[Rv,Ri<<3]
LOOP R7,LT,#1,#32
ERROR "unexpected byte sequence starting at index 853: '\xC3'" while decoding:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
A common set of flags is NZCV. Of these N and Z can be generated from
the 64 ordinary bits (actually N is the MSB of these bits).
You might also want NCZV of 32-bit instructions, but in that case all
flags are derivable from the 64 ordinary bits of the GPR; but in that
case you may need additional branch instructions: Instructions that
check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
32 is 1 (C), or if bit 32 is different from bit 31 (V).
If you write an architectural rule whereby every integer result is
"proper" one set of bits {top, bottom, dispersed} covers everything.
Proper means that all the bits in the register are written but the
value written is range limited to {Sign}×{Size} of the calculation.
I have no idea what you mean with "one set of bits {top, bottom,
dispersed}".
As for "proper": Does this mean that one would have to have add(c),
sub(c), mul (madd etc.), shift right and shift left (did I forget
anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if you
specify in the operation which kind of Z, C/V, and maybe N you are
interested in, you do not need to specify it in the branch that checks
that result; you also eliminate the sign-extension and zero-extension operations that we discussed some time ago.
But given that the operations are much more frequent than branches,
encoding that information in the branches uses less space (for shift
right, the sign is usually included in the operation). It's
interesting that AFAIK there are instruction sets (e.g., Power) that
just have one full-width sign-agnostic add, and do not have
width-specific flags, either. So when compiling stuff like
if (a[1]+a[2] == 0) /* unsigned a[] */
a width-specific compare instruction provides that information. But
gcc generates a compare instruction even when a[] is "unsigned long",
so apparently add does not set the flags on addition anyway (and if
there is an add that sets flags, it is not used by gcc for this code).
Another case is SPARC v9, which tends to set flags. For
if ((a[1]^a[2]) < 0)
I see:
long a[]                          int a[]
ldx [ %i0 + 8 ], %g1              ld [ %i0 + 4 ], %g2
ldx [ %i0 + 0x10 ], %g2           ld [ %i0 + 8 ], %g1
xor %g1, %g2, %g1                 xorcc %g2, %g1, %g0
brlz,pn %g1, 24 <foo+0x24>        bl,a,pn %icc, 20 <foo+0x20>
Reading up on SPARC v9, it has two sets of condition codes: 32-bit
(icc) and 64-bit (xcc), and every instruction that sets condition
codes (e.g., xorcc) sets both.
In the present case, the 32-bit
sequence sets the ccs and then checks icc, while the 64-bit sequence
does not set the ccs, and instead uses a branch instruction that
inspects an integer register (%g1). These branch instructions all
work for the full 64 bits, and do not provide a way to check a 32-bit
result. In the present case, an alternate way to use brlz for the
32-bit case would have been:
ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
ldsw [ %i0 + 0x10 ], %g2
xor %g1, %g2, %g1
brlz,pn %g1, 24 <foo+0x24>
because the xor of two sign-extended data is also a correct
sign-extended result, but instead gcc chose to use xorcc and bl %icc.
There are many ways to skin this cat.
Concerning saving the extra bits across interrupts, yes, this has to
be adapted to the actual architecture, and there are many ways to skin
this cat. I just outlined one to give an idea how this can be done.
On the other hand, with CARRY, none of those bits are needed.
But the mechanism of CARRY is quite a bit more involved: Either store
the carry in a GPR at every step, or have another mechanism inside a
CARRY block. And either make the CARRY block atomic or have some way
to preserve the fact that there is this prefix across interrupts and
(worse) synchronous traps.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
(hi, lo) = a*b+c+d
Alas:: the best CARRY can do is:
{hi,c} = a*b+hi
What latency?
simply because this is the largest possible building block that cannot
overflow, the result range covers the full 128 bit space.
With the carry in the result GPR, you could achieve that as follows:
add t,c,d
umaddc hi,lo,a,b,t
(or split umaddc into an instruction that produces the low result and
one that produces the high result).
The disadvantage here is that, with d being the hi of the last
iteration, you will see the full latency of the add and the umaddh.
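In C, both the no-overflow point and the shape of the hypothetical
umaddc can be seen in this sketch (umaddc is the proposed instruction
above, not an existing opcode):
#include <stdint.h>
/* (2^64-1)^2 + 2*(2^64-1) == 2^128 - 1, so a*b + c + d always fits
   in 128 bits. */
void umaddc(uint64_t *hi, uint64_t *lo,
            uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    unsigned __int128 p = (unsigned __int128)a * b + c + d;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}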
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
ERROR "unexpected byte sequence starting at index 853: '\xC3'" while decoding:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
A common set of flags is NZCV. Of these N and Z can be generated from
the 64 ordinary bits (actually N is the MSB of these bits).
You might also want NCZV of 32-bit instructions, but in that case all
flags are derivable from the 64 ordinary bits of the GPR; but in that
case you may need additional branch instructions: Instructions that
check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
32 is 1 (C), or if bit 32 is different from bit 31 (V).
If you write an architectural rule whereby every integer result is
"proper" one set of bits {top, bottom, dispersed} covers everything.
Proper means that all the bits in the register are written but the
value written is range limited to {Sign}×{Size} of the calculation.
I have no idea what you mean with "one set of bits {top, bottom,
dispersed}".
typedef struct { uint64_t reg;
                 uint8_t bits: 4; } gpr;   /* extra bits at the top */
or
typedef struct { uint8_t bits: 4;
                 uint64_t reg; } gpr;      /* extra bits at the bottom */
or
typedef struct { uint16_t reg0;
                 uint8_t bit0: 1;
                 uint16_t reg1;
                 uint8_t bit1: 1;
                 uint16_t reg2;
                 uint8_t bit2: 1;
                 uint16_t reg3;
                 uint8_t bit3: 1; } gpr;   /* one bit per 16-bit chunk:
                                              dispersed */
Did you lose every brain-cell of imagination ?!?
As for "proper": Does this mean that one would have to have add(c),
sub(c), mul (madd etc.), shift right and shift left (did I forget
anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if you
specify in the operation which kind of Z, C/V, and maybe N you are
interested in, you do not need to specify it in the branch that checks
that result; you also eliminate the sign-extension and zero-extension
operations that we discussed some time ago.
{s8, s16, s32, s64, u8, u16, u32, u64} yes.
But given that the operations are much more frequent than branches,
encoding that information in the branches uses less space (for shift
right, the sign is usually included in the operation). It's
Which is why I don't have ANY of those extra bits.
interesting that AFAIK there are instruction sets (e.g., Power) that
just have one full-width sign-agnostic add, and do not have
width-specific flags, either. So when compiling stuff like
if (a[1]+a[2] == 0) /* unsigned a[] */
a width-specific compare instruction provides that information. But
gcc generates a compare instruction even when a[] is "unsigned long",
so apparently add does not set the flags on addition anyway (and if
there is an add that sets flags, it is not used by gcc for this code).
Another case is SPARC v9, which tends to set flags. For
if ((a[1]^a[2]) < 0)
I see:
long a[]                          int a[]
ldx [ %i0 + 8 ], %g1              ld [ %i0 + 4 ], %g2
ldx [ %i0 + 0x10 ], %g2           ld [ %i0 + 8 ], %g1
xor %g1, %g2, %g1                 xorcc %g2, %g1, %g0
brlz,pn %g1, 24 <foo+0x24>        bl,a,pn %icc, 20 <foo+0x20>
Reading up on SPARC v9, it has two sets of condition codes: 32-bit
(icc) and 64-bit (xcc), and every instruction that sets condition
codes (e.g., xorcc) sets both.
Another reason its death is helpful to comp.arch
In the present case, the 32-bit
sequence sets the ccs and then checks icc, while the 64-bit sequence
does not set the ccs, and instead uses a branch instruction that
inspects an integer register (%g1). These branch instructions all
work for the full 64 bits, and do not provide a way to check a 32-bit
result. In the present case, an alternate way to use brlz for the
32-bit case would have been:
ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
ldsw [ %i0 + 0x10 ], %g2
xor %g1, %g2, %g1
brlz,pn %g1, 24 <foo+0x24>
because the xor of two sign-extended data is also a correct
sign-extended result, but instead gcc chose to use xorcc and bl %icc.
There are many ways to skin this cat.
Sure:: close to 20-ways, less than 4 of them are "proper".
Concerning saving the extra bits across interrupts, yes, this has to
be adapted to the actual architecture, and there are many ways to skin
this cat. I just outlined one to give an idea how this can be done.
On the other hand, with CARRY, none of those bits are needed.
But the mechanism of CARRY is quite a bit more involved: Either store
the carry in a GPR at every step, or have another mechanism inside a
CARRY block. And either make the CARRY block atomic or have some way
to preserve the fact that there is this prefix across interrupts and
(worse) synchronous traps.
During its "life" the bits used in CARRY are simply another feedback
path on the data-path. Afterwards, carry is written once. CARRY also
gets written when an exception is taken.
- anton
Finding it too difficult to support 128-bit operations using high, low
register pairs. Getting the reservation stations to pair up the
registers seems a bit scary. It would be much simpler to just have
128-bit registers and it appears as if it may not be any more logic.
Sparc v9 died?
Robert Finch <robfi680@gmail.com> writes:
Finding it too difficult to support 128-bit operations using high, low
register pairs. Getting the reservation stations to pair up the
registers seems a bit scary. It would be much simpler to just have
128-bit registers and it appears as if it may not be any more logic.
If you want to support 128-bit operations, using 128-bit registers
certainly is the way to go. Note how AMD used to split 128-bit SSE operations into 64-bit parts on 64-bit registers in the K8, split
256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
but they went away from that: In Zen4 512-bit operations are performed
in 256-bit-pieces, but the registers are 512 bits wide.
However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit operations, but multi-precision, which can be 256-bit for some crypto,
4096 bits for other crypto, or billions of bits for the stuff that
Alexander Yee is doing.
Sparc v9 died?
Oracle discontinued SPARC development in 2017; Fujitsu announced in
2016 that they would switch to ARM A64. Both Oracle and
Fujitsu released their last new SPARC CPU in 2017. Fujitsu has
released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
(2019) implement SPARC v8, not v9.
The MCST-R2000 (2018) implements SPARC v9, but will it have a
successor? And even if it has a successor, will it be available in
relevant numbers? MCST is not married to SPARC, despite their name;
they have worked on Elbrus 2000 implementations as well; Elbrus 2000
supports Elbrus VLIW and "Intel x86" instruction sets, and new models
were released in 2018, 2021, and 2025, so MCST now seems to focus on
that.
- anton
Skimming through the SPARC architecture manual I am wondering how they
handle register renaming with a windowed register file. If the register
window file is deep there must be a ginormous number of registers for
renaming. Would it need to keep track of the renames for all the
registers? How does it dump the rename state to memory?
On 2025-11-16 1:36 p.m., MitchAlsup wrote:
-------------------------------
During its "life" the bits used in CARRY are simply another feedback
path on the data-path. Afterwards, carry is written once. CARRY also
gets written when an exception is taken.
- anton
These posts have inspired me to keep working on the ISA. I am on a simplification mission.
The CARRY modifier is just a substitute for not having r3w2 port instructions directly in the ISA. Since Qupls ISA has room to support
some r3w2 instructions directly there is no need for CARRY, much as I
like the idea.
While not using a carry flag in the register, there is still a
capabilities bit, overflow bit and pointer bit plus four user assigned
bits. I decided to just have 72-bit register store and load instructions along with the usual 8,16,32 and 64.
Finding it too difficult to support 128-bit operations using high, low register pairs. Getting the reservation stations to pair up the
registers seems a bit scary.
It would be much simpler to just have
128-bit registers and it appears as if it may not be any more logic. The benefit of using register pairs is the internal busses need only be
64-bits then.
Sparc v9 died?
On 2025-11-17 3:33 a.m., Anton Ertl wrote:
-------------------------------
Skimming through the SPARC architecture manual I am wondering how they handle register renaming with a windowed register file. If the register window file is deep there must be a ginormous number of registers for renaming. Would it need to keep track of the renames for all the
registers? How does it dump the rename state to memory?
Tried to find some information on Elbrus. I got page not found a couple
of times. Other than that it’s a VLIW machine, I do not know much about it.
*****
I would like a machine able to process 128-bit values directly, but it
takes up too many resources. It is easier to make the register file deep
as opposed to wide. BRAM has a max 64-bit width. After that it takes
more BRAMs to get a wider port. I tried a 128-bit wide register file,
but it used about 200 BRAMs. Too many.
There are now 128 logical registers available in Qupls. It turns out
that the BRAM setup is 512 registers deep no matter whether there are
32,64 or 128 registers. So, may as well make them available.
Qupls reservation stations were set up with support for eight operands
(four for each half of a 128-bit register). The resulting logic was about 25,000 LUTs for just one RS. This is compared to about 5,000 LUTs when
there were just four operands. What gets implemented is considerably
less as most functional units do not need all the operands.
It may be resource efficient to use multiple reservation stations as
opposed to more operands in a single station. But then the operands need
to be linked together between stations. It may be possible using a hash
of the PC value and ROB entry number.
Qupls seems to have an implementation four or five times the size of the FPGA again. Back to the drawing board.
Robert Finch <robfi680@gmail.com> writes:
Skimming through the SPARC architecture manual I am wondering how they
handle register renaming with a windowed register file. If the register
window file is deep there must be a ginormous number of registers for
renaming. Would it need to keep track of the renames for all the
registers? How does it dump the rename state to memory?
There is no need to dump the rename state to memory, not for SPARC nor
for anything else. It's only microarchitectural.
The large number of architected registers may have been a reason why
they needed so long to implement OoO execution.
I think that the cost is typically a register allocation table (RAT) per
branch (for maybe 50 branches or potential traps that you want to
predict, i.e., 50 RATs).
With 32 architected registers and 257-512
physical registers that's 32*9 bits = 288 bits per RAT; with the 136 architected registers of SPARC, and again <=512 physical registers,
that would be 1224 bits per RAT.
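As a rough model of one checkpoint (the field sizes just follow the
arithmetic above; nothing here is from a real SPARC core):
#include <stdint.h>
/* arch reg -> phys reg map, 9 bits used per entry (<=512 physical) */
typedef struct {
    uint16_t map[32];    /* 32 arch regs x 9 bits = 288 payload bits */
} rat_ckpt_32;
typedef struct {
    uint16_t map[136];   /* SPARC: 8 globals + 8 windows x 16 regs
                            = 136 entries x 9 bits = 1224 bits */
} rat_ckpt_sparc;
/* One checkpoint per predicted branch or potential trap; recovery is
   a single copy back into the live RAT. */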
There are probably other options than using a RAT, but I have
forgotten them.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Robert Finch <robfi680@gmail.com> writes:
Skimming through the SPARC architecture manual I am wondering how
they handle register renaming with a windowed register file. If
the register window file is deep there must be a ginormous number
of registers for renaming. Would it need to keep track of the
renames for all the registers? How does it dump the rename state
to memory?
I don't remember SPARC ever getting OoO. The windowed register file
is but one cause.
Robert Finch <robfi680@gmail.com> posted:
-------------------------------
I would like a machine able to process 128-bit values directly, but it
takes up too many resources. It is easier to make the register file deep
as opposed to wide. BRAM has a max 64-bit width. After that it takes
more BRAMs to get a wider port. I tried a 128-bit wide register file,
but it used about 200 BRAMs. Too many.
There are now 128 logical registers available in Qupls. It turns out
that the BRAM setup is 512 registers deep no matter whether there are
32,64 or 128 registers. So, may as well make them available.
Can you read BRAM 2× or 4× per CPU cycle ?!?
Qupls reservation stations were set up with support for eight operands
(four for each half of a 128-bit register). The resulting logic was about
25,000 LUTs for just one RS. This is compared to about 5,000 LUTs when
there were just four operands. What gets implemented is considerably
less as most functional units do not need all the operands.
Ok, you found one way NOT to DO IT.
It may be resource efficient to use multiple reservation stations as
opposed to more operands in a single station. But then the operands need
to be linked together between stations. It may be possible using a hash
of the PC value and ROB entry number.
Allow me to dissuade you from this.
Qupls seems to have an implementation four or five times the size of the
FPGA again. Back to the drawing board.
Live within your means.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
There is no need to dump the rename state to memory, not for SPARC nor
for anything else. It's only microarchitectural.
It does need to be checkpointed if/when going OoO.
With 32 architected registers and 257-512
physical registers that's 32*9 bits = 288 bits per RAT; with the 136
architected registers of SPARC, and again <=512 physical registers,
that would be 1224 bits per RAT.
Register files with more than 128 entries become big and especially SLOW.
The first production OoO SPARC was the HAL SPARC64, manufactured for
Fujitsu on Fujitsu's own fabs back in 1995, so a contemporary of the
PPro. It was a 4-die chipset.
HAL SPARC64-GP was the first single-chip implementation in 1997.
https://en.wikipedia.org/wiki/HAL_SPARC64
The line was continued by Fujitsu:
https://en.wikipedia.org/wiki/SPARC64_V
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-16 1:36 p.m., MitchAlsup wrote:
-------------------------------
During its "life" the bits used in CARRY are simply another feedback
path on the data-path. Afterwards, carry is written once. CARRY also
gets written when an exception is taken.
- anton
These posts have inspired me to keep working on the ISA. I am on a
simplification mission.
The CARRY modifier is just a substitute for not having r3w2 port
instructions directly in the ISA. Since Qupls ISA has room to support
some r3w2 instructions directly there is no need for CARRY, much as I
like the idea.
That is correct at the 95% level.
While not using a carry flag in the register, there is still a
capabilities bit, overflow bit and pointer bit plus four user assigned
bits. I decided to just have 72-bit register store and load instructions
along with the usual 8,16,32 and 64.
Finding it too difficult to support 128-bit operations using high, low
register pairs. Getting the reservation stations to pair up the
registers seems a bit scary.
It IS scary and hard and tricky to get right.
It would be much simpler to just have
128-bit registers and it appears as if it may not be any more logic. The
benefit of using register pairs is the internal busses need only be
64-bits then.
Almost exactly what we did in Mc 88120 when facing the same problem.
Except we kept the 32-bit model and had register files 2 registers
tall {even, odd},{odd even} so any register specifier would simply
read out the status and values of both registers and then let the
stations handle the sundry problems.
Sparc v9 died?
What was the last year SPARC sold more than 100,000 CPUs ??
BGB <cr88192@gmail.com> schrieb:
Pretty sure SPARC is good and dead at this point...
Almost, but not quite. I still have login on a couple of SPARC
machines:
On 11/17/2025 1:49 AM, Robert Finch wrote:
-------------------------------
Finding it too difficult to support 128-bit operations using high, low
register pairs. Getting the reservation stations to pair up the
registers seems a bit scary. It would be much simpler to just have
128-bit registers and it appears as if it may not be any more logic.
The benefit of using register pairs is the internal busses need only
be 64-bits then.
I went with pairs, but I guess maybe pairs are a lot easier for in-order than OoO.
Sparc v9 died?
Pretty sure SPARC is good and dead at this point...
Many others in this space are not far behind.
Basically, anything remaining needs to compete against ARM and RISC-V
(the latter of which is making an unexpectedly rapid rise in mind-share
and prominence...).
Is the need for backwards compatibility killing things as technology has improved?
There seems to be a lot more known good/bad approaches making
me think that the lifetime of newer designs could be longer.
environment evolution
There seems to be a lot more known good/bad approaches making
me think that the lifetime of newer designs could be longer.
Yes, but the people making the decisions are still too young to have
the history needed to make better decisions.
The graduates of major universities go right out and start designing
without being exposed to "enough" of the disease of computer architecture
to be in a position to understand why feature.X of arch.Y was bad overall,
or why feature.X of architecture.Y was not enough to save it.
Each generation reaches employment after university at about the same
level as we did when we invented RISC.
BGB <cr88192@gmail.com> posted:
On 11/13/2025 3:58 PM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.
What makes you think so? It has certainly worked every time I tried
it. E.g., Gforth's "configure" reports:
checking size of __int128_t... 16
checking size of __uint128_t... 16
[...]
checking for a C type for double-cells... __int128_t
checking for a C type for unsigned double-cells... __uint128_t
That's with gcc 10.3.1
Hmm...
Seems so.
Testing again, it does appear to work; the error message I thought I
remembered seeing instead applied to trying to use the type in
MSVC. I had thought I remembered checking before and it failing, but it
seems not.
But, yeah, good to know I guess.
As for MSVC:
tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
keyword not supported on this architecture
ERRRRRRR:: not supported by this compiler, the architecture has
ISA level support for doing this, but the compiler does not allow
you access.
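A minimal probe for anyone wanting to check a given compiler (plain
GCC/Clang extension; MSVC rejects it with the error quoted above):
#include <stdio.h>
int main(void)
{
    __int128 x = (__int128)1 << 100;   /* needs more than 64 bits */
    printf("%d\n", (int)(x >> 100));   /* prints 1 */
    return 0;
}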
Power's not dead, either, if very highly priced.
MIPS is still
being sold, apparently.
As for RISC-V,
I am not sure how much business they actually generate compared
to others.
Is the need for backwards compatibility killing things as technology has
improved?
I recently heard that CS graduates from ETH Zürich had heard about
pipelines, but thought it was fetch-decode-execute.
They also did not know about DEC or the VAX. Sic transit gloria
mundi...
Michael S <already5chosen@yahoo.com> posted:
Not really.
That is, conversions are not blazingly fast, but still much better
than any attempt to divide in any form of decimal. And helps to
preserve your sanity.
Are you trying to pull our proverbial leg here ?!?
On Thu, 13 Nov 2025 19:04:18 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
Not really.
That is, conversions are not blazingly fast, but still much better
than any attempt to divide in any form of decimal. And helps to
preserve your sanity.
Are you trying to pull our proverbial leg here ?!?
After reading paragraph 5.2 of the IEEE-754-2008 standard I am less sure about the correctness of my above statement.
For the case of exact division, preservation of mental sanity during fulfillment of requirements of this paragraph is far from simple,
regardless of numeric base used in the process.
On 11/21/2025 7:31 AM, Michael S wrote:
On Thu, 13 Nov 2025 19:04:18 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
Not really.
That is, conversions are not blazingly fast, but still much better
than any attempt to divide in any form of decimal. And helps to
preserve your sanity.
Are you trying to pull our proverbial leg here ?!?
After reading paragraph 5.2 of the IEEE-754-2008 standard I am less sure
about the correctness of my above statement.
For the case of exact division, preservation of mental sanity during
fulfillment of requirements of this paragraph is far from simple,
regardless of numeric base used in the process.
One effectively needs to do a special extra-wide divide rather than just
a normal integer divide, etc.
But, yeah, the fastest I had gotten in my experiments was radix-10e9 long-division, but still not the fastest possible option.
So, rough ranking, fast to slow:
Radix-10e9 Long Divide (fastest)
Newton-Raphson
Radix-10 Long Divide
Integer Shift-Subtract with converters (slowest).
Fastest converter strategy ATM:
Radix-10e9 double-dabble (Int->Dec).
MUL-by-10e9 and ADD (Dec->Int)
Fastest strategy: Unrolled Shifts and ADDs (*1).
*1: While it is possible to perform a 128-bit multiply by decomposing it into 32-bit parts and adding the partial products together, it was working out slightly faster in this case to do a fixed multiply by decomposing it
into a series of explicit shifts and ADDs.
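For the Dec->Int direction, a sketch of the "MUL-by-10e9 and ADD"
loop (assuming base-10^9 limbs stored most significant first; the
limb layout is my guess, not necessarily BGBCC's):
#include <stdint.h>
unsigned __int128 dec_to_int(const uint32_t *limb, int n)
{
    unsigned __int128 v = 0;
    for (int i = 0; i < n; i++)
        v = v * 1000000000u + limb[i];  /* the *10^9 step is what the
                                           shifts and adds implement */
    return v;
}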
Though, in this case, it is faster (and less ugly) to decompose this
into a pattern of iteratively multiplying by smaller amounts. I had
ended up using 4x multiply by 100 followed by multiply by 10, which,
while not the fastest strategy, needs less code than 2x multiply by
10000 + multiply by 10. Most other patterns would need more shifts and adds.
In theory, x86-64 could do it better with multiply ops, but getting something optimal out of the C compilers is a bigger issue here it seems.
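A sketch of that decomposition, which works because 100^4 * 10 = 10^9
(unsigned __int128 for brevity; on a 64-bit register pair each shift
and add is itself a short carry chain):
#include <stdint.h>
static unsigned __int128 mul100(unsigned __int128 x)
{
    return (x << 6) + (x << 5) + (x << 2);   /* 100 = 64 + 32 + 4 */
}
static unsigned __int128 mul10(unsigned __int128 x)
{
    return (x << 3) + (x << 1);              /* 10 = 8 + 2 */
}
unsigned __int128 mul1e9(unsigned __int128 x)
{
    return mul10(mul100(mul100(mul100(mul100(x)))));
}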
Unexplored options:
Radix 10e2 (byte)
Radix 10e3 (word)
Radix 10e4 (word)
Radix 10e3 could have the closest to direct mapping to DPD.
Looking at the decNumber code, it appears also to be Radix-10e9 based.
They also do significant (ab)use of the C preprocessor.
Apparently, "Why use functions when you can use macros?"...
For the Radix-10e9 long-divide, part of the magic was in the function to scale a value by a radix value and subtract it from another array.
Ended up trying a few options, fastest was to temporarily turn the
operation into non-normalized 64-bit pieces and then normalize the
result (borrow propagation, etc) as an output step.
Initial attempt kept it normalized within the operation, which was slower.
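The deferred-normalization idea, roughly (a sketch of the technique
only; limb order and value ranges are my assumptions, not the actual
routine's interface):
#include <stdint.h>
/* dst -= q * src over base-10^9 limbs, dst[0] least significant.
   Each limb is updated as a signed 64-bit temporary without
   normalizing (10^9 * 10^9 < 2^63), then one borrow-propagation
   pass restores 0 <= dst[i] < 10^9. */
void scale_sub(int64_t *dst, const uint32_t *src, uint32_t q, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] -= (int64_t)q * src[i];       /* may go far negative */
    for (int i = 0; i < n - 1; i++) {        /* normalize on output */
        int64_t c = dst[i] >= 0
                  ?   dst[i] / 1000000000
                  : -((-dst[i] + 999999999) / 1000000000);
        dst[i]   -= c * 1000000000;          /* now in [0, 10^9) */
        dst[i+1] += c;                       /* signed carry/borrow */
    }
}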
It was seemingly compiler-dependent whether it was faster to do a
combined operation, or separate scale and subtract, but the margins were small. On MSVC the combined operation was slightly faster than the
separate operations.
...
Otherwise, after this, just went and fiddled with BGBCC some more,
adding more options for its resource converter.
Had before (for image formats):
In: TGA, BMP (various), PNG, QOI, UPIC
Out: BMP (various), QOI, UPIC
Added (now):
In: PPM, JPG, DDS
Out: PNG, JPG, DDS (DXT1 and DXT5)
Considered (not added yet):
PCX
Evaluated PCX, possible but not a clear win.
Fiddled with making the PNG encoder less slow; mostly this was tweaking
some parameters for the LZ searches. Initial settings were using deeper searches over initially smaller sliding windows (at lower compression levels); better in this case to do a shallower search over a max-sized sliding window.
ATM, speed of PNG is now on-par with the JPG encoder (still one of the slower options).
For simple use-cases, PNG still loses (in terms of both speed and compression) to 16-color BMP + LZ compression (LZ4 or RP2).
Theoretically, indexed-color PNG exists, but is less widely supported.
It is less space-efficient to represent 16-colors as Deflate-compressed color differences than it is to just represent the 4-bit RGBI values directly.
However, can note that the RLE compression scheme (used by PCX) is
clearly inferior to that of any sort of LZ compression.
Comparably, PNG is also a more expensive format to decode (even
vs JPEG).
UPIC can partly address the use-cases of both PNG and JPEG while being cheaper to decode than either, but more niche as pretty much nothing supports it. Some of its design and properties being mostly JPEG-like.
QOI is interesting, but suffers some similar limitations to PCX (its
design is mostly about more compactly encoding color-differences in true-color images and otherwise only offers RLE compression).
QOI is not particularly effective against images with little variety in color variation but lots of repeating patterns (I have a modified QOI
that does a little better here, still not particularly effective with 16-color graphics though).
Otherwise, also ended up adding a small text format for image drawing commands.
It is a simplistic line-oriented format containing various commands to
perform drawing operations or composite images; a hypothetical example
follows the list below.
creating a "canvas"
setting the working color
drawing lines
bucket fill
drawing text strings
overlaying other images
...
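Purely as illustration, a file in such a format might look like the
following (every command name here is invented; the actual syntax is
not shown in this thread):
# hypothetical syntax, for illustration only
canvas 320 200
color 15                  # working color: RGBI white
line 10 10 310 10
fill 160 100
text 16 24 "HELLO"
overlay icon16.bmp 4 4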
This is maybe (debatable) outside the scope of a C compiler, but could
have use-cases for preparing resource data (nevermind if scope creep is partly also turning it into an asset-packer tool; where it is useful to
make graphics/sounds/etc in one set of formats and then process and
convert them into another set of files, usually inside of some sort of
VFS image or similar).
Design is much more simplistic than something like SVG and I am
currently assuming its use for mostly hand-edited files. Unlike SVG, it
also assumes drawing to a pixel grid rather than some more abstract coordinate space (so, its abstract model is more like "MS Paint" or similar); also SVG would suck as a human-edited format.
Granted, one could argue that asset-processing should be its own
tool; then one converts its output to a format that the compiler
accepts (WAD2 or WAD4 in this case) prior to compiling the main binary
(and/or not use resource data).
Still, IMO, an internal WAD image is still better than the
horrid/unusable mess that Windows had used (where anymore most people don't
bother with the resource section much more than storing a program icon
or similar...).
But, realistically, one does still want to limit how much data they
stick into the EXE.
...
On 2025-11-21 2:36 p.m., BGB wrote:
-------------------------------
My forays into the world of graphics formats are pretty limited. I tend
to use libraries already written by other people. I assume people a lot brighter than myself have come up with them.
A while ago I wrote a set of graphics routines in assembler that were
quite fast. One format I have dealt with is the .flic file format used to
render animated graphics. I wanted to write my own CIV-style game. It
took a little bit of research and some reverse engineering. Apparently,
the authors used a modified version of the format, making it difficult to
use the CIV graphics in my own game. I never could get it to render as
fast as the game’s engine. I wrote the code for my game in C or C++; the
original game’s engine code was likely in a different language.
*****
Been working on vectors for the ISA. I split the vector length register
into eight sections to define up to eight different vector lengths. The first five are defined for integer, float, fixed, character, and address data types. I figure one may want to use vectors of different lengths at
the same time, for instance to address data using byte offsets, while
the data itself might be a float. The vector load / store instructions accept a data type to load / store and always use the address type for address calculations.
There is also a vector lane size register split up the same way. I had thought of giving each vector register its own format for length and
lane size. But thought that is a bit much, with limited use cases.
I think I can get away with only two load and two store instructions.
One to do a strided load and a second to do a vector indexed load (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
Where Rindex is used as the stride when scalar or as a supplier of the
lane offset when Rindex is a vector.
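A sketch of the two load flavours as described (the semantics are my
reading of the above, and the names are invented):
#include <stdint.h>
#include <string.h>
/* strided: lane[i] <- mem[d + base + i*stride*scale]
   indexed: lane[i] <- mem[d + base + vidx[i]*scale]   (gather) */
void vload_strided(uint64_t *lane, int vl, const uint8_t *mem,
                   int64_t d, int64_t base, int64_t stride, int scale)
{
    for (int i = 0; i < vl; i++)   /* memcpy since unaligned is legal */
        memcpy(&lane[i], mem + d + base + (int64_t)i * stride * scale, 8);
}
void vload_indexed(uint64_t *lane, int vl, const uint8_t *mem,
                   int64_t d, int64_t base, const int64_t *vidx, int scale)
{
    for (int i = 0; i < vl; i++)
        memcpy(&lane[i], mem + d + base + vidx[i] * scale, 8);
}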
Writing the RTL code to support the vector memory ops has been
challenging. Using a simple approach ATM. The instruction needs to be re-issued for each vector lane accessed. Unaligned vector loads and
stores are also allowed, adding some complexity when the operation
crosses a cache-line boundary.
I have the max vector length and max vector size constants returned by
the GETINFO instruction which returns CPU specific information.
On 11/21/2025 9:09 PM, Robert Finch wrote:
On 2025-11-21 2:36 p.m., BGB wrote:
On 11/21/2025 7:31 AM, Michael S wrote:My forays into the world of graphics formats are pretty limited. I
On Thu, 13 Nov 2025 19:04:18 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
Not really.
That is, conversions are not blazingly fast, but still much better >>>>>> than any attempt to divide in any form of decimal. And helps to
preserve your sanity.
Are you trying to pull our proverbial leg here ?!?
After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in >>>> correctness of my above statement.
For the case of exact division, preservation of mental sanity during
fulfillment of requirements of this paragraph is far from simple,
regardless of numeric base used in the process.
One effectively needs to do a special extra-wide divide rather than
just a normal integer divide, etc.
But, yeah, fastest I had gotten in my experiments was radix-10e9
long- division, but still not the fastest option.
So, rough ranking, fast to slow:
Radix-10e9 Long Divide (fastest)
Newton-Raphson
Radix-10 Long Divide
Integer Shift-Subtract with converters (slowest).
Fastest converter strategy ATM:
Radix-10e9 double-dabble (Int->Dec).
MUL-by-10e9 and ADD (Dec->Int)
Fastest strategy: Unrolled Shifts and ADDs (*1).
*1: While it is possible to perform a 128-bit multiply decomposing
into multiplying 32-bit parts and adding them together; it was
working out slightly faster in this case to do a fixed multiply by
decomposing it into a series of explicit shifts and ADDs.
Though, in this case, it is faster (and less ugly) to decompose this
into a pattern of iteratively multiplying by smaller amounts. I had
ended up using 4x multiply by 100 followed by multiply by 10, as
while not the fastest strategy, needs less code than 2x multiply by
10000 + multiply by 10. Most other patterns would need more shifts
and adds.
In theory, x86-64 could do it better with multiply ops, but getting
something optimal out of the C compilers is a bigger issue here it
seems.
Unexplored options:
Radix 10e2 (byte)
Radix 10e3 (word)
Radix 10e4 (word)
Radix 10e3 could have the closest to direct mapping to DPD.
Looking at the decNumber code, it appears also to be Radix-10e9 based.
They also do significant (ab)use of the C preprocessor.
Apparently, "Why use functions when you can use macros?"...
For the Radix-10e9 long-divide, part of the magic was in the function
to scale a value by a radix value and subtract it from another array.
Ended up trying a few options, fastest was to temporarily turn the
operation into non-normalized 64-bit pieces and then normalize the
result (borrow propagation, etc) as an output step.
Initial attempt kept it normalized within the operation, which was
slower.
It was seemingly compiler-dependent whether it was faster to do a
combined operation, or separate scale and subtract, but the margins
were small. On MSVC the combined operation was slightly faster than
the separate operations.
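A minimal sketch of that scale-and-subtract step, assuming 10^9 limbs in uint32_t arrays, least-significant first (the names and the n<=64 bound are made up for illustration; the real code differs):

  #include <stdint.h>

  #define RADIX 1000000000u   /* 10^9 per limb */

  /* rem -= scale * div, over n limbs: accumulate non-normalized signed
     64-bit partials first, then one borrow-propagation pass brings
     every limb back into [0, RADIX). */
  static void scale_sub(uint32_t *rem, const uint32_t *div,
                        uint32_t scale, int n)
  {
      int64_t t[64];          /* sketch assumes n <= 64 */
      int64_t carry = 0;
      int i;

      for (i = 0; i < n; i++)   /* may go far negative; that is fine here */
          t[i] = (int64_t)rem[i] - (int64_t)div[i] * scale;

      for (i = 0; i < n; i++) { /* normalization (borrow propagation) pass */
          int64_t v = t[i] + carry;
          carry = v / (int64_t)RADIX;      /* truncates toward zero */
          v -= carry * (int64_t)RADIX;
          if (v < 0) { v += RADIX; carry--; }
          rem[i] = (uint32_t)v;
      }
  }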
...
Otherwise, after this, just went and fiddled with BGBCC some more,
adding more options for its resource converter.
Had before (for image formats):
In: TGA, BMP (various), PNG, QOI, UPIC
Out: BMP (various), QOI, UPIC
Added (now):
In: PPM, JPG, DDS
Out: PNG, JPG, DDS (DXT1 and DXT5)
Considered (not added yet):
PCX
Evaluated PCX, possible but not a clear win.
Fiddled with making the PNG encoder less slow, mostly this was
tweaking some parameters for the LZ searches. Initial settings were
using deeper searches over initially smaller sliding windows (at
lower compression levels); better in this case to do a shallower
search over a max-sized sliding window.
ATM, speed of PNG is now on-par with the JPG encoder (still one of
the slower options).
For simple use-cases, PNG still loses (in terms of both speed and
compression) to 16-color BMP + LZ compression (LZ4 or RP2).
Theoretically, indexed-color PNG exists, but is less widely supported.
It is less space-efficient to represent 16-colors as Deflate-
compressed color differences than it is to just represent the 4-bit
RGBI values directly.
However, can note that the RLE compression scheme (used by PCX) is
clearly inferior to that of any sort of LZ compression.
Comparably, PNG is also a more expensive format to decode as well
(even vs JPEG).
UPIC can partly address the use-cases of both PNG and JPEG while
being cheaper to decode than either, but more niche as pretty much
nothing supports it. Some of its design and properties being mostly
JPEG-like.
QOI is interesting, but suffers some similar limitations to PCX (its
design is mostly about more compactly encoding color-differences in
true-color images and otherwise only offers RLE compression).
QOI is not particularly effective against images with little variety
in color variation but lots of repeating patterns (I have a modified
QOI that does a little better here, still not particularly effective
with 16-color graphics though).
Otherwise, also ended up adding a small text format for image drawing
commands.
It is a simplistic line-oriented format containing various commands to
perform drawing operations or composite images.
creating a "canvas"
setting the working color
drawing lines
bucket fill
drawing text strings
overlaying other images
...
This is maybe (debatable) outside the scope of a C compiler, but
could have use-cases for preparing resource data (nevermind if scope
creep is partly also turning it into an asset-packer tool; where it
is useful to make graphics/sounds/etc in one set of formats and then
process and convert them into another set of files, usually inside of
some sort of VFS image or similar).
Design is much more simplistic than something like SVG and I am
currently assuming its use for mostly hand-edited files. Unlike SVG,
it also assumes drawing to a pixel grid rather than some more
abstract coordinate space (so, its abstract model is more like "MS
Paint" or similar); also SVG would suck as a human-edited format.
Granted, one could argue that asset-processing should be its own tool, with its output then converted to a format that the compiler accepts (WAD2 or WAD4 in this case) prior to compiling the main binary (and/or not using resource data).
Still, IMO, an internal WAD image is better than the horrid/unusable mess that Windows had used (where, these days, most people don't bother with the resource section for much more than storing a program icon or similar...).
But, realistically, one does still want to limit how much data they
stick into the EXE.
...
My forays into the world of graphics formats are pretty limited. I tend to use libraries already written by other people. I assume people
a lot brighter than myself have come up with them.
I usually wrote my own code for most things.
Not dealt much with FLIC.
In the past, whenever doing animated stuff, had usually used the AVI
file format. A lot of time, the codecs were custom.
Both AVI (and BMP) can be used to hold a wide range of image data,
partly as a merit of using FOURCCs.
Over the course of the past 15 years, have fiddled a lot here.
A few of the longer-lived ones:
BTIC1C (~ 2010):
Was a modified version of RPZA with Deflate compression glued on.
BTIC1H:
Made use of multiple block formats,
used STF+AdRice for entropy coding, and Paeth for color endpoints.
Block formats, IIRC:
4x4x2, 4x2x2, 2x4x2, 2x2x2, 4x4x1, 2x2x1, flat
4x4x2: 32-bits for pixel selectors
2x2x2: 8 bits for pixel selectors
BTIC4B:
Similar to BTIC1H, but a lot more complicated.
Switched to 8x8 blocks, so had a whole lot of block formats.
Shorter-Lived:
BTIC2C: Similar design to MPEG;
IIRC, used Huffman, but updated the Huffman tables for each I-Frame.
This sort of thing being N/A with STF+AdRice,
which starts from a clean slate every time.
1C: Was used for animated textures in my first 3D engine.
1H and 4B could be used for video, but were also used in my second 3D
engine for sprites and textures (inside of a BMP packaging).
My 3rd 3D engine is mostly using a mix of:
DDS (mostly DXT1)
BMP (mostly 16 color and 256 color).
Though, in modern times, things like 16-color graphics are overlooked;
in some cases they are still usable or useful (or at least sufficient).
Typically, I had settled on a variant of the CGA/EGA color palette:
0: 000000 (Black)
1: 0000AA (Blue)
2: 00AA00 (Green)
3: 00AAAA (Cyan)
4: AA0000 (Red)
5: AA00AA (Magenta)
6: AA5500 (Brown)
7: AAAAAA (LightGray)
8: 555555 (DarkGray)
9: 5555FF (LightBlue)
A: 55FF55 (LightGreen)
B: 55FFFF (LightCyan)
C: FF5555 (LightRed)
D: FF55FF (Violet)
E: FFFF55 (Yellow)
F: FFFFFF (White)
I am not sure why they changed it for the default 16-color assignments
in VGA (eg, in the Windows 256-color system palette). Like, IMO, 00/AA
and 55/FF work better for typical 16-color use-cases than 00/80 and 00/FF.
Sorta depends on use-case: Sometimes something works well as 16 colors, other times it would fall on its face.
Most other designs sucked so bad they didn't get very far.
Where, I had ended up categorizing designs:
BTIC1x: Designs mostly following an RPZA like path.
1C: RPZA + Deflate
Mostly built on 4x4x2 blocks (32 bits).
1D, 1E: Byte-Encoding + Deflate
Both sucked, quickly dropped.
Both were like RPZA but with 48-bit 4:2:0 blocks.
Neither great compression nor particularly fast.
Deflate carries a high computational overhead.
1F, 1G: No entropy coding (back to being like RPZA)
Major innovations: Variable-size pixel blocks.
1H: STF+AdRice
Mostly final state of 1x line.
BTIC2x: Designs mostly influenced by JPEG and MPEG.
Difficult to make particularly fast.
2A/2B: Modified MJPEG IIRC.
Technically, also based on my BTJPEG format (*1).
2C: IIRC, MPEG-like, Huffman-coded.
Well influenced by both MPEG and the Xiph Theora codec.
2D: Like 2C, but STF+AdRice
2E: Like 2C, but byte stream based
Was trying, mostly in vain, to make it faster.
My attempts at this style of codec were mostly too slow.
2F: Goes back to a more JPEG like core in some ways.
Entropy and VLN scheme borrows more from Deflate.
Though, uses a shorter limit on max symbol length (13 bit).
13 bit simplifies things and makes decoding faster vs 15 bit.
Abandons DCT and YCbCr in favor of Block-Haar and RCT.
Later, UPIC did similar, just with STF+AdRice versus Huffman.
BTIC3x:
Attempts to hybridize 1x and 2x
Nothing implemented, all designs too complicated to bother with.
BTIC4x:
4A: RPZA-like but with 8x8 blocks and multiple block sizes.
4B: Like 4A but reusing the encoding scheme from 1H.
BTIC5x:
5A: Resembled a CRAM/QOI hybrid, but with 8-bit indexed colors.
No entropy coding.
5B: Like 5A, but used differential RGB555 (still QOI like).
Major innovation was to use a 6-bit 64-entry pattern table.
Optionally, can use per-frame RP2 or TKuLZ compression.
Used if doing so results in a significant savings.
*1: BTJPEG was an attempt at making a more advanced image format based
on tweaking the existing T.81 JPEG format in a way that sorta worked
in existing decoders. The more widespread use (and "not totally dead"
feature) was to allow for an embedded alpha channel as essentially
another monochrome JPEG inside the APP11 marker.
I had tried a bunch of other ideas, but it turned into a mess of experimental tweaks, and most of it died off. The surviving variant is basically just T.81+JFIF with an optional alpha channel (ignored by a non-aware JPEG decoder).
Some other (mostly dead) tweaks were things like:
Allowing multi-layered images (more like Paint.NET's PDN or GIMP's XCF, mostly by nesting the images like a Matryoshka doll), where the top-level image would contain a view of all the layers rendered together;
Allowing lossless images (similar to PNG) by using SERMS-RDCT and RCT (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly reversible, at the cost of speed).
In the early 2010s, I was pretty bad about massively over-engineering everything.
Later on, some ideas were reused in 2F and UPIC.
Though, 2F and UPIC were much less over-engineered.
Did specify possible use as video codecs, but thus far both were used
only as still image formats.
The major goal for UPIC was mostly to address the core use-cases but
also for the decoder to be small and relatively cheap. Still sorta JPEG competitive despite being primarily cost-optimized to try to make it
more viable for use in programs running on the BJX2 core (where JPEG decoding is slow and expensive).
As for Static Huffman vs STF+AdRice:
Huffman:
+ Slightly faster for larger payloads
+ Optimal for a static distribution
- Higher memory cost for decoding (storing decoder tables)
- High initial setup cost (setting up decoder tables)
- Higher constant overhead (storing symbol lengths)
- Need to provision for storing Huffman tables
STF+AdRice:
+ Very cheap initial setup (minimal context)
+ No need to transmit tables
+ Better compression for small data
+ Significantly faster than Adaptive Huffman
+ Significantly faster than Range Coding
- Slower for large data and worse compression vs Huffman.
Where, STF+AdRice is mostly:
Have a table of symbols;
Whenever a symbol is encoded, swap it forwards;
Next time, it may potentially be encoded with a smaller index.
Encode indices into table using Adaptive Rice Codes.
Or, basically, using a lookup table to allow AdRice to pretend to be Huffman. Also reasonably fast and simple.
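A minimal sketch of the STF half (the table-and-swap part as described above; emit_rice is a stub standing in for the adaptive Rice coder, and the single-step swap distance is an assumption):

  #include <stdint.h>

  static uint8_t stf_tab[256];   /* index -> symbol */
  static uint8_t stf_pos[256];   /* symbol -> index */

  static void stf_init(void)
  {
      int i;
      for (i = 0; i < 256; i++)
      { stf_tab[i] = (uint8_t)i; stf_pos[i] = (uint8_t)i; }
  }

  static void emit_rice(unsigned idx)
  {
      (void)idx;  /* placeholder: a real coder emits an adaptive Rice code */
  }

  static void stf_encode_symbol(uint8_t sym)
  {
      unsigned i = stf_pos[sym];
      emit_rice(i);                         /* encode current table index */
      if (i > 0) {                          /* swap one step toward front */
          uint8_t other = stf_tab[i - 1];
          stf_tab[i - 1] = sym;    stf_tab[i] = other;
          stf_pos[sym] = (uint8_t)(i - 1);  stf_pos[other] = (uint8_t)i;
      }
  }

Frequently used symbols migrate toward index 0, where the Rice codes are shortest, which is what lets it approximate Huffman without tables.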
Block-Haar vs DCT:
+ Block-Haar is faster and easily reversible (lossless);
+ Mostly a drop-in replacement for DCT/IDCT in the design.
+ Also faster than WHT (Walsh-Hadamard Transform)
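For reference, one reversible Haar butterfly in lifting form (a generic S-transform sketch, not necessarily the exact Block-Haar variant used here); it is exactly invertible in integer math regardless of the rounding in the average:

  /* forward: (a,b) -> (s,d); inverse recovers (a,b) bit-exactly */
  static void haar_fwd(int *a, int *b)
  {
      int d = *a - *b;         /* difference */
      int s = *b + (d >> 1);   /* "average"; >> is an arithmetic shift */
      *a = s; *b = d;
  }

  static void haar_inv(int *a, int *b)
  {
      int s = *a, d = *b;
      int y = s - (d >> 1);    /* recover original b */
      int x = y + d;           /* recover original a */
      *a = x; *b = y;
  }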
RCT vs YCbCr:
RCT is both slightly faster, and also reversible;
Had experimented with YCoCg, but saw no real advantage over RCT.
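Assuming "RCT" here is the JPEG 2000 style reversible color transform, the integer forward/inverse pair looks like this (it relies on arithmetic right shift for negative values, as common compilers provide):

  static void rct_fwd(int r, int g, int b, int *y, int *u, int *v)
  {
      *y = (r + 2 * g + b) >> 2;   /* luma-ish term */
      *u = b - g;
      *v = r - g;
  }

  static void rct_inv(int y, int u, int v, int *r, int *g, int *b)
  {
      *g = y - ((u + v) >> 2);     /* exact inverse of the forward rounding */
      *b = u + *g;
      *r = v + *g;
  }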
The existence of BTIC5x was mostly because:
BTIC1H and BTIC4B were too computationally demanding to do 320x200 16Hz
on a 50MHz BJX2 core;
MS-CRAM was fast to decode, but needed too much bitrate (SDcard couldn't keep the decoder fed with any semblance of image quality).
So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more CRAM-like decoding speeds.
Also, while reasonably effective (and fast by desktop PC standards), one
design being overly complicated (and thus the code is large and bulky).
Part of this was due to having too many block formats.
If my UPIC format were put into my older naming scheme, it would likely be called 2G. The design is kinda similar to 2F, but replaces Huffman with STF+AdRice.
As for RP2 and TKuLZ:
RP2 is a byte-oriented LZ77 variant, like LZ4,
but on-average compresses slightly better than LZ4.
TKuLZ: Is sorta like a simplified/tuned Deflate variant.
Uses a shorter max symbol length,
borrows some design elements from LZ4.
Can note some past experiments with LZ decompression (at desktop-PC speeds), with entropy scheme and len/dist limits:
LZMA : ~ 35 MB/sec (Range Coding, 273/ 4GB)
Zstd : ~ 60 MB/sec (tANS, 16MB/ 128MB)
Deflate: ~ 175 MB/sec (Huffman, 258/ 32767)
TKuLZ : ~ 300 MB/sec (Huffman, 65535/262143)
RP2 : ~ 1100 MB/sec (Raw Bytes, 512/131071)
LZ4 : ~ 1300 MB/sec (Raw Bytes, 16383/ 65535)
While Zstd is claimed to be fast, my testing tended to show it closer to LZMA speeds than to Deflate, but it does give compression closer to
LZMA. The tANS strategy seems to under-perform claims IME (and is
notably slower than static Huffman). Also it is the most complicated
design among these.
A lot of my older stuff used Deflate, but often Deflate wasn't fast
enough, so has mostly gotten displaced by RP2 in my uses.
TKuLZ is an intermediate: generally faster than Deflate, with an option
to gain some speed by using fixed-length symbols in some cases. This can
push it to around 500 MB/sec (at the expense of compression), but it is
hard to get much faster (or anywhere near RP2 or LZ4).
Whether RP2 or LZ4 is faster seems to depend on target:
BJX2 Core, RasPi, and Piledriver: RP2 is faster.
Mostly things with in-order cores.
And Piledriver, which behaved almost more like an in-order machine.
Zen+, Core 2, and Core i7: LZ4 is faster.
LZ4 typically needs multiple chained memory accesses for each LZ run, whereas for RP2, match length/distance and raw count are typically all available via a single memory load (then maybe a few bit-tests and conditional branches).
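To illustrate the difference (not the real RP2 or LZ4 bitstreams; the tag layout below is made up), an RP2-style decode step can pull all three fields from a single 32-bit load:

  #include <stdint.h>
  #include <string.h>

  /* One decode step: a single tag load yields raw count, match length,
     and distance; LZ4 instead chains token -> literals -> offset loads. */
  static const uint8_t *lz_step(const uint8_t *src, uint8_t **dstp)
  {
      uint32_t tag;
      uint8_t *dst = *dstp;
      unsigned nraw, len, dist, i;

      memcpy(&tag, src, 4);            /* one load gets all fields */
      nraw = tag & 0x1F;               /* 5-bit raw count   (made up) */
      len  = (tag >> 5) & 0x1FF;       /* 9-bit match len   (made up) */
      dist = tag >> 14;                /* 18-bit distance   (made up) */

      src += 4;
      memcpy(dst, src, nraw);          /* copy the raw bytes */
      src += nraw; dst += nraw;
      {
          const uint8_t *m = dst - dist;   /* byte-wise, overlap-safe copy */
          for (i = 0; i < len; i++) dst[i] = m[i];
      }
      *dstp = dst + len;
      return src;
  }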
...
A while ago I wrote a set of graphics routines in assembler that were
quite fast. One format I have dealt with is the .flic file format used
to render animated graphics. I wanted to write my own CIV style game.
It took a little bit of research and some reverse engineering.
Apparently, the authors used a modified version of the format making
it difficult to use the CIV graphics in my own game. I never could get
it to render as fast as the game’s engine. I wrote the code for my
game in C or C++; the original game's engine code was likely in a
different language.
This sort of thing is almost inevitable with this stuff.
Usually I just ended up using C for nearly everything.
*****
Been working on vectors for the ISA. I split the vector length
register into eight sections to define up to eight different vector
lengths. The first five are defined for integer, float, fixed,
character, and address data types. I figure one may want to use
vectors of different lengths at the same time, for instance to address
data using byte offsets, while the data itself might be a float. The
vector load / store instructions accept a data type to load / store
and always use the address type for address calculations.
There is also a vector lane size register split up the same way. I had
thought of giving each vector register its own format for length and
lane size. But thought that is a bit much, with limited use cases.
I think I can get away with only two load and two store instructions.
One to do a strided load and a second to do a vector indexed load
(gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
Where Rindex is used as the stride when scalar or as a supplier of the
lane offset when Rindex is a vector.
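In C terms, one plausible reading of the two load forms (the displacement d is omitted and the names are illustrative, not the actual RTL):

  #include <stdint.h>
  #include <string.h>

  /* Strided: Rindex holds a scalar stride. */
  static void vload_strided(uint64_t *dst, const uint8_t *base,
                            int64_t stride, int scale, int lanes, int elsize)
  {
      int lane;
      for (lane = 0; lane < lanes; lane++) {
          const uint8_t *ea = base + (lane * stride << scale);
          memcpy(&dst[lane], ea, (size_t)elsize);  /* elsize <= 8 assumed */
      }
  }

  /* Indexed (gather): Rindex is a vector of per-lane offsets. */
  static void vload_indexed(uint64_t *dst, const uint8_t *base,
                            const int64_t *index, int scale,
                            int lanes, int elsize)
  {
      int lane;
      for (lane = 0; lane < lanes; lane++) {
          const uint8_t *ea = base + (index[lane] << scale);
          memcpy(&dst[lane], ea, (size_t)elsize);
      }
  }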
Writing the RTL code to support the vector memory ops has been
challenging. Using a simple approach ATM. The instruction needs to be
re-issued for each vector lane accessed. Unaligned vector loads and
stores are also allowed, adding some complexity when the operation
crosses a cache-line boundary.
I have the max vector length and max vector size constants returned by
the GETINFO instruction, which returns CPU-specific information.
I don't get it...
Usually makes sense to treat vectors as opaque blobs of bits that are
then interpreted as one of the available formats for a specific operation.
In my case, I have a SIMD setup:
2 or 4 elements in a GPR or GPR pair;
Most other operations are just the normal GPR operations.
...
On 2025-11-22 5:54 a.m., BGB wrote:
Many vector machines (RISC-V) have a way of specifying the vector
length and element size, but it tends to be a global setting which may
be overridden in some cases by specifying in the instruction. For Qupls
it also allows setting based on the data type, which is a bit of a
misnomer; it would be better named data format. It is just three bits in
the instruction that select one of the fields in the VLEN, VELSZ
registers. The instruction itself specifies the data type for the
operation on an opaque bag of bits. It is possible to encode selecting
the integer size fields, then performing a float operation on the data.
The size agnostic instructions use the micro-op translator to convert
the instructions into size specific versions. The translator calculates
the number of architectural registers required then puts the appropriate number of instructions (up to eight) in the micro-op queue.
Therefore, there are lots of vector instructions in the ISA: SIMD-type instructions, where the size of a vector is assumed to be one register and the element size is specified by the instruction, so separate instructions for 1, 2, 4, or 8 elements (for example, 50 instructions * four different sizes = 200 instructions); and also size-agnostic instructions, where the size/format comes indirectly from the VLEN (vector length) and VELSZ (vector lane size) registers.
The size agnostic instructions allow writing a generic vector routine without needing to code the size of the operation. This avoids having a switch statement with a whole bunch of cases for different vector
lengths. It also avoids having thousands of vector instructions (50 instructions * 5 different lane sizes * 64 different lengths).
The vectors are opaque blobs of bytes in my case. Size specs are in
terms of bytes. The vectors are not a fixed length. They may (currently)
use from 0 to 8 GPR registers. Hence the need to specify the length in
use. While the length could be specified as part of the format for the instruction, that would require a wide instruction.
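A sketch of how a size-agnostic op might expand in the micro-op translator (purely illustrative C; the names and fields are not the actual Qupls RTL):

  /* Expand one size-agnostic vector instruction into up to eight
     size-specific micro-ops, one per 64-bit architectural register. */
  typedef struct { int opcode; int reg; int elsize; } uop_t;

  static int expand_vector_op(int opcode, int vlen_bytes, int elsize,
                              int base_reg, uop_t *q)
  {
      int nregs = (vlen_bytes + 7) / 8;  /* registers needed for the vector */
      int i;
      if (nregs > 8) return -1;          /* more than 8 uops: not allowed */
      for (i = 0; i < nregs; i++) {
          q[i].opcode = opcode;
          q[i].reg    = base_reg + i;    /* consecutive GPRs hold the vector */
          q[i].elsize = elsize;          /* selected via VLEN/VELSZ fields */
      }
      return nregs;
  }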
*****
The .flic file format is supposed to be fast enough to allow use “on the fly”, but I just decompress all the frames into a matrix of bitmaps at game startup, then select the appropriate one based on direction and
timing. With dozens of different sprites and hundreds of frames, I think
it takes about 3GB of memory just for the sprite data. I had trouble
running this on my machine a few years ago, but maybe with newer
technology it could work.
Experimented some with LZ4 and Huffman encoding. Huffman used for ECC
logic.
On 2025-11-11 2:30 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
Any FP value representable in lower precision can be exactly represented
in higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value.
When My 66000 generates a NaN it inserts the cause in the 3 HoBs and inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
but I thought it was best to point at the causing-instruction and an encoded "why" the NaN was generated. The cause is a 3-bit index to the
7 defined IEEE exceptions.
My float package puts the cause in the 3 LoBs. The cause is always in
the low order bits of the register then, even when the precision is different. But the address is not tracked. The package does not have
access to the address. Seems like NaN trace hardware might be useful.
There are rules when more than one NaN is an operand to an instruction, designed to leave the more important NaN as the result. {Where more important is generally the first to be generated.}
Hopefully the package follows the rules correctly. NaN operation is one thing not tested yet.
This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision. If it were
indicated by the NaN software might be able to fix the result.
I think it is better to fix the SW that thinks a (half) is a (float).
It would be better, but some software is so complex it may be unknown
what values are coming in. The SW does not really need to croak if it's a
lower precision value, as they are always representable in a higher precision.
I also
preserve the sign bit of the number in the NaN box.
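A minimal sketch of that kind of NaN boxing in C (the tag position within bits 32..51, the tag values, and the helper names are all made up; only the overall shape follows the description above):

  #include <stdint.h>
  #include <string.h>

  #define BOX_EXP    0x7FF0000000000000ull  /* all-ones double exponent */
  #define BOX_QUIET  0x0008000000000000ull  /* quiet bit (mantissa MSB) */
  #define PREC_SHIFT 48                     /* tag within bits 32..51 */
  #define PREC_F32   1ull                   /* made-up tag: boxed single */

  /* Box a 32-bit float so it reads as a quiet NaN at double precision,
     with a precision tag and the value's sign mirrored into bit 63. */
  static uint64_t box_f32(float f)
  {
      uint32_t bits;
      uint64_t sign;
      memcpy(&bits, &f, 4);
      sign = (uint64_t)(bits >> 31) << 63;   /* preserve the sign bit */
      return sign | BOX_EXP | BOX_QUIET
                  | (PREC_F32 << PREC_SHIFT) | bits;
  }

  static float unbox_f32(uint64_t v)
  {
      uint32_t bits = (uint32_t)v;   /* payload rides in the low 32 bits */
      float f;
      memcpy(&f, &bits, 4);
      return f;
  }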
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-11 2:30 p.m., MitchAlsup wrote:
My float package puts the cause in the 3 LoBs. The cause is always in the low order bits of the register then, even when the precision is different. But the address is not tracked. The package does not have access to the address. Seems like NaN trace hardware might be useful.
Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
For conversation about LoBs versus HoBs.
On 2025-11-22 10:20 p.m., MitchAlsup wrote:
Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
For conversation about LoBs versus HoBs.
Okay, it sounds like there are good reasons to use the HoBs. But I think
it is only when converting precisions that it makes a difference. I have
the float package moving the LoBs of a larger precision to the LoBs of
the lower precision if a NaN (or infinity) is present. I do not think
this consumes any more logic. It looks like just wires. It looks to be a
three bit mux on the low order bits going the other way.
I suppose I could code the package to accept NaN values either way.
The following NaN values are in use:
`define QSUBINFD   63'h7FF0000000000001 // - infinity - infinity
`define QINFDIVD   63'h7FF0000000000002 // - infinity / infinity
`define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
`define QINFZEROD  63'h7FF0000000000004 // - infinity X zero
`define QSQRTINFD  63'h7FF0000000000005 // - square root of infinity
`define QSQRTNEGD  63'h7FF0000000000006 // - square root of negative number
On 2025-11-22 11:16 p.m., Robert Finch wrote:
When converting a NaN from higher to lower precision, the float package
preserves both the low order four bits and as many high order bits of
the NaN as will fit. The middle bits are dropped.
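A sketch of that narrowing policy for double -> single, using the IEEE payload widths (keep the low 4 payload bits plus as many high payload bits as fit, drop the middle; the helper name is made up):

  #include <stdint.h>

  static uint32_t nan_narrow_d2s(uint64_t d)
  {
      uint32_t sign = (uint32_t)(d >> 63) << 31;
      uint64_t pay  = d & 0x000FFFFFFFFFFFFFull;    /* 52-bit double payload */
      uint32_t lo4  = (uint32_t)(pay & 0xF);        /* low 4 bits kept as-is */
      uint32_t hi19 = (uint32_t)(pay >> (52 - 19)); /* top 19 payload bits */
      /* 19 + 4 = 23 bits of single payload; the middle bits are dropped */
      return sign | 0x7F800000u | (hi19 << 4) | lo4;
  }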
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
Why would a chemical engineer know the basics of heat transfer?
They are going to use commercial programs to design them anyway.
Why would anybody know the basics of what they are doing?
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
When I got my MSCS, computer engineering courses were
required, including basic logic elements and overviews
of processor design.
Why would anybody know the basics of what they are doing?
Indeed, a programmer that doesn't understand the underlying
hardware is crippled.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Just today, I compiled
u4 = u1/10;
u3 = u1%10;
(plus some surrounding code) with gcc-14 in three contexts. Here's
the code for two of them (the third one is similar to the second one):
Care to present a self-contained example? Otherwise, your
example and its analysis are meaningless to the reader.
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-22 10:20 p.m., MitchAlsup wrote:
Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
For conversation about LoBs versus HoBs.
Okay, it sounds like there are good reasons to use the HoBs. But I think
it is only when converting precisions that it makes a difference. I have
the float package moving the LoBs of a larger precision to the LoBs of
the lower precision if a NaN (or infinity) is present. I do not think
this consumes any more logic. It looks like just wires. It looks to be a
three bit mux on the low order bits going the other way.
The other part of the paper's reasoning is that if you want to insert
some portion of IP in the NaN, doing it bit-reversed enables conversions
to smaller and larger to lose as few bits as possible. The realization
was a surprise to me (yesterday).
scott@slp53.sl.home (Scott Lurndal) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about
pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
Sure, if a CS graduate works in an application area, they need to
learn about that application area, whatever it is.
But why would knowledge about processor pipelines be part of their CS
curriculum?
When I got my MSCS, computer engineering courses were
required, including basic logic elements and overviews
of processor design.
For me, too. I even learned something about processor pipelines, in a
specialized elective course.
Why would anybody know the basics of what they are doing?
Processor pipelines are not the basics of what a CS graduate is doing.
They are an implementation detail in computer engineering.
Indeed, a programmer that doesn't understand the underlying
hardware is crippled.
If anything, understanding OoO execution and its effect on
performance is more relevant. But looking at the dearth of textbooks,
and the fact that Henry Wong did his thesis on his own initiative,
even among computer engineering professors that is a topic that is of
little interest.
Back to programmers: There is also the other POV that programmers
should never concern themselves with low-level details and should
always leave that to compilers, which supposedly can do all those
things better than programmers (I call that the compiler supremacy
position). Compiler supremacy is wishful thinking, but wishful
thinking has a strong influence in the world.
A few more examples where compilers are not as good as even I expected:
Just today, I compiled
u4 = u1/10;
u3 = u1%10;
(plus some surrounding code) with gcc-14 in three contexts. Here's
the code for two of them (the third one is similar to the second one):
movabs $0xcccccccccccccccd,%rax movabs $0xcccccccccccccccd,%rsi
sub $0x8,%r13 mov %r8,%rax
mul %r8 mov %r8,%rcx
mov %rdx,%rax mul %rsi
shr $0x3,%rax shr $0x3,%rdx
lea (%rax,%rax,4),%rdx lea (%rdx,%rdx,4),%rax
add %rdx,%rdx add %rax,%rax
sub %rdx,%r8 sub %rax,%r8
mov %r8,0x8(%r13) mov %rcx,%rax
mov %rax,%r8 mul %rsi
shr $0x3,%rdx
mov %rdx,%r9
The major difference is that in the left context, u3 is stored into
memory (at 0x8(%r13)), while in the right context, it stays in a
register. In the left context, gcc managed to base its computation of
u1%10 on the result of u1/10; in the right context, gcc first computes
u1%10 (computing u1/10 as part of that), and then computes u1/10
again.
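For reference, a minimal self-contained variant of the kind asked for
above (my reconstruction; u1..u4 follow the thread's names, the
surrounding code is invented):

#include <stdint.h>
#include <stdio.h>

/* gcc turns the division by 10 into a multiply by 0xcccccccccccccccd
   plus a shift; ideally the remainder reuses the quotient: u1 - 10*u4. */
static void split(uint64_t u1, uint64_t *u4, uint64_t *u3)
{
    *u4 = u1 / 10;
    *u3 = u1 % 10;
}

int main(void)
{
    uint64_t q, r;
    split(1234, &q, &r);
    printf("%llu %llu\n", (unsigned long long)q, (unsigned long long)r);
    return 0;
}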
On 2025-11-23 3:13 p.m., MitchAlsup wrote:
The other part of the paper's reasoning is that if you want to insert
some portion of IP in NaN, doing it bit-reversed enables conversions
to smaller and larger to lose as few bits as possible. The realization
was a surprise to me (yesterday).
It is probably not possible to embed enough IP information in smaller
floating-point formats (<=16-bit) to be worthwhile. For 32-bit floats
only about 18 bits of the address can be stored. It looks like different
formats are going to handle NaNs differently, which I find somewhat
undesirable.
I am now leaning towards allocating four HOB bits to indicate the NaN
cause, and then filling the rest of the payload with a bit-reversed
address. There should be some instruction to extract the NaN cause and
address.
I like the bit-reversed address idea. Losing high order address bits is
less of an issue than low order ones.
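A sketch of that layout (field widths assumed: 4 cause bits in the
payload HOBs of a binary64 NaN, 48 reversed address bits below; in
hardware the reversal is just wiring):

#include <stdint.h>

static uint64_t bit_reverse64(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        r |= ((x >> i) & 1) << (63 - i);
    return r;
}

/* cause is assumed nonzero so the payload cannot be all zero, which
   would read back as an infinity instead of a NaN. */
static uint64_t nan_with_ip(unsigned cause, uint64_t ip)
{
    uint64_t rev48 = bit_reverse64(ip) >> 16;   /* reversed low 48 bits of ip */
    return 0x7FF0000000000000ull
         | ((uint64_t)(cause & 0xF) << 48)      /* cause in payload bits 48..51 */
         | (rev48 & 0x0000FFFFFFFFFFFFull);     /* address in payload bits 0..47 */
}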
The extra bit in the NaN cause may be used by software when access
to the payload area is desired for other purposes.
I still like the idea of a NaN trace facility as an option. Perhaps the debugger logic could trigger a dump to trace on a NaN after a specific address.
I think that just a cause code to indicate multiple NaNs colliding would
be good. With the fused-dot-product there could be up to four NaNs. Some
of the information is going to be lost, so might as well just assign a code.
Insane idea: use more payload bits to record the colliding NaN causes,
then dump it to a CSR somewhere when the address is inserted into the
NaN. The FP status needs to be recorded, so maybe it could be part of
that status record.
My float package does not have access to an address, so it cannot be inserted in the individual modules where the NaN occurs. It must be
inserted at a higher level in the FPU which I believe has access to the instruction address.
Thomas Koenig <tkoenig@netcologne.de> writes:
Power's not dead, either, if very highly priced.
New Power CPUs and machines based on them are released regularly. I
think there is enough business in the iSeries (or whatever its current
name is) to produce enough money for the costs of that development.
pSeries benefits from that. I guess that the profits from that are
enough to finance the development of the pSeries machines, but can
contribute little to finance the development of the CPUs.
MIPS is still
being sold, apparently.
From <https://en.wikipedia.org/wiki/MIPS_architecture>:
|In March 2021, MIPS announced that the development of the MIPS
|architecture had ended as the company is making the transition to
|RISC-V.
So it's the same status as SPARC. They may be selling to existing
customers, but nobody sane will use MIPS for a new project.
As for RISC-V,
I am not sure how much business they actually generate compared
to others.
I think a lot of embedded RISC-Vs are used, e.g., in WD (and now
Sandisk) HDDs and SSDs; so you can look at the business reports of WD
if you want to know how much business they make. As for things you
can actually program, there are a number of SBCs on sale (and we have
one), from the Raspi Pico 2 (where you apparently can use either
ARMv8-M (i.e., ARM T32) or RISC-V (probably some RV32 variant)) up to
stuff like the Visionfive V2, several Chinese offerings, and some
Hifive SBCs. The latter are not yet competitive in CPU performance
with the likes of RK3588-based SBCs or the Raspi 5, so I expect the
main reason for buying them is to try out RISC-V (we have a Visionfive
V1 for that purpose); still, the fact that there are several offerings
indicates that there is nonnegligible revenue there.
In this case, put the cause in a container the instruction drags down
the pipe, and retrieve it when you do have address access to where it
needs to go.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
IIUC Chinese bought rights to use MIPS architecture
and that goes on.
It seems that main Chinese bet is on RISC-V. They manufacture
a lot of ARM-s, but are not entirely comfortable with it.
That also seems to be the Chinese approach to other technologies:
E.g., they build solar power, coal power, wind power, nuclear power,
hydro power, etc.; and in nuclear power, they built a few of every
kind of Generation III reactor on the market before developing their
own designs, some of them based on the Westinghouse AP-1000,
others (Hualong One) based on earlier Chinese Generation II designs.
They are also experimenting with Generation IV and SMR designs.
So, at least in technology, the CP does not pretend to know what's
best.
- anton
On Wed, 26 Nov 2025 07:53:49 GMT...
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
That also seems to be the Chinese approach to other technologies:
E.g., they build solar power, coal power, wind power, nuclear power,
hydro power, etc.; and in nuclear power, they built a few of every
kind of Generation III reactor on the market before developing their
own designs, some of them based on the Westinghouse AP-1000,
others (Hualong One) based on earlier Chinese Generation II designs.
They are also experimenting with Generation IV and SMR designs.
So, at least in technology, the CP does not pretend to know what's
best.
Isn't it the same as in all big countries, except ultra-pro-nuclear
France and ultra-anti-nuclear Germany?
China is just bigger, so it is capable of building more things
simultaneously.
In this case, put the cause in a container the instruction drags down
the pipe, and retrieve it when you do have address access to where it
needs to go.
I may change things to pass the address around in the float package.
Putting the address into the NaN later may cause issues with timing. It
adds a mux into things. May be better to use the original NaN mux in the float modules. May call it a NaN identity field instead of an address.
Modified NaN support in the float package to store to the HOBs.
Survey says:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register,
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed, so that six registers may be pushed or popped?
I think the SP should be identified, as PUSH / POP would be the only
instructions assuming the SP register. Otherwise any register could be
chosen by the compiler.
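The arithmetic behind the question, as a sketch (field positions and
the 4-bit opcode are assumptions, not the actual Qupls encoding): with
64 GPRs a register number is 6 bits, so six fields fill 36 of the 40
instruction bits.

#include <stdint.h>

static void decode_push(uint64_t inst40, unsigned reg[6])
{
    for (int i = 0; i < 6; i++)                  /* six 6-bit fields above */
        reg[i] = (inst40 >> (4 + 6 * i)) & 0x3F; /* an assumed 4-bit opcode */
    /* Option A: reg[5] names the stack pointer; five registers move.
       Option B: SP is implicit; all six fields name pushed registers. */
}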
antispam@fricas.org (Waldek Hebisch) writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
IIUC Chinese bought rights to use MIPS architecture
and that goes on.
None are known to me. LoongSon originally implemented MIPS, but,
according to <https://en.wikipedia.org/wiki/Loongson>:
|Loongson moved to their own processor instruction set architecture
|(ISA) in 2021 with the release of the Loongson 3 5000 series.
This instruction set is called LoongArch, and while it is similar to
MIPS, RISC-V, Alpha, DLX, Nios, it is different enough that Bernd
Paysan wrote a separate assembler and disassembler for it <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/arch/loongarch64> rather than copying and modifying the MIPS assembler/disassembler.
It seems that main Chinese bet is on RISC-V. They manufacture
a lot of ARM-s, but are not entirely comfortable with it.
It seems to me that different companies in China use different
architectures. Huawei on ARM, Loongson on Loongarch, some on RISC-V
etc.
That also seems to be the Chinese approach to other technologies:
E.g., they build solar power, coal power, wind power, nuclear power,
hydro power, etc.; and in nuclear power, they built a few of every
kind of Generation III reactor on the market before developing their
own designs, some of them based on the Westinghouse AP-1000,
others (Hualong One) based on earlier Chinese Generation II designs.
They are also experimenting with Generation IV and SMR designs.
So, at least in technology, the CP does not pretend to know what's
best.
- anton
MitchAlsup wrote:
The other part of the paper's reasoning is that if you want to insert
some portion of IP in NaN, doing it bit-reversed enables conversions
to smaller and larger to lose as few bits as possible. The realization
was a surprise to me (yesterday).
I think I read about IBM's approach years before the 754-2019 process started.
Storing the offending address in byte-reversed order would do pretty
much the same thing, but at lower HW cost, right?
Terje
Robert Finch <robfi680@gmail.com> posted:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
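A behavioral sketch of that description (my reading: the save order,
the wrap of the register range, and the toy memory model are all
assumptions):

#include <stdint.h>

#define NREGS 32
#define SP    31

typedef struct {
    uint64_t reg[NREGS];
    uint8_t  mem[1 << 20];   /* toy memory */
} Cpu;

static void store64(Cpu *c, uint64_t a, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        c->mem[a + i] = (uint8_t)(v >> (8 * i));
}

/* ENTER rstart,rstop,imm: save reg[rstart..rstop], then open the frame. */
static void enter(Cpu *c, unsigned rstart, unsigned rstop, uint64_t imm)
{
    for (unsigned r = rstart; ; r = (r + 1) % NREGS) {
        c->reg[SP] -= 8;
        store64(c, c->reg[SP], c->reg[r]);
        if (r == rstop)
            break;
    }
    c->reg[SP] -= imm;       /* the stack space the immediate asks for */
}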
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Robert Finch <robfi680@gmail.com> posted:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
That seems both confining for the compiler designers and less
useful than the VAX-11 register mask stored in the instruction stream
at the function entry point(s).
On 11/26/25 5:16 PM, Scott Lurndal wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Robert Finch <robfi680@gmail.com> posted:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
That seems both confining for the compiler designers and less
useful than the VAX-11 register mask stored in the instruction stream
at the function entry point(s).
When the compiler can control the order in which registers are chosen
to allocate, the ENTER and EXIT stuff works very well.
Robert Finch <robfi680@gmail.com> posted:
In this case, put the cause in a container the instruction drags down
the pipe, and retrieve it when you do have address access to where it
needs to go.
I may change things to pass the address around in the float package.
Putting the address into the NaN later may cause issues with timing. It
adds a mux into things. May be better to use the original NaN mux in the
float modules. May call it a NaN identity field instead of an address.
For example: when a My 66000 instruction needs to raise an exception
the Inst *I argument contains a field I->raised which is set (1<<excpt)
and at the end of the pipe (at retire), t->raised |= I->raised. Where
we have a *t there is also t->ip. So, you don't have to drag Thread *t through all the subroutine calls, but you can easily access t->raised
at the point you do have access to t->ip.
Modified NaN support in the float package to store to the HOBs.
Survey says:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
{{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}
Because the stack is always DoubleWord aligned, the 3-LoBs of the
immediate are used to indicate "special" activities on a couple of
registers {R0, R31, R30}. R31 is rarely saved and reloaded from Stack
but just returned to its previous value by integer arithmetic. FP can
be updated or it can be treated like "just another register". R0 can
be loaded directly to t->ip, or loaded into R0 for stack walk-backs.
The corresponding LDM and STM are seldom used.
I think the SP should be identified as PUSH / POP would be the only
instructions assuming the SP register. Otherwise any register could be
chosen by the compiler.
I started with that philosophy--and begrudgingly went away from it as
a) the compiler took form
b) we started adding instructions to ISA to remove instructions from
code footprint.
"Brian G. Lucas" <bagel99@gmail.com> writes:
On 11/26/25 5:16 PM, Scott Lurndal wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:When the compiler can control the order in which registers are chosen
Robert Finch <robfi680@gmail.com> posted:
The Qulps PUSH and POP instructions have room for six register fields. >>>> Should one of the fields be used to identify the stack pointer register >>>> allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 regsiters can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
That seems both confining for the compiler designers and less
useful than the VAX-11 register mask stored in the instruction stream
at the function entry point(s).
to allocate, the ENTER and EXIT stuff works very well.
They are often, however, constrained by the processor-specific ABI
which defines the usage model for registers when multiple languages
are linked to provide code for an application.
When every ENTER instruction that calls the function has that mask,
there is the possibility of strange and difficult-to-locate errors
when a program links with a library function that was built earlier
or with a different version of a compiler (or even one for a different
language), and thus the mask is not necessarily correct for the latest
version of the called function.
On 2025-11-26 3:57 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
In this case, put the cause in a container the instruction drags down
the pipe, and retrieve it when you do have address access to where it
needs to go.
I may change things to pass the address around in the float package.
Putting the address into the NaN later may cause issues with timing. It
adds a mux into things. May be better to use the original NaN mux in the
float modules. May call it a NaN identity field instead of an address.
For example: when a My 66000 instruction needs to raise an exception
the Inst *I argument contains a field I->raised which is set (1<<excpt)
and at the end of the pipe (at retire), t->raised |= I->raised. Where
we have a *t there is also t->ip. So, you don't have to drag Thread *t through all the subroutine calls, but you can easily access t->raised
at the point you do have access to t->ip.
Had trouble reading that, sounds like goobly-goop. But I believe I
figured it out.
Sounds like the address is inserted at the end of the pipe which I am
sure is not the case.
I figured this out: the NaN address must be embedded in the result by
the time the result updates the bypass network and registers so that it
is available to other instructions.
The address is available at the start of the calc from the reservation station entry. Me thinks it must be embedded when the NaN result status
is set, provided there is not already a NaN. The existing (first) NaN
must propagate through.
Modified NaN support in the float package to store to the HOBs.
Survey says:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
{{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}
Because the stack is always DoubleWord aligned, the 3-LoBs of the
immediate are used to indicate "special" activities on a couple of
registers {R0, R31, R30}. R31 is rarely saved and reloaded from Stack
but just returned to its previous value by integer arithmetic. FP can
be updated or it can be treated like "just another register". R0 can
be loaded directly to t->ip, or loaded into R0 for stack walk-backs.
The corresponding LDM and STM are seldom used.
I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
FP (on the safe stack). A separate PUSH/POP on safe stack instruction is used.
I figured LDM and STM are not used often enough. PUSH / POP is used in
many places LDM / STM might be.
For context switching a whole bunch of load / store instructions are
used. There is context switching in only a couple of places.
I think the SP should be identified as PUSH / POP would be the only
instructions assuming the SP register. Otherwise any register could be
chosen by the compiler.
I started with that philosophy--and begrudgingly went away from it as
a) the compiler took form
b) we started adding instructions to ISA to remove instructions from
code footprint.
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-26 3:57 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
In this case, put the cause in a container the instruction drags down
the pipe, and retrieve it when you do have address access to where it
needs to go.
I may change things to pass the address around in the float package.
Putting the address into the NaN later may cause issues with timing. It
adds a mux into things. May be better to use the original NaN mux in the
float modules. May call it a NaN identity field instead of an address.
For example: when a My 66000 instruction needs to raise an exception
the Inst *I argument contains a field I->raised which is set (1<<excpt)
and at the end of the pipe (at retire), t->raised |= I->raised. Where
we have a *t there is also t->ip. So, you don't have to drag Thread *t
through all the subroutine calls, but you can easily access t->raised
at the point you do have access to t->ip.
Had trouble reading that, sounds like goobly-goop. But I believe I
figured it out.
Sounds like the address is inserted at the end of the pipe which I am
sure is not the case.
I figured this out: the NaN address must be embedded in the result by
the time the result updates the bypass network and registers so that it
is available to other instructions.
The address is available at the start of the calc from the reservation
station entry. Me thinks it must be embedded when the NaN result status
is set, provided there is not already a NaN. The existing (first) NaN
must propagate through.
See last calculation line in the following::
void RunInst( Chip *chip )
{
for( uint64_t i = 0; i < chip->cores; i++ )
{
ContextStack *cpu = &core[i];
uint8_t cs = cpu->cs;
Thread *t;
Inst *I;
uint16_t raised;
if( cpu->interrupt.raised & (((int64_t)1 << 63) >> cpu->priority) ) // int64_t avoids the 32-bit overflow of (signed)1<<63
{ // take an interrupt
cpu->cs = cpu->interrupt.cs;
cpu->priority = cpu->interrupt.priority;
t = context[cpu->cs];
t->reg[0] = cpu->interrupt.message;
}
else if( raised = t->raised & t->enabled )
{ // take an exception
cpu->cs--;
t = context[cpu->cs];
t->reg[0] = FT1( raised ) | EXCPT;
t->reg[1] = I->inst;
t->reg[2] = I->src1;
t->reg[3] = I->src2;
t->reg[4] = I->src3;
}
else
{ // run an instruction
t = context[cpu->cs];
memory( FETCH, t->ip, &I->inst );
t->ip += 4;
majorTable[ I->inst.major ]( t, I );
t->raised |= I->raised; // propagate raised here
}
}
}
Modified NaN support in the float package to store to the HOBs.
Survey says:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
{{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible
stack, while R1-to-Rstop are placed on the normal stack.}}
Because the stack is always DoubleWord aligned, the 3-LoBs of the
immediate are used to indicate "special" activities on a couple of
registers {R0, R31, R30}. R31 is rarely saved and reloaded from Stack
but just returned to its previous value by integer arithmetic. FP can
be updated or it can be treated like "just another register". R0 can
be loaded directly to t->ip, or loaded into R0 for stack walk-backs.
The corresponding LDM and STM are seldom used.
I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
FP (on the safe stack). A separate PUSH/POP on safe stack instruction is
used.
I figured LDM and STM are not used often enough. PUSH / POP is used in
many places LDM / STM might be.
It's a fine line.
I found more uses for an instruction that moves a number of registers randomly allocated to fixed positions (arguments to a call) than to
move a random string of registers to/from memory.
.
MOV R1,R10
MOV R2,R25
MOV R3,R17
CALL Subroutine
. ; deal with any result
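A sketch of the semantics of such a register-gather instruction (the
name and the 8-argument limit are invented): one operation carrying a
short list of source registers, delivered into the fixed argument slots.

#include <stdint.h>

/* Copy reg[src[0..n-1]] into the argument registers R1..Rn.
   All sources are read before any destination is written, so the
   result does not depend on overlap between sources and targets. */
static void mov_args(uint64_t reg[32], const unsigned src[], int n)
{
    uint64_t tmp[8];
    for (int i = 0; i < n && i < 8; i++)
        tmp[i] = reg[src[i]];
    for (int i = 0; i < n && i < 8; i++)
        reg[1 + i] = tmp[i];
}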
For context switching a whole bunch of load / store instructions are
used. There is context switching in only a couple of places.
I use a cache-model for thread-state {program-status-line and the
register file}.
The high-level simulator leaves all of the context in memory without
loading it or storing it. Thus this serves as a pipeline Oracle so if
the OoO pipeline makes a timing error, the Oracle stops the thread in
its tracks.
Thus::
.
.
-----interrupt detected
. change CS (cs--) <---
. access threadState[cs]
. t->ip = dispatcher
. t->reg[0] = why
dispatcher in control
.
.
.
RET
SVR
.
.
In your typical interrupt/exception control transfers, there is
no code to actually switch state. Just like there is no code to
switch a cache line that takes a miss.
(*) The cs-- is all that is necessary to change from one Thread State
to another in its entirety.
I think the SP should be identified as PUSH / POP would be the only
instructions assuming the SP register. Otherwise any register could be
chosen by the compiler.
I started with that philosophy--and begrudgingly went away from it as
a) the compiler took form
b) we started adding instructions to ISA to remove instructions from
code footprint.
In article <1763868010-5857@newsgrouper.org>,
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Robert Finch <robfi680@gmail.com> posted:
My float package puts the cause in the 3 LoBs. The cause is always
in the low order bits of the register then, even when the
precision is different. But the address is not tracked. The
package does not have access to the address. Seems like NaN trace
hardware might be useful.
Suggest you read::
https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
For conversation about LoBs versus HoBs.
I wasn't sure where to join the NaN conversation, but this seems like
a good spot.
We've had 40+ years of different architectures handling NaNs (what to
encode in them to indicate where the first problem occurred) and all
architectures do something different when operating on two NaNs:
From that paper:
- Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
- Intel using SSE instructions: NaN1
- AMD using x87 instructions: NaN2
- AMD using SSE instructions: NaN1
- IBM Power PC: NaN1
- IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
- ARM: NaN1 if both quiet, [precedence] to signalling NaN
And adding one more not in that paper:
- RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000
I'll just say whatever your NaN handling is, for the source code:
A = B + C + D + E
then for whatever values B,C,D,E having NaN or not, the value of A
should be well defined and not dependent on the order of operations.
How can you use bits in the NaN value for debugging if the hardware
is returning arbitrary results when NaNs collide? Users have almost
no control over whether A = B + C treats B as the first argument or
the second.
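A tiny illustration of the point (a sketch; which payload actually
survives depends on the target's NaN rule and on the code the compiler
emits):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static double qnan_with(uint64_t payload)   /* quiet NaN carrying a payload */
{
    uint64_t bits = 0x7FF8000000000000ull | payload;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

int main(void)
{
    double b = 1.0, c = qnan_with(1), d = 2.0, e = qnan_with(2);
    double a1 = ((b + c) + d) + e;   /* c's NaN arrives first */
    double a2 = ((e + d) + c) + b;   /* e's NaN arrives first */
    uint64_t r1, r2;
    memcpy(&r1, &a1, sizeof r1);
    memcpy(&r2, &a2, sizeof r2);
    /* Under a "first operand wins" rule, r1 keeps payload 1 and r2
       payload 2, although both sums add the same four values. */
    printf("%016llx %016llx\n", (unsigned long long)r1, (unsigned long long)r2);
    return 0;
}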
I think encoding stuff in NaN is a very 80's idea: turning on
exceptions costs performance, so we want to debug after-the-fact
using NaNs.
But I think RISC-V has the right modern idea: make hardware fast so
you can simply always enable Invalid Operation Traps (and maybe
Overflow, if infinities are happening), and then stop right at the
point of NaN being first created. So the NaN propagation doesn't
matter.
I think the common current debug strategy for NaNs is run at full
speed with exceptions masked, and if you get NaNs in your answer, you
re-run with exceptions on and then debug the traps that occur. And
no one looks at the NaN values at all, just their presence.
So rather than spending time on NaN encoding, make it so that FP
performance is not affected by enabling exceptions, so we can skip
the re-running step, and just run with Invalid Operations trapping
enabled. And then just return canonical NaNs.
Kent
scott@slp53.sl.home (Scott Lurndal) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
That seems both confining for the compiler designers and less
useful than the VAX-11 register mask stored in the instruction stream
at the function entry point(s).
We, and by that I mean Brian, have not found that so. In the early stages
we did see a bit of that, and then Brian found a way to allocate registers
from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially
nothing (that is, no more instructions in the stream than necessary).
Part of the distinction is::
a) how arguments/results are passed to/from subroutines.
b) having a minimum of 7-temporary registers at entry point.
c) how the stack frame is designed/allocated wrt:
1) my arguments and my results,
2) his arguments and his results,
3) varargs,
4) dynamic arrays on stack,
5) stack frame allocation at ENTER,
d) freedom to use R30 as FP or as joe-random-register.
These were all co-designed together, after much of the instruction
emission logic was sorted out.
Consider this as a VAX CALL model except that the mask was replaced by
a list of registers, which were then packed towards R31 instead of a bit
vector.
On 2025-11-27 10:50 a.m., Kent Dickey wrote:
I think encoding stuff in NaN is a very 80's idea: turning on exceptions
costs performance, so we want to debug after-the-fact using NaNs.
But I think RISC-V has the right modern idea: make hardware fast so
you can simply always enable Invalid Operation Traps (and maybe
Overflow, if infinities are happening), and then stop right at the
point of NaN being first created. So the NaN propagation doesn't matter.
I think the common current debug strategy for NaNs is run at full
speed with exceptions masked, and if you get NaNs in your answer, you
re-run with exceptions on and then debug the traps that occur. And no
one looks at the NaN values at all, just their presence.
So rather than spending time on NaN encoding, make it so that FP
performance is not affected by enabling exceptions, so we can skip
the re-running step, and just run with Invalid Operations trapping
enabled. And then just return canonical NaNs.
Kent
I do not know how one would make FP performance improve and have
exceptions at the same time. The FP would have to operate asynchronously.
The only thing I can think of is to have core(s) specifically dedicated
to performance FP that do not service interrupts.
On 2025-11-26 7:08 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-26 3:57 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
In this case, put the cause in a container the instruction drags down
the pipe, and retrieve it when you do have address access to where it
needs to go.
I may change things to pass the address around in the float package.
Putting the address into the NaN later may cause issues with timing. It
adds a mux into things. May be better to use the original NaN mux in the
float modules. May call it a NaN identity field instead of an address.
For example: when a My 66000 instruction needs to raise an exception
the Inst *I argument contains a field I->raised which is set (1<<excpt)
and at the end of the pipe (at retire), t->raised |= I->raised. Where
we have a *t there is also t->ip. So, you don't have to drag Thread *t
through all the subroutine calls, but you can easily access t->raised
at the point you do have access to t->ip.
Had trouble reading that, sounds like goobly-goop. But I believe I
figured it out.
Sounds like the address is inserted at the end of the pipe which I am
sure is not the case.
I figured this out: the NaN address must be embedded in the result by
the time the result updates the bypass network and registers so that it
is available to other instructions.
The address is available at the start of the calc from the reservation
station entry. Me thinks it must be embedded when the NaN result status
is set, provided there is not already a NaN. The existing (first) NaN
must propagate through.
See last calculation line in the following::
void RunInst( Chip *chip )
{
for( uint64_t i = 0; i < chip->cores; i++ )
{
ContextStack *cpu = &core[i];
uint8_t cs = cpu->cs;
Thread *t;
Inst *I;
uint16_t raised;
if( cpu->interrupt.raised & (((int64_t)1 << 63) >> cpu->priority) ) // int64_t avoids the 32-bit overflow of (signed)1<<63
{ // take an interrupt
cpu->cs = cpu->interrupt.cs;
cpu->priority = cpu->interrupt.priority;
t = context[cpu->cs];
t->reg[0] = cpu->interrupt.message;
}
else if( raised = t->raised & t->enabled )
{ // take an exception
cpu->cs--;
t = context[cpu->cs];
t->reg[0] = FT1( raised ) | EXCPT;
t->reg[1] = I->inst;
t->reg[2] = I->src1;
t->reg[3] = I->src2;
t->reg[4] = I->src3;
}
else
{ // run an instruction
t = context[cpu->cs];
memory( FETCH, t->ip, &I->inst );
t->ip += 4;
majorTable[ I->inst.major ]( t, I );
t->raised |= I->raised; // propagate raised here
}
}
}
That looks like code for a simulator.
How closely does it follow the operation of the CPU?
I do not see where 'I' is initialized.
It has been a while since I worked on simulator code.
The IP value is just muxed in via a five-to-one mux for the significand.
Had to account for NaNs, infinities, and overflow anyway. The address gets
propagated with some flops, but flops are inexpensive in an FPGA.
always_comb
	casez({aNan5,bNan5,qNaNOutab5,aInf5,bInf5,overab5})
	6'b1?????: moab6 <= {1'b1,1'b1,a5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
	6'b01????: moab6 <= {1'b1,1'b1,b5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
	6'b001???: moab6 <= {1'b1,qNaN|(64'd4 << (fp64Pkg::FMSB-4))|adr5[63:16],{fp64Pkg::FMSB+1{1'b0}}}; // multiply inf * zero
	6'b0001??: moab6 <= 0; // mul inf's
	6'b00001?: moab6 <= 0; // mul inf's
	6'b000001: moab6 <= 0; // mul overflow
	default: moab6 <= fractab5;
	endcase
Modified NaN support in the float package to store to the HOBs.
Survey says:
The Qupls PUSH and POP instructions have room for six register fields.
Should one of the fields be used to identify the stack pointer register
allowing five registers to be pushed or popped? Or should the stack
pointer register be assumed so that six registers may be pushed or popped?
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
{{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible
stack, while R1-to-Rstop are placed on the normal stack.}}
Because the stack is always DoubleWord aligned, the 3-LoBs of the
immediate are used to indicate "special" activities on a couple of
registers {R0, R31, R30}. R31 is rarely saved and reloaded from Stack
but just returned to its previous value by integer arithmetic. FP can
be updated or it can be treated like "just another register". R0 can
be loaded directly to t->ip, or loaded into R0 for stack walk-backs.
The corresponding LDM and STM are seldom used.
I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
FP (on the safe stack). A separate PUSH/POP on safe stack instruction is
used.
I figured LDM and STM are not used often enough. PUSH / POP is used in
many places LDM / STM might be.
It's a fine line.
I found more uses for an instruction that moves a number of registers randomly allocated to fixed positions (arguments to a call) than to
move a random string of registers to/from memory.
.
MOV R1,R10
MOV R2,R25
MOV R3,R17
CALL Subroutine
. ; deal with any result
My 66000 has an instruction to do that?
I'd not seen an instruction like that. It is almost like a byte map. I can see how it could be done.
Another instruction to add to the ISA. My compiler does not do such a
nice job of packing the register moves together though.
For context switching a whole bunch of load / store instructions are
used. There is context switching in only a couple of places.
I use a cache-model for thread-state {program-status-line and the
register file}.
The high-level simulator leaves all of the context in memory without
loading it or storing it. Thus this serves as a pipeline Oracle so if
the OoO pipeline makes a timing error, the Oracle stops the thread in
its tracks.
Thus::
.
.
-----interrupt detected
. change CS (cs--) <---
. access threadState[cs]
. t->ip = dispatcher
. t->reg[0] = why
dispatcher in control
.
.
.
RET
SVR
.
.
In your typical interrupt/exception control transfers, there is
no code to actually switch state. Just like there is no code to
switch a cache line that takes a miss.
The My 66000 hardware takes care of it automatically? Interrupts push
and pop context in my system.
(*) The cs-- is all that is necessary to change from one Thread State
to another in its entirety.
I think the SP should be identified as PUSH / POP would be the only
instructions assuming the SP register. Otherwise any register could be
chosen by the compiler.
I started with that philosophy--and begrudgingly went away from it as
a) the compiler took form
b) we started adding instructions to ISA to remove instructions from
code footprint.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
scott@slp53.sl.home (Scott Lurndal) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
instead of giving it a number of registers, there is a start register
and a stop register, so 1-to-32 registers can be saved/restored. The
immediate contains how much stack space to allocate/deallocate.
That seems both confining for the compiler designers and less
useful than the VAX-11 register mask stored in the instruction stream
at the function entry point(s).
We, and by that I mean Brian, have not found that so. In the early stages
we did see a bit of that, and then Brian found a way to allocate registers
from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially
nothing (that is, no more instructions in the stream than necessary).
Part of the distinction is::
a) how arguments/results are passed to/from subroutines.
b) having a minimum of 7-temporary registers at entry point.
c) how the stack frame is designed/allocated wrt:
1) my arguments and my results,
2) his arguments and his results,
3) varargs,
4) dynamic arrays on stack,
5) stack frame allocation at ENTER,
d) freedom to use R30 as FP or as joe-random-register.
These were all co-designed together, after much of the instruction
emission logic was sorted out.
What is "my" and "his"?
Consider this as a VAX CALL model except that the mask was replaced by
a list of registers, which were then packed towards R31 instead of a bit
vector.
Do you need both a start and a stop register?
As far as I understand, ENTER is at the entry point of the callee, and
EXIT is before the return or tail call; actually, the tail call case
answers my question above:
If the tail-caller has m callee-saved registers and the tail-callee
has n callee-saved registers, then
if m>n, generate an EXIT that restores the m-n registers;
if m<n, generate an ENTER that saves the n-m registers;
Generate a jump to behind the ENTER instruction of the callee.
That is, assuming that the tail-callee is in the same compilation unit
as the tail-caller; otherwise the tail-caller needs to do a full EXIT
and then jump to the normal entry point of the tail-callee, which does
a full ENTER.
And in these ENTERs and EXITs, you don't end (or start) at the same
point as in the regular ENTERs and EXITs.
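A minimal C sketch of that tail-call rule, with hypothetical emitter stubs
standing in for a real code generator (all names invented):

#include <stdio.h>

static void emit_exit(int k)         { printf("EXIT  restore %d regs\n", k); }
static void emit_enter(int k)        { printf("ENTER save %d regs\n", k); }
static void emit_jump(const char *l) { printf("JMP   %s\n", l); }

/* Tail call from a caller with m callee-saved registers to a callee
   with n, in the same compilation unit: restore the surplus or save
   the shortfall, then jump behind the callee's ENTER. */
static void emit_tail_call(int m, int n)
{
    if (m > n)
        emit_exit(m - n);
    else if (m < n)
        emit_enter(n - m);
    emit_jump("callee.past_enter");
}

int main(void)
{
    emit_tail_call(5, 3);   /* caller saved more: partial EXIT */
    emit_tail_call(2, 4);   /* callee needs more: partial ENTER */
    return 0;
}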
And yes, for saving the callee-saved registers I don't see a need for
a mask. For caller-saved registers, it's different. Consider:
long foo(...)
{
long x = ...;
long y = ...;
long z = ...;
if (...) {
bar(...);
x = ...;
} else if (...){
baz(...);
y = ...;
} else {
bla(...);
z = ...;
}
return x+y+z;
}
Here one could put x, y, and z in callee-saved registers (and use ENTER
and EXIT for them), but that would need to save and later restore
three registers on every path through foo().
Or one could put it in caller-saved registers and save only two
registers on every path through foo(). Then one needs to save y and z
around the call to bar(), x and z around the call to baz(), and x and
y around the call to bla(). For any register allocation, in one of
the cases the registers to be saved are not contiguous. So if one
would use a save-multiple or load-multiple instruction for that, a
mask would be needed.
- anton
On 2025-11-27 10:50 a.m., Kent Dickey wrote:
In article <1763868010-5857@newsgrouper.org>,
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Robert Finch <robfi680@gmail.com> posted:
My float package puts the cause in the 3 LoBs. The cause is always in
the low order bits of the register then, even when the precision is
different. But the address is not tracked. The package does not have
access to the address. Seems like NaN trace hardware might be useful.
Suggest you read::
https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
For conversation about LoBs versus HoBs.
I wasn't sure where to join the NaN conversation, but this seems like a good spot.
We've had 40+ years of different architectures handling NaNs, (what to encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:
From that paper:
- Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
- Intel using SSE instructions: NaN1
- AMD using x87 instructions: NaN2
- AMD using SSE instructions: NaN1
- IBM Power PC: NaN1
- IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
- ARM: NaN1 if both quiet, [precedence] to signalling NaN
And adding one more not in that paper:
- RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000
I'll just say whatever your NaN handling is, for the source code:
A = B + C + D + E
then for whatever values B,C,D,E have, NaN or not, the value of A should be well defined and not dependent on the order of operations. How can you use bits in the NaN value for debugging if the hardware is returning arbitrary
results when NaNs collide? Users have almost no control over whether
A = B + C treats B as the first argument or the second.
I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
But I think RISC-V has the right modern idea: make hardware fast so you can simply always enable Invalid Operation Traps (and maybe Overflow, if infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.
I think the common current debug strategy for NaNs is run at full speed with exceptions masked, and if you get NaNs in your answer, you re-run
with exceptions on and then debug the traps that occur. And no one looks at
the NaN values at all, just their presence.
So rather than spending time on NaN encoding, make it so that FP performance
is not affected by enabling exceptions, so we can skip the re-running step, and just run with Invalid Operations trapping enabled. And then just return canonical NaNs.
Kent
I do not know how one would make FP performance improve and have
exceptions at the same time. The FP would have to operate asynchronously.
The only thing I can think of is to have core(s) specifically dedicated
to performance FP that do not service interrupts.
Given that nobody looks at the NaN values it is tempting to leave out
the NaN info, but I think I will still have it as an input to modules
where NaNs can be generated (when I get around to it). The NaN info can always be set to zeros and the extra logic should then disappear.
I think that there may be a reason why nobody looks at the NaN values.
IDK, but maybe the debugger does not make them easy to spot. A NaN display
with a random assortment of digits is pretty useless. But if the debugger
were to display all the address and other info, would it get used?
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Just today, I compiled
u4 = u1/10;
u3 = u1%10;
(plus some surrounding code) with gcc-14 in three contexts. Here's
the code for two of them (the third one is similar to the second one):
Care to present a self-contained example? Otherwise, your
example and its analysis are meaningless to the reader.
I doubt that a self-contained example will be more meaningful to all
but the most determined readers, but anyway, the preprocessed C code is at
https://www.complang.tuwien.ac.at/anton/tmp/engine-fast.i
Robert Finch wrote:
On 2025-11-27 10:50 a.m., Kent Dickey wrote:
I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
But I think RISC-V has the right modern idea: make hardware fast so
you can
simply always enable Invalid Operation Traps (and maybe Overflow, if
infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.
I think the common current debug strategy for NaNs is run at full speed
with exceptions masked, and if you get NaNs in your answer, you re-run
with exceptions on and then debug the traps that occur. And no one
looks at
the NaN values at all, just their presence.
So rather than spending time on NaN encoding, make it so that FP
performance
is not affected by enabling exceptions, so we can skip the re-running
step,
and just run with Invalid Operations trapping enabled. And then just
return canonical NaNs.
Kent
I do not know how one would make FP performance improve and have exceptions at the same time. The FP would have to operate asynchronously. The only thing I can think of is to have core(s) specifically dedicated
to performance FP that do not service interrupts.
Why do you think that enabling FP exceptions "costs performance",
by which I assume you mean that, say, an FPADD with exceptions
enabled is slower than disabled?
The FP exceptions are rising-edge triggered based on individual
instruction calculation status, that is, before being merged (OR'd)
into the overall FP status. If an FP instruction has unmasked exceptions
then mark the uOp as Except'd and recognize it at Retire like any
other exception. This also assumes that the overall FP status is
updated (merged) at Retire so it only contains status flags for
FP instructions older than the retire point.
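As a rough C model of that scheme (field and function names are
illustrative, not EricP's actual design):

#include <stdbool.h>
#include <stdint.h>

/* Per-uOp FP status, kept separate until retire. */
typedef struct {
    uint8_t fp_flags;   /* this instruction's own IEEE status flags */
    bool    excepted;   /* an unmasked flag rose at execute */
} uop_t;

static uint8_t fpscr_flags;      /* architectural sticky status */
static uint8_t fp_trap_enables;  /* which exceptions are unmasked */

/* At execute: rising-edge check against the instruction's own status,
   before any merge into the overall FP status. */
void fp_execute(uop_t *u, uint8_t result_flags)
{
    u->fp_flags = result_flags;
    u->excepted = (result_flags & fp_trap_enables) != 0;
}

/* At retire: either take the trap like any other exception, or merge
   this instruction's flags, so FPSCR only ever reflects instructions
   older than the retire point. */
bool fp_retire(uop_t *u)
{
    if (u->excepted)
        return false;            /* raise a precise trap at retire */
    fpscr_flags |= u->fp_flags;
    return true;
}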
(Looking at your
code, it also does not seem to be self-sufficient, at least the
numerous SKIP4 statements require something else).
My assumption is that the control flow is confusing gcc.
For this
to be fixed, somebody with knowledge of the code would need to
cut this down to something that still exhibits the behavior, and
that can be reduced further with cvise (or delta, but cvise is
usually much better).
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
guessed 40 instructions would likely be enough for many cases where IRQs
are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
larger down-count starts to impact the IRQ latency.
Tradeoffs…
I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.
Complex…
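For concreteness, a toy C model of the deferral counter as described
above (names and the reload policy are my reading of the post, not the
actual Qupls4 RTL):

typedef struct {
    int count;                /* clocks left before IRQs may be taken */
} irq_gate;

enum { DEFER_CLOCKS = 10 };   /* ~40 instructions, presumably 4/clock */

/* Called on each clock in which the front end advances.  Returns
   nonzero if a pending IRQ may be accepted this cycle. */
int irq_accept(irq_gate *g, int irq_pending, int irqs_enabled)
{
    if (g->count > 0) {
        g->count--;               /* still deferring */
        return 0;
    }
    if (irq_pending && !irqs_enabled) {
        g->count = DEFER_CLOCKS;  /* deferred: back off before retrying */
        return 0;
    }
    return irq_pending && irqs_enabled;
}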
kegs@provalid.com (Kent Dickey) posted:
[snip]
In article <1763868010-5857@newsgrouper.org>,
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Robert Finch <robfi680@gmail.com> posted:
My float package puts the cause in the 3 LoBs. The cause is always in
the low order bits of the register then, even when the precision is
different. But the address is not tracked. The package does not have
access to the address. Seems like NaN trace hardware might be useful.
Suggest you read::
https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
For conversation about LoBs versus HoBs.
I'll just say whatever your NaN handling is, for the source code:
A = B + C + D + E
then for whatever values B,C,D,E have, NaN or not, the value of A should
be well defined and not dependent on the order of operations.
A nice philosophy, but how does one achieve that when the compiler is allowed to encode the above as::
A = (B+C)+(D+E)
or
A = (B+D)+(C+E)
or
A = (B+E)+(C+D)
or
A = (B+C)+(E+D)
or
...
No single set of rules can give the first created NaN because which
is first created is dependent on how the compiler ordered the FADDs.
On 11/29/2025 6:29 AM, Robert Finch wrote:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being
deferred because interrupts got disabled by an instruction in the
pipeline. I guessed 40 instructions would likely be enough for many
cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not
be committed because the IRQs got disabled in the meantime. If the CPU
were allowed to accept another IRQ right away, it could get stuck in a
loop flushing the pipeline and reloading with the ISR routine code
instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the
CPU would advance in 40 instruction burps. Alternating between
fetching ISR instructions and the desired instruction stream. On the
other hand, a larger down-count starts to impact the IRQ latency.
Tradeoffs…
I suppose I could have the CPU increase the down-count if it is
looping around fetching ISR instructions. The down-count would be
reset to the minimum again once an interrupt enable instruction is
executed.
Complex…
A simple alternative that I have seen is to have an instruction that
enables interrupts and jumps to somewhere, probably either the
interrupted code or the dispatcher that might do a full context switch.
The ISR would issue this instruction when it has saved everything that
is necessary to handle the interrupt and thus could be interrupted
again. This minimizes the time interrupts are locked out without the
need for an arbitrary timer, etc.
This is my point: I don't see a great way to encode the first NaN, which
is why I propose not making that a goal. You're not getting the first
NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write assembly code, or the most tedious source code imaginable.
Kent Dickey <kegs@provalid.com> schrieb:
This is my point: I don't see a great way to encode the first NaN, which
is why I propose not making that a goal. You're not getting the first
NaN in any case even if you try to do so in hardware, since the order of
operations is a fragile thing that's hard to control unless you write
assembly code, or the most tedious source code imaginable.
Using Fortran, parentheses have to be honored. If you write
A = (B + C) + (D + E)
then B + C and D + E have to be calculated before the total sum.
If you write
A = B + (C + (D + E))
then you prescribe the order completely.
I can imagine source code that is much more tedious than this :-)
Robert Finch <robfi680@gmail.com> posted:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred
because interrupts got disabled by an instruction in the pipeline. I
guessed 40 instructions would likely be enough for many cases where IRQs
are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be
committed because the IRQs got disabled in the meantime. If the CPU were
allowed to accept another IRQ right away, it could get stuck in a loop
flushing the pipeline and reloading with the ISR routine code instead of
progressing through the code where IRQs were disabled.
The above is one of the reasons EricP supports the pipeline notion that
interrupts do NOT flush the pipe. Instead, the instructions in the pipe
are allowed to retire (apace) and new instructions are inserted from
the interrupt service point.
As long as the instructions in the pipe
can deliver their results to their registers, and update µArchitectural
state they "own", there is no reason to flush--AND--no corresponding
reason to delay "taking" the interrupt.
At the µArchitectural level, you, the designer, see both the front
and the end of the pipeline, you can change what goes in the front
and allow what was already in the pipe to come out the back. This
requires dragging a small amount of information down the pipe, much
like multi-threaded CPUs.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
Make the problem "go away". You will be happier in the end.
It is possible that 40 instructions is not enough. In that case the CPU
would advance in 40 instruction burps. Alternating between fetching ISR
instructions and the desired instruction stream. On the other hand, a
larger down-count starts to impact the IRQ latency.
Tradeoffs…
I suppose I could have the CPU increase the down-count if it is looping
around fetching ISR instructions. The down-count would be reset to the
minimum again once an interrupt enable instruction is executed.
Complex…
Thomas Koenig wrote:
Kent Dickey <kegs@provalid.com> schrieb:
This is my point: I don't see a great way to encode the first NaN, which is why I propose not making that a goal. You're not getting the first
NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write
assembly code, or the most tedious source code imaginable.
Using Fortran, parentheses have to be honored. If you write
A = (B + C) + (D + E)
then B + C and D + E have to be calculated before the total sum.
If you write
A = B + (C + (D + E))
then you prescribe the order completely.
I can imagine source code that is much more tedious than this :-)
That doesn't control which variable is assigned to each source operand.
If both operands were NaNs and the two-NaN rule was "always take src1",
then the choice of which to propagate would still be non-deterministic.
On 2025-11-29 2:05 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred
because interrupts got disabled by an instruction in the pipeline. I
guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop
flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.
The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
are allowed to retire (apace) and new instructions are inserted from
the interrupt service point.
That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
ISR instructions need to be flushed.
As long as the instructions "IN" the pipe
can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
reason to delay "taking" the interrupt.
That is the usual case for Qupls too when there is an interrupt.
At the µArchitectural level, you, the designer, see both the front
and the end of the pipeline, you can change what goes in the front
and allow what was already in the pipe to come out the back. This
requires dragging a small amount of information down the pipe, much
like multi-threaded CPUs.
Yes, the IRQ info is being dragged down the pipe.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
Make the problem "go away". You will be happier in the end.
It is possible that 40 instructions is not enough. In that case the CPU
would advance in 40 instruction burps. Alternating between fetching ISR
instructions and the desired instruction stream. On the other hand, a
larger down-count starts to impact the IRQ latency.
Tradeoffs…
I suppose I could have the CPU increase the down-count if it is looping
around fetching ISR instructions. The down-count would be reset to the
minimum again once an interrupt enable instruction is executed.
Complex…
The interrupt mask is set at fetch time to disable lower priority interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
than the current level.
I had thought the OS might have good reason to disable interrupts. But
maybe I am making things too complex.
Robert Finch wrote:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.
Tradeoffs…
I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.
Complex…
You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.
I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt enable or disable instructions, or branch mispredicts, or pending exceptions in-flight, they are all allowed to finish and the state settles down.
Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.
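A rough C sketch of that drain sequencing (state names and helper
functions are invented for illustration):

#include <stdio.h>

static void emit_uop_single_step(void)      { puts("uop: single-step"); }
static void redirect_fetch_to_handler(void) { puts("fetch -> handler"); }

enum state { RUN, DRAINING };
static enum state st = RUN;

/* One decode-stage decision per clock: on an IRQ, emit the marker uOp,
   stall decode behind it, and wait for the old stream to retire before
   redirecting fetch to the interrupt handler. */
void decode_cycle(int irq_pending, int pipeline_empty)
{
    switch (st) {
    case RUN:
        if (irq_pending) {
            emit_uop_single_step(); /* last uOp of the old stream */
            st = DRAINING;
        }
        break;
    case DRAINING:
        if (pipeline_empty) {       /* old stream fully settled */
            redirect_fetch_to_handler();
            st = RUN;
        }
        break;
    }
}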
On 2025-11-29 4:10 p.m., EricP wrote:
Robert Finch wrote:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being
deferred because interrupts got disabled by an instruction in the
pipeline. I guessed 40 instructions would likely be enough for many
cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not
be committed because the IRQs got disabled in the meantime. If the CPU
were allowed to accept another IRQ right away, it could get stuck in a
loop flushing the pipeline and reloading with the ISR routine code
instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the
CPU would advance in 40 instruction burps. Alternating between
fetching ISR instructions and the desired instruction stream. On the
other hand, a larger down-count starts to impact the IRQ latency.
Tradeoffs…
I suppose I could have the CPU increase the down-count if it is
looping around fetching ISR instructions. The down-count would be
reset to the minimum again once an interrupt enable instruction is
executed.
Complex…
You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.
The down-count counts down only when the front end of the pipeline advances, so instructions are sure to be loaded.
I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt enable or disable instructions, or branch mispredicts, or pending exceptions
in-flight, they are all allowed to finish and the state settles down.
Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.
The base down count increases every time the IRQ is found at the commit stage. If the base down count is too large (stuck interrupt) then an exception is processed. For instance if interrupts were disabled for
1000 clocks.
I think the mechanism could work, complicated though.
Treating the DI as an exception, as mentioned in another post, would also
work. It is a matter then of flushing the instructions between the DI
and ISR.
Thomas Koenig <tkoenig@netcologne.de> writes:
(Looking at your
code, it also does not seem to be self-sufficient, at least the
numerous SKIP4 statements require something else).
If you want to assemble the resulting .S file, it's assembled once
with
-DSKIP4= -Dgforth_engine2=gforth_engine
and once with
-DSKIP4=".skip 4"
(on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
and may be different on other platforms).
My assumption is that the control flow is confusing gcc.
My guess is the same.
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-29 4:10 p.m., EricP wrote:
Robert Finch wrote:
[snip]
You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.
The down-count counts down only when the front end of the pipeline
advances, so instructions are sure to be loaded.
I was thinking a simple and cheap way would be to use a variation of the
single-step mechanism. An interrupt request would cause Decode to emit a
special uOp with the single-step flag set and then stall, to allow the
pipeline to drain the old stream before accepting the interrupt and
redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending
exceptions in-flight, they are all allowed to finish and the state
settles down.
Pipelining interrupt delivery looks possible but gets complicated and
expensive real quick.
The base down count increases every time the IRQ is found at the commit
stage. If the base down count is too large (stuck interrupt) then an
exception is processed. For instance if interrupts were disabled for
1000 clocks.
I think the mechanism could work, complicated though.
Treating the DI as an exception, as mentioned in another post, would also
work. It is a matter then of flushing the instructions between the DI
and ISR.
Which is no different than flushing instructions after a mispredicted branch.
Got fed up with trying to work out how to get interrupts working. It turns
out to be more challenging than I expected, no matter which way it is
done. So, I decided to just poll for interrupts, getting rid of most of
the IRQ logic. I added a branch-on-interrupt BOI instruction that works almost the same way as every other branch. Then the micro-op translator
has been adapted to insert a polling branch periodically. It looks a lot simpler.
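A sketch of what that periodic insertion might look like, counted here
in translated instructions for simplicity (the interval and all names
are invented, not the Qupls4 source):

#include <stdio.h>

enum { POLL_INTERVAL = 16 };    /* translated insns between polls */

static int since_poll;

static void emit_boi(void)      { puts("uop: BOI poll"); }
static void emit_uops(int insn) { printf("uop(s) for insn %d\n", insn); }

static void translate(int insn)
{
    if (++since_poll >= POLL_INTERVAL) {
        emit_boi();             /* redirects to the handler only if an
                                   IRQ is pending; otherwise it falls
                                   through like any other branch */
        since_poll = 0;
    }
    emit_uops(insn);
}

int main(void)
{
    for (int i = 0; i < 40; i++)
        translate(i);
    return 0;
}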
Robert Finch <robfi680@gmail.com> schrieb:
Got fed up with trying to work out how to get interrupts working. It turns
out to be more challenging than I expected, no matter which way it is
done. So, I decided to just poll for interrupts, getting rid of most of
the IRQ logic. I added a branch-on-interrupt BOI instruction that works
almost the same way as every other branch. Then the micro-op translator
has been adapted to insert a polling branch periodically. It looks a lot
simpler.
What is the expected delay until an interrupt is delivered?
On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Got fed up with trying to work out how to get interrupts working. It turns
out to be more challenging than I expected, no matter which way it is
done. So, I decided to just poll for interrupts, getting rid of most of
the IRQ logic. I added a branch-on-interrupt BOI instruction that works
almost the same way as every other branch. Then the micro-op translator
has been adapted to insert a polling branch periodically. It looks a lot simpler.
What is the expected delay until an interrupt is delivered?
I set the timing to 16 clocks, which is about 64 (or more) instructions.
I did not want to go much over 1% of the number of instructions executed.
Not every instruction inserts a poll, so sometimes a poll is lacking.
IDK how well it will work. Making it an instruction means it might also
be used by software.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Both our guesses were wrong, and Scott (I think) was on the right
track - this is a signed / unsigned issue. A reduced test case is
void bar(unsigned long, long);
void foo(unsigned long u1)
{
long u3;
u1 = u1 / 10;
u3 = u1 % 10;
bar(u1,u3);
}
This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
100 labels in symbols. This may appear strange, but gcc generally tends to produce good code in relatively short time for Gforth (while
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Both our guesses were wrong, and Scott (I think) was on the right
track - this is a signed / unsigned issue. A reduced test case is
void bar(unsigned long, long);
void foo(unsigned long u1)
{
long u3;
u1 = u1 / 10;
u3 = u1 % 10;
bar(u1,u3);
}
Assigning to u1 changed the meaning, as Andrew Pinski noted;
so the
jury is still out on what the actual problem is.
This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
and a revised one at
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>
(The announced attachment is not there yet.)
The latter case is interesting, because real_ca and spc became global,
and symbols[] is still local, and no assignment to real_ca happens
inside foo().
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
Sure, if a CS graduate works in an application area, they need to
learn about that application area, whatever it is.
It's useful for code optimization, as well.
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Processor pipelines are not the basics of what a CS graduate is doing.
They are an implementation detail in computer engineering.
Which affect the performance of the software created by the
software engineer (CS graduate).
A few more examples where compilers are not as good as even I expected:
Just today, I compiled
u4 = u1/10;
u3 = u1%10;
(plus some surrounding code) with gcc-14 in three contexts. Here's
the code for two of them (the third one is similar to the second one):
Left context:                      Right context:
movabs $0xcccccccccccccccd,%rax    movabs $0xcccccccccccccccd,%rsi
sub    $0x8,%r13                   mov    %r8,%rax
mul    %r8                         mov    %r8,%rcx
mov    %rdx,%rax                   mul    %rsi
shr    $0x3,%rax                   shr    $0x3,%rdx
lea    (%rax,%rax,4),%rdx          lea    (%rdx,%rdx,4),%rax
add    %rdx,%rdx                   add    %rax,%rax
sub    %rdx,%r8                    sub    %rax,%r8
mov    %r8,0x8(%r13)               mov    %rcx,%rax
mov    %rax,%r8                    mul    %rsi
                                   shr    $0x3,%rdx
                                   mov    %rdx,%r9
The major difference is that in the left context, u3 is stored into
memory (at 0x8(%r13)), while in the right context, it stays in a
register. In the left context, gcc managed to base its computation of
u1%10 on the result of u1/10; in the right context, gcc first computes
u1%10 (computing u1/10 as part of that), and then computes u1/10
again.
Sort of emphasizes that programmers need to understand the
underlying hardware.
What were u1, u3 and u4 declared as?
In reducing compiler bugs, automated tools such as delta or
(much better) cvise are essential. Your test case was so
large that cvise failed, so a lot of manual work was required.
Thomas Koenig <tkoenig@netcologne.de> writes:
In reducing compiler bugs, automated tools such as delta or
(much better) cvise are essential. Your test case was so
large that cvise failed, so a lot of manual work was required.
I have now done a manual reduction myself; essentially I left only the
3 variants of the VM instruction that performs 10/, plus all the surroundings, and I added code to ensure that spTOS, spb, and spc are
not dead. You find the result at
http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i
ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
Sure, if a CS graduate works in an application area, they need to
learn about that application area, whatever it is.
It's useful for code optimization, as well.
In what way?
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Certainly. But do they need to know the difference between a Wallace
multiplier and a Dadda multiplier?
If not, what is it about pipelined processors
that would require CS graduates to know about them?
Processor pipelines are not the basics of what a CS graduate is doing. They are an implementation detail in computer engineering.
Which affect the performance of the software created by the
software engineer (CS graduate).
By a constant factor; and the software creator does not need to know
that the CPU that executes instructions at 2 CPI (486) instead of at
10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
VAX are irrelevant to software creators.
A few more examples where compilers are not as good as even I expected:
Just today, I compiled
u4 = u1/10;
u3 = u1%10;
(plus some surrounding code) with gcc-14 in three contexts. Here's
the code for two of them (the third one is similar to the second one):
Left context:                      Right context:
movabs $0xcccccccccccccccd,%rax    movabs $0xcccccccccccccccd,%rsi
sub    $0x8,%r13                   mov    %r8,%rax
mul    %r8                         mov    %r8,%rcx
mov    %rdx,%rax                   mul    %rsi
shr    $0x3,%rax                   shr    $0x3,%rdx
lea    (%rax,%rax,4),%rdx          lea    (%rdx,%rdx,4),%rax
add    %rdx,%rdx                   add    %rax,%rax
sub    %rdx,%r8                    sub    %rax,%r8
mov    %r8,0x8(%r13)               mov    %rcx,%rax
mov    %rax,%r8                    mul    %rsi
                                   shr    $0x3,%rdx
                                   mov    %rdx,%r9
The major difference is that in the left context, u3 is stored into memory (at 0x8(%r13)), while in the right context, it stays in a register. In the left context, gcc managed to base its computation of u1%10 on the result of u1/10; in the right context, gcc first computes u1%10 (computing u1/10 as part of that), and then computes u1/10
again.
Sort of emphasizes that programmers need to understand the
underlying hardware.
I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
suboptimal code in some cases?
What were u1, u3 and u4 declared as?
unsigned long (on that platform).
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
Sure, if a CS graduate works in an application area, they need to
learn about that application area, whatever it is.
It's useful for code optimization, as well.
In what way?
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Certainly. But do they need to know the difference between a Wallace
multiplier and a Dadda multiplier?
You do realize that all Wallace multipliers are Dadda multipliers ??
But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!
If not, what is it about pipelined processors
that would require CS graduates to know about them?
How execution order disturbs things like program order and memory order.
That is how and when they need to insert Fences in their multi-threaded
code.
I am the programmer of the code shown above. In what way would better
knowledge of the hardware have made me aware that gcc would produce
suboptimal code in some cases?
Reading and thinking about the asm-code and running the various code sequences enough times that you can measure which is better and which
is worse. That is the engineering part of software Engineering.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
In reducing compiler bugs, automated tools such as delta or
(much better) cvise are essential. Your test case was so
large that cvise failed, so a lot of manual work was required.
I have now done a manual reduction myself; essentially I left only the
3 variants of the VM instruction that performs 10/, plus all the
surroundings, and I added code to ensure that spTOS, spb, and spc are
not dead. You find the result at
http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i
Do you have an example which tests the codepath taken for the
offending piece of code,
so it is possible to further reduce this
case automatically? The example is still quite big (>13000 lines).
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:
scott@slp53.sl.home (Scott Lurndal) writes:
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Certainly. But do they need to know the difference between a Wallace
multiplier and a Dadda multiplier?
You do realize that all Wallace multipliers are Dadda multipliers ??
But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!
Good to know, but does not answer the question.
If not, what is it about pipelined processors
that would require CS graduates to know about them?
How execution order disturbs things like program order and memory order. That is how and when they need to insert Fences in their multi-threaded code.
And the relevance of pipelined processors for that issue is what?
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
If you implement per-CPU caches and multiple memory controllers as shoddily
as possible while providing features for programs to slow themselves
down heavily in order to get memory-ordering guarantees, then you get
a weak memory model; slightly less shoddy, and you get a "strong" memory model. Processor pipelines have no relevance here.
And, as Niklas Holsti observed, dealing with memory-ordering
shenanigans is something that a few specialists do; no need for others
to know about the memory model, except that common CPUs unfortunately
do not implement sequential consistency.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
And without the SuperComputer attitude, you sell 0 parts.
{Remember how we talk about performance all the time here ?}
And only after several languages built their own ATOMIC primitives, so
the programmers could remain ignorant. But this also ties the hands of
the designers in such a way that performance grows ever more slowly
with more threads.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
And without the SuperComputer attitude, you sell 0 parts.
{Remember how we talk about performance all the time here ?}
Wrong. The supercomputer attitude gave us such wonders as IA-64
(sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
only easier to program, but also faster.
The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
fences on hardware optimized for a weaker memory model. But that's
not the way to implement efficient sequential consistency.
In an alternate reality where AMD64 did not happen and IA-64 won,
people would justify the IA-64 ISA complexity as necessary for
performance, and claim that the IA-32 hardware in the Itanium
demonstrates the performance superiority of the EPIC approach, just
like they currently justify the performance superiority of weak and
"strong" memory models over sequential consistency.
If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
A similar case: Alpha includes a trapb instruction (an exception
fence). Programmers have to insert it after FP instructions to get
precise exceptions. This was justified with performance; i.e., the
theory went: If you compile without trapb, you get performance and
imprecise exceptions, if you compile with trapb, you get slowness and
precise exceptions. I then measured SPEC 95 compiled without and with
trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
there was hardly any difference; I believe that trapb is a noop on the
21264. Here's the SPECfp_base95 numbers:
 with   without
 trapb  trapb
  9.56  11.6    AlphaPC164LX 600MHz 21164A
 19.7   20.0    Compaq XP1000 500MHz 21264
So the machine that needs trapb is much slower even without trapb than
even the with-trapb variant on the machine where trapb is probably a
noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
And without the SuperComputer attitude, you sell 0 parts.
{Remember how we talk about performance all the time here ?}
Wrong. The supercomputer attitude gave us such wonders as IA-64
(sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
only easier to program, but also faster.
The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
fences on hardware optimized for a weaker memory model. But that's
not the way to implement efficient sequential consistency.
In an alternate reality where AMD64 did not happen and IA-64 won,
people would justify the IA-64 ISA complexity as necessary for
performance, and claim that the IA-32 hardware in the Itanium
demonstrates the performance superiority of the EPIC approach, just
like they currently justify the performance superiority of weak and
"strong" memory models over sequential consistency.
If hardware designers put their mind to it, they could make sequential consistency perform well,
probably better on code that actually
accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
A similar case: Alpha includes a trapb instruction (an exception
fence). Programmers have to insert it after FP instructions to get
precise exceptions. This was justified with performance; i.e., the
theory went: If you compile without trapb, you get performance and
imprecise exceptions, if you compile with trapb, you get slowness and
precise exceptions. I then measured SPEC 95 compiled without and with
trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
there was hardly any difference; I believe that trapb is a noop on the
21264. Here's the SPECfp_base95 numbers:
 with   without
 trapb  trapb
  9.56  11.6    AlphaPC164LX 600MHz 21164A
moderate slowdown
 19.7   20.0    Compaq XP1000 500MHz 21264
slowdown has disappeared.
So the machine that needs trapb is much slower even without trapb than
even the with-trapb variant on the machine where trapb is probably a
noop. And lots of implementations of architectures without trapb have
demonstrated since then that you can have high performance and precise
exceptions without trapb.
And only after several languages built their own ATOMIC primitives, so
the programmers could remain ignorant. But this also ties the hands of
the designers in such a way that performance grows ever more slowly
with more threads.
Maybe they could free their hands by designing for a
sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
design microarchitectural features that allowed ordinary code to
utilize wider and wider OoO cores profitably.
- anton
Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
And without the SuperComputer attitude, you sell 0 parts.
{Remember how we talk about performance all the time here ?}
Wrong. The supercomputer attitude gave us such wonders as IA-64
(sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
only easier to program, but also faster.
The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
fences on hardware optimized for a weaker memory model. But that's
not the way to implement efficient sequential consistency.
In an alternate reality where AMD64 did not happen and IA-64 won,
people would justify the IA-64 ISA complexity as necessary for
performance, and claim that the IA-32 hardware in the Itanium
demonstrates the performance superiority of the EPIC approach, just
like they currently justify the performance superiority of weak and "strong" memory models over sequential consistency.
If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
A similar case: Alpha includes a trapb instruction (an exception
fence). Programmers have to insert it after FP instructions to get
precise exceptions. This was justified with performance; i.e., the
theory went: If you compile without trapb, you get performance and
imprecise exceptions, if you compile with trapb, you get slowness and
precise exceptions. I then measured SPEC 95 compiled without and with
trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
there was hardly any difference; I believe that trapb is a noop on the
21264. Here's the SPECfp_base95 numbers:
        with   without
        trapb  trapb
 9.56   11.6   AlphaPC164LX 600MHz 21164A
19.7    20.0   Compaq XP1000 500MHz 21264
So the machine that needs trapb is much slower even without trapb than
even the with-trapb variant on the machine where trapb is probably a
noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.
The 21264 Hardware Reference Manual says TRAPB (general exception barrier) and EXCB (floating-point control register barrier) are both NOPs
internally, are tossed at decode, and don't even take up an
instruction slot.
The purpose of the EXCB is to synchronize pipeline access to the
floating point control and status register with FP operations.
In the worst case this stalls until the pipeline drains.
I wonder how much logic it really saved allowing imprecise exceptions
in the InO 21064 and 21164?
Conversely, how much did it cost to deal
with problems caused by leaving these interlocks off?
The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
rules for when to write back its result register: WAW and WAR.
That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
an exception and does not write its register then we can see the out of
order register writes.
For register file writes to be precise in the presence of exceptions
requires each instruction look ahead at the state of all older
instructions *in all pipelines*.
Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
A writeback can occur if there are no WAW or WAR dependencies,
and all older uOps are Resolved_Normal.
Just off the top of my head, in addition to the normal scoreboard,
a FIFO buffer with a priority selector could be used to look ahead
at all older uOps and check their status,
and allow or stall uOp
writebacks and ensure registers always appear precise.
Which really doesn't look that expensive.
Is there something I missed, or would that FIFO suffice?
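To make that concrete, here is a minimal behavioral sketch in C of the
writeback rule described above: a program-ordered FIFO of uOp statuses
in which a uOp may write back only when the scoreboard is clear and
every older uOp has resolved without an exception. All names and sizes
are illustrative, not from any real design.

#include <stdbool.h>

enum uop_status { UNRESOLVED, RESOLVED_NORMAL, RESOLVED_EXCEPTION };

#define ROB_SIZE 32

struct uop_fifo {
    enum uop_status status[ROB_SIZE];  /* program order, oldest at head */
    int head;
};

/* True if every uOp older than 'slot' has resolved normally.
   In hardware this would be a parallel priority scan, not a loop. */
static bool older_all_normal( const struct uop_fifo *f, int slot )
{
    for( int i = f->head; i != slot; i = (i + 1) % ROB_SIZE )
        if( f->status[i] != RESOLVED_NORMAL )
            return false;
    return true;
}

/* Writeback is allowed when the scoreboard reports no WAW or WAR
   hazard and all older uOps are known to be exception-free. */
static bool may_write_back( const struct uop_fifo *f, int slot,
                            bool scoreboard_ok )
{
    return scoreboard_ok && older_all_normal( f, slot );
}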
Semi-unaligned memory tradeoff. If unaligned access is required, the
memory logic just increments the physical address by 64 bytes to fetch
the next cache line. The issue with this is that it does not go back to
the TLB to re-translate the address, meaning no protection or translation
check is made for the second line.
It would be quite slow to have the instruction reissued and percolate
down the cache access again.
This should only be an issue if an unaligned access crosses a memory
page boundary.
The instruction causes an alignment fault if a page-boundary crossing is detected.
Robert Finch <robfi680@gmail.com> posted:
Semi-unaligned memory tradeoff. If unaligned access is required, the
memory logic just increments the physical address by 64 bytes to fetch
the next cache line. The issue with this is that it does not go back to
the TLB to re-translate the address, meaning no protection or translation
check is made for the second line.
You can determine if an access is misaligned "enough" to warrant two
trips down the pipe.
a) crosses cache width
b) crosses page boundary
Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.
It would be quite slow to have the instructions reissued and percolate
down the cache access again.
An AGEN-like adder has 11 gates of delay; you can determine misalignment
in 4 gates of delay.
This should only be an issue if an unaligned access crosses a memory
page boundary.
Here you need to access the TLB twice.
The instruction causes an alignment fault if a page cross boundary is
detected.
Probably not as wise as you think.
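For concreteness, each of the two misalignment tests above is just an
add and a compare on the low address bits, which is why it is so much
cheaper than a full AGEN. A sketch in C, with example line and page
sizes:

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u
#define PAGE_BYTES 4096u

/* Does [addr, addr+size) cross a cache-line boundary? (case a) */
static bool crosses_line( uint64_t addr, unsigned size )
{
    return (addr & (LINE_BYTES - 1)) + size > LINE_BYTES;
}

/* Does it cross a page boundary? (case b)  If so, a second TLB
   lookup (a second trip down the pipe) is unavoidable. */
static bool crosses_page( uint64_t addr, unsigned size )
{
    return (addr & (PAGE_BYTES - 1)) + size > PAGE_BYTES;
}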
On Mon, 01 Dec 2025 07:56:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[snip]
If hardware designers put their mind to it, they could make sequential
consistency perform well, probably better on code that actually
accesses data shared between different threads than weak and "strong"
ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the
slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?
More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart and never uses LL/SC and always uses
8.1-style synchronization instructions with Acquire+Release flags set?
IMHO, the only simple thing about sequential consistency is its simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
non-genius coders.
In article <20251201132322.000051a5@yahoo.com>,
Michael S <already5chosen@yahoo.com> wrote:
On Mon, 01 Dec 2025 07:56:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[snip]
If hardware designers put their mind to it, they could make sequential
consistency perform well, probably better on code that actually
accesses data shared between different threads than weak and "strong"
ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the
slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?
More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart and never uses LL/SC and always uses
8.1-style synchronization instructions with Acquire+Release flags set?
IMHO, the only simple thing about sequential consistency is its simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
non-genius coders.
Compiler writers have hidden behind the hardware complexity to make
writing source code that is thread-safe much harder than it should be.
If you have to support placing hardware barriers, then the languages
can get away with needing lots of <atomic> qualifiers everywhere, even
on systems which don't need barriers, making the code more complex. And
language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone.
A bunch of useful algorithms could be written with
merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).
Kent
kegs@provalid.com (Kent Dickey) posted:
Thread-safe, by definition, is (IS) harder.
language purists still love to sneer at volatile in C-like languages as
"providing no guarantees, and so is essentially useless"--when volatile
providing no guarantees is a language and compiler choice, not something
written in stone.
The problem with volatile is that all it means is that every time a volatile variable is touched, the code has to have a corresponding LD or ST. The HW ends up knowing nothing about the value's volatility and ends up in no position to help.
"volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
and I don't think that C with just volatile gives you such guarantees.
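A minimal illustration of that last point: even on sequentially
consistent hardware, a volatile increment is still a separate load and
store, so two threads can lose updates; a C11 atomic provides the
read-modify-write that volatile cannot.

#include <stdatomic.h>

volatile int vcount;   /* NOT safe: load, add, store can interleave */
atomic_int   acount;   /* safe: one atomic read-modify-write */

void hit_unsafe( void ) { vcount = vcount + 1; }
void hit_safe( void )   { atomic_fetch_add( &acount, 1 ); }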
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
and I don't think that C with just volatile gives you such guarantees.
- anton
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
Even with a single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees.
- anton
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
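The classic use for a double-width primitive like the DCAS above is an
ABA-safe pop from a lock-free stack, where the head pointer and a
generation counter are swapped in one atomic event. A hedged sketch,
reading the "type" in the prototype above as intptr_t:

#include <stddef.h>
#include <stdint.h>

typedef int BOOLEAN;
/* The DCAS shown above, specialized to intptr_t. */
extern BOOLEAN DCAS( intptr_t oldp, intptr_t oldq,
                     intptr_t *p, intptr_t *q,
                     intptr_t newp, intptr_t newq );

typedef struct node { struct node *next; } node;

static node    *top;   /* stack head */
static intptr_t gen;   /* generation count, swapped together with top */

node *pop( void )
{
    for( ;; )
    {
        node    *t = top;
        intptr_t g = gen;
        if( t == NULL )
            return NULL;
        /* Replace {top, gen} with {t->next, g+1} only if neither has
           changed since we read them: no ABA. */
        if( DCAS( (intptr_t)t, g,
                  (intptr_t *)&top, &gen,
                  (intptr_t)t->next, g + 1 ) )
            return t;
    }
}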
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count.
Robert Finch <robfi680@gmail.com> writes:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count.
My impression is that modern implementations deal with this kind of
stuff at decoding or in the renamer. That should reduce the number of
places where it is special-cased to one, but it means that the uops
have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
in the uops.
- anton
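A behavioral sketch of that renamer approach, assuming a dedicated
physical register PZERO that always reads 0 and is never allocated
(all names here are illustrative):

#define NARCH 64
#define PZERO 0                 /* physical reg 0 is hardwired to 0 */

static int map[NARCH];          /* architectural -> physical mapping */

static int rename_src( int areg )
{
    return areg == 0 ? PZERO : map[areg];
}

static int rename_dst( int areg, int fresh_tag )
{
    if( areg == 0 )
        return fresh_tag;       /* result is discarded; never becomes
                                   architecturally visible */
    map[areg] = fresh_tag;
    return fresh_tag;
}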
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Even with a
single core system you can have pre-emptive multi-threading, or at least
interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees.
- anton
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
On 12/5/2025 12:54 PM, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Any issues with live lock in here?
[...]
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can substitute small constants for register values.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.
I hit this trying to decide where to bypass another register code to represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
Robert Finch <robfi680@gmail.com> writes:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count.
My impression is that modern implementations deal with this kind of
stuff at decoding or in the renamer. That should reduce the number of
places where it is special-cased to one, but it means that the uops
have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
in the uops.
- anton
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
It is not easy to have atomic or lock mechanisms on multi-core systems
that are convenient to use, efficient even in the worst cases, and don't require additional hardware.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
else in practice (unless perhaps it can be implemented as an extension
for RISC-V).
Even with a
single core system you can have pre-emptive multi-threading, or at least
interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as an hardware atomic operation.
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
scott@slp53.sl.home (Scott Lurndal) posted:
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as an hardware atomic operation.
The question is not would "MoveElement" be useful, but
would it be useful to have a single ATOMIC event be
able to manipulate {5,6,7,8} pointers in one event ??
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
Combined? Too inflexible.
BOOLEAN InsertElement( Element *el, Element *to )
{
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( el );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
el->next = tn;
el->prev = to;
to->next = el;
esmLOCKstore( tn->prev, el );
return TRUE;
}
return FALSE;
}
BOOLEAN RemoveElement( Element *fr )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
scott@slp53.sl.home (Scott Lurndal) posted:
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as an hardware atomic operation.
The question is not would "MoveElement" be useful, but
would it be useful to have a single ATOMIC event be
able to manipulate {5,6,7,8} pointers in one event ??
Nothing comes immediately to mind.
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
Combined? Too inflexible.
BOOLEAN InsertElement( Element *el, Element *to )
{
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( el );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
el->next = tn;
el->prev = to;
to->next = el;
esmLOCKstore( tn->prev, el );
return TRUE;
}
return FALSE;
}
BOOLEAN RemoveElement( Element *fr )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.
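For reference, the compare-and-swap that mainstream compilers do
generate is reached portably through C11 atomics; GCC and Clang lower
it to LOCK CMPXCHG on x86 and to CAS or LL/SC on ARM. A minimal
lock-acquire as an example:

#include <stdatomic.h>
#include <stdbool.h>

static bool try_take( atomic_int *lock )
{
    int expected = 0;
    /* Atomically: if (*lock == 0) { *lock = 1; return true; } */
    return atomic_compare_exchange_strong( lock, &expected, 1 );
}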
Robert Finch <robfi680@gmail.com> posted:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can
substitute small constants for register values.
Often the use of R0 as an operand causes the calculation to be degenerate. That is, R0 is not needed at all.
ADD R9,R7,R0 // is a MOV instruction
AND R9,R7,R0 // is a CLR instruction
So, you don't have to treat R0 in bypassing, but as Operand processing.
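A sketch of that operand processing at decode: instructions whose
second source is R0 degenerate into simpler internal ops, as in the
ADD/AND examples above (opcode names are illustrative):

enum op { OP_ADD, OP_AND, OP_MOV, OP_CLR /* ... */ };

struct uop { enum op op; int rd, rs1, rs2; };

static void fold_r0( struct uop *u )
{
    if( u->rs2 != 0 )
        return;                          /* nothing degenerate */
    switch( u->op )
    {
    case OP_ADD: u->op = OP_MOV; break;  /* rd = rs1 + 0 -> MOV */
    case OP_AND: u->op = OP_CLR; break;  /* rd = rs1 & 0 -> CLR */
    default: break;
    }
}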
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase = IP
AGEN Rindex==R0 implies Rindex = 0
I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
These are some of the reasons I went with
a) universal constants
b) R0 is just another GPR
So, R0, gets forwarded just as often (or lack thereof) as any joe-random register.
On 2025-12-06 12:29 p.m., MitchAlsup wrote:
We don't want no degenerating instructions.
Robert Finch <robfi680@gmail.com> posted:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can
substitute small constants for register values.
Often the use of R0 as an operand causes the calculation to be
degenerate.
That is, R0 is not needed at all.
ADD R9,R7,R0 // is a MOV instruction
AND R9,R7,R0 // is a CLR instruction
So, you don't have to treat R0 in bypassing, but as Operand processing.
Qupls now follows a similar paradigm.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase = IP
AGEN Rindex==R0 implies Rindex = 0
Rbase = r0 bypasses to 0
Rindex = r0 bypasses to 0
Rbase = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.
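In C, those AGEN bypass rules amount to a pair of selects ahead of the
address adder. A sketch with hypothetical field names:

#include <stdint.h>

static uint64_t agen( const uint64_t R[64], uint64_t ip,
                      int rbase, int rindex, int scale, int64_t disp )
{
    uint64_t base  = (rbase == 0)  ? 0  :        /* r0 base  -> 0  */
                     (rbase == 31) ? ip :        /* r31 base -> IP */
                                     R[rbase];
    uint64_t index = (rindex == 0) ? 0 : R[rindex];  /* r0 index -> 0 */
    return base + (index << scale) + disp;       /* both r0 gives
                                                    absolute addressing */
}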
I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
These are some of the reasons I went with
a) universal constants
b) R0 is just another GPR
So, R0, gets forwarded just as often (or lack thereof) as any joe-random
register.
Qupls has IP offset constant loading.
On 2025-12-06 6:33 p.m., Robert Finch wrote:
On 2025-12-06 12:29 p.m., MitchAlsup wrote:
We don't want no degenerating instructions.
Robert Finch <robfi680@gmail.com> posted:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can
substitute small constants for register values.
Often the use of R0 as an operand causes the calculation to be
degenerate.
That is, R0 is not needed at all.
ADD R9,R7,R0 // is a MOV instruction
AND R9,R7,R0 // is a CLR instruction
So, you don't have to treat R0 in bypassing, but as Operand processing.
Qupls now follows a similar paradigm.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase = IP
AGEN Rindex==R0 implies Rindex = 0
Rbase = r0 bypasses to 0
Rindex = r0 bypasses to 0
Rbase = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.
I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
These are some of the reasons I went with
a) universal constants
b) R0 is just another GPR
So, R0, gets forwarded just as often (or lack thereof) as any joe-random
register.
Qupls has IP offset constant loading.
No sooner had I updated the spec than I added two more opcodes to
perform loads and stores using IP-relative addressing. That way there is
no need to use r31, leaving 31 registers completely general purpose. I am
wanting to cast some aspects of the ISA in stone, or it will never get
anywhere.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Yes, you can add special instructions. However, the compilers
will be unlikely to generate them, thus applications that desired
the generation of such an instruction would need to create a
compiler extension (like gcc __builtin functions) or inline
assembler which would then make the program that uses the
capability both compiler specific _and_ hardware specific.
This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture-specific features, if the need for performance
gain is large enough.
A primary example is Intel TSX, which is (was?) required by SAP.
POWER also had a transactional memory feature, but they messed it
up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
only other architecture certified to run SAP, so it seems they
can do without.
Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
at least enough to write a spec for it.
Most extant SMP processors provide a compare and swap operation,
which are widely supported by the common compilers that support the
C and C++ threading functionality.
It seems there is a market for going beyond compare and swap.
scott@slp53.sl.home (Scott Lurndal) posted:
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Scott Lurndal <scott@slp53.sl.home> schrieb:
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture-specific features, if the need for performance
gain is large enough.
A primary example is Intel TSX, which is (was?) required by SAP.
POWER also had a transactional memory feature, but they messed it
up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
only other architecture certified to run SAP, so it seems they
can do without.
Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
at least enough to write a spec for it.
On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Yes, you can add special instructions. However, the compilers
will be unlikely to generate them, thus applications that desired
the generation of such an instruction would need to create a
compiler extension (like gcc __builtin functions) or inline
assembler which would then make the program that uses the
capability both compiler specific _and_ hardware specific.
This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
gain is large enough.
A primary example is Intel TSX, which is (was?) required by SAP.
By SAP HANA, I assume.
Not sure for how long it was true. It sounds very unlikely that it is
still true.
POWER also had a transactional memory feature, but they messed it
up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
only other architecture certified to run SAP, so it seems they
can do without.
Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
at least enough to write a spec for it.
Most extant SMP processors provide a compare and swap operation,
which are widely supported by the common compilers that support the
C and C++ threading functionality.
It seems there is a market for going beyond compare and swap.
TSX is close to dead.
ARM's TME was announced almost 5 years ago. AFAIK, there were no
implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
like the whole thing is dead, but there is a small chance that I am
misinterpreting.
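For completeness, this is the sort of header-hidden,
microarchitecture-specific code the thread is talking about: the real
Intel RTM intrinsics (compile with -mrtm), with a fallback lock taken
on abort. lock_take()/lock_drop() are assumed to exist elsewhere, and
as noted above TSX is disabled or removed on recent parts.

#include <immintrin.h>

extern void lock_take( void );   /* hypothetical fallback lock */
extern void lock_drop( void );

void critical( void (*work)(void) )
{
    unsigned status = _xbegin();
    if( status == _XBEGIN_STARTED )
    {
        work();                  /* runs transactionally */
        _xend();                 /* commit */
    }
    else
    {
        lock_take();             /* aborted or not supported: */
        work();                  /* run under a real lock     */
        lock_drop();
    }
}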
On 12/5/2025 11:10 AM, David Brown wrote:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems.
Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
It's strange with double-word compare-and-swap (DWCAS), where the words
are contiguous: I have seen compilers say it's not lock-free even
on x86. For a 32-bit system we have cmpxchg8b; for a 64-bit system,
cmpxchg16b. But the compiler reports not lock-free. Strange.
using cmpxchg instead of xadd: https://forum.pellesc.de/index.php?topic=7167.0
trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764
This should be lock-free on an x86, even x64:
struct ct_proxy_dwcas
{
struct ct_proxy_node* node;
intptr_t count;
};
some of my older code:
AC_SYS_APIEXPORT
int AC_CDECL
np_ac_i686_atomic_dwcas_fence
( void*,
void*,
const void* );
np_ac_i686_atomic_dwcas_fence PROC
push esi                        ; save callee-saved registers
push ebx
mov esi, [esp + 16]             ; esi = comparand pointer (2nd arg)
mov eax, [esi]                  ; edx:eax = expected 64-bit value
mov edx, [esi + 4]
mov esi, [esp + 20]             ; esi = exchange pointer (3rd arg)
mov ebx, [esi]                  ; ecx:ebx = new 64-bit value
mov ecx, [esi + 4]
mov esi, [esp + 12]             ; esi = destination pointer (1st arg)
lock cmpxchg8b qword ptr [esi]  ; if [esi] == edx:eax, [esi] = ecx:ebx
jne np_ac_i686_atomic_dwcas_fence_fail
xor eax, eax                    ; success: return 0
pop ebx
pop esi
ret
np_ac_i686_atomic_dwcas_fence_fail:
mov esi, [esp + 16]             ; failure: write the observed value
mov [esi + 0], eax              ; back into the comparand
mov [esi + 4], edx
mov eax, 1                      ; return 1
pop ebx
pop esi
ret
np_ac_i686_atomic_dwcas_fence ENDP
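A hedged modern counterpart to hand-written assembly like the above is
GCC/Clang's __atomic_compare_exchange on a 16-byte object. On x86-64
this wants -mcx16 to get CMPXCHG16B; depending on the compiler version
it may instead call into libatomic, which is exactly the "not
lock-free" report complained about earlier.

#include <stdbool.h>
#include <stdint.h>

/* 16 bytes, aligned for CMPXCHG16B. */
struct dw { _Alignas(16) void *node; intptr_t count; };

static bool dwcas( struct dw *dest, struct dw *expected, struct dw desired )
{
    return __atomic_compare_exchange( dest, expected, &desired,
                                      false,
                                      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST );
}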
Even with a single core system you can have pre-emptive
multi-threading, or at least interrupt routines that may need to cooperate
with other tasks on data.
and I don't think that C with just volatile gives you such guarantees. >>>>
- anton
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/5/2025 12:54 PM, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Any issues with live lock in here?
A bit hard to tell because of 2 things::
a) I carry around the thread priority and when interference occurs,
the higher priority thread wins; on ties, the already-accessing thread wins.
b) live-lock is resolved or not by the caller of these routines, not
these routines themselves.
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
Scott Lurndal <scott@slp53.sl.home> wrote:
Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.
Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.
Interestingly, Linux restartable sequences allow for acquisition of
a lock with no membarrier or atomic instruction on the fast path,
at the cost of a syscall on the slow path (no free lunch...)
But you also need assembler to do it.
An example is at https://gitlab.ethz.ch/extra_projects/cpu-local-lock
scott@slp53.sl.home (Scott Lurndal) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as an hardware atomic operation.
The question is not would "MoveElement" be useful, but
would it be useful to have a single ATOMIC event be
able to manipulate {5,6,7,8} pointers in one event ??
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
Combined? Too inflexible.
BOOLEAN InsertElement( Element *el, Element *to )
{
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( el );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
el->next = tn;
el->prev = to;
to->next = el;
esmLOCKstore( tn->prev, el );
return TRUE;
}
return FALSE;
}
BOOLEAN RemoveElement( Element *fr )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Atomic add/sub are useful. The other atomic math operations (min, max,
etc) may be useful in certain cases as well.
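For concreteness, a minimal C11 sketch of why a single compare-and-swap
suffices for insertion - pushing onto a shared list head (the node type
and names here are illustrative, not anyone's actual API):

#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int payload;
};

/* Push onto a lock-free list head with one CAS. On failure, the CAS
   reloaded the current head into 'old', so we re-link and retry. */
void push( struct node *_Atomic *head, struct node *n )
{
    struct node *old = atomic_load( head );
    do {
        n->next = old;    /* link to the head we last observed */
    } while( !atomic_compare_exchange_weak( head, &old, n ) );
}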
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.
These is no bus!
The esmLOCKload causes the <translated> address to be 'monitored'
for interference, and to announce participation in the ATOMIC event.
The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
AND sets up a default control point (This instruction itself) so that
if interference is detected at esmLOCKstore control is transferred to
that control point.
So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.
There is a branch-on-interference instruction that
a) does what it says,
b) sets up an alternate atomic control point.
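For contrast, here is what test-and-test-and-set looks like when it has
to be spelled out in software on a conventional ISA - a minimal C11
sketch (the type and function names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock;

void ttas_acquire( ttas_lock *l )
{
    for( ;; )
    {
        /* test: spin on a plain read, which stays cache-local */
        while( atomic_load_explicit( &l->locked, memory_order_relaxed ) )
            ;
        /* test-and-set: only now attempt the atomic RMW */
        if( !atomic_exchange_explicit( &l->locked, true,
                                       memory_order_acquire ) )
            return;
    }
}

void ttas_release( ttas_lock *l )
{
    atomic_store_explicit( &l->locked, false, memory_order_release );
}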
It is not easy to have atomic or lock mechanisms on multi-core systems
that are convenient to use, efficient even in the worst cases, and don't
require additional hardware.
I am using the "Miss Buffer" as the point of monitoring for interference.
a) it already has to monitor "other hits" from outside accesses to deal
with the coherence mechanism.
b) the esm additions to the Miss Buffer are on the order of 2%
c) there are other means to strengthen guarantees of forward progress.
<snip>
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Of course. It is interesting to speculate about possible features of an
architecture like yours, but it is not likely to be available to anyone
else in practice (unless perhaps it can be implemented as an extension
for RISC-V).
Even with a single core system you can have pre-emptive multi-threading,
or at least interrupt routines that may need to cooperate with other
tasks on data, and I don't think that C with just volatile gives you
such guarantees.
- anton
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provideYou describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.
enough
guarantees for multi-threaded coding on multi-core systems.
Basically,
it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C >>>>>>> level,
but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core
systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing something
here?
Lock the BUS? Only when shit hits the fan. What about locking the cache line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
BOOLEAN RemoveElement( Element *fr )
{
fn = esmLOCKload( fr->next );
fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++
threading functionality.
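To make that point concrete: since C11/C++11 a compare-and-swap is
reachable portably, with no inline assembler or compiler-specific
builtins; the compiler lowers it to CMPXCHG, LDXR/STXR, LR/SC, or
whatever the target offers. A minimal sketch (the counter and the cap
are illustrative):

#include <stdatomic.h>

_Atomic long counter;

/* Increment 'counter' but never past 'limit' - the kind of RMW that
   needs CAS rather than a plain atomic add. */
void add_capped( long limit )
{
    long old = atomic_load( &counter );
    long new;
    do {
        if( old >= limit )
            return;            /* already at the cap */
        new = old + 1;
    } while( !atomic_compare_exchange_weak( &counter, &old, new ) );
}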
On 08/12/2025 00:17, Chris M. Thomasson wrote:
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>> enoughYou describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential
guarantees for multi-threaded coding on multi-core systems.
Basically,
it only works at the C abstract machine level - it does nothing >>>>>>>> that
affects the hardware. So volatile writes are ordered at the C >>>>>>>> level,
but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
consistency".
If hardware guaranteed sequential consistency, volatile would
provide
guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.
However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core
systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction. >>>> MM can MOV up to 8192 bytes as a single ATOMIC instruction. >>>>
The functions below rely on more than that - to make them work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing
something here?
Lock the BUS? Only when shit hits the fan. What about locking the
cache line? Actually, I think we can "force" an x86/x64 to lock the
bus if we do a LOCK'ed RMW on memory that straddles cache lines?
Yes, I meant "lock the bus" - but I might have been overcautious.
However, it seems there is a hidden hardware loop here - the
esmLOCKstore instruction can fail and the processor jumps back to
the first esmLOCKload instruction. With that, you don't need to block
other code from running or accessing the bus.
<snip>
BOOLEAN RemoveElement( Element *fr )
{
fn = esmLOCKload( fr->next );
fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally
sufficient.
Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++
threading functionality.
I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.
I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
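A pure-software analogy of that detect-rather-than-prevent idea (not My
66000's mechanism) is the seqlock reader: the writer is never blocked,
and the reader notices afterwards that it was interfered with and redoes
its read. A minimal C11 sketch with illustrative field names:

#include <stdatomic.h>

typedef struct {
    atomic_uint seq;    /* odd while a writer is mid-update */
    int a, b;           /* protected payload */
} seqlock_t;

void seq_read( seqlock_t *s, int *a, int *b )
{
    unsigned s0, s1;
    do {
        do {            /* wait out an in-progress writer */
            s0 = atomic_load_explicit( &s->seq, memory_order_acquire );
        } while( s0 & 1u );
        *a = s->a;      /* these reads may observe a torn update... */
        *b = s->b;
        atomic_thread_fence( memory_order_acquire );
        s1 = atomic_load_explicit( &s->seq, memory_order_relaxed );
    } while( s0 != s1 ); /* ...which is detected here and redone */
}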
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
In the My 66000 case, Mem References can lock up to 8 cache lines.
On 06/12/2025 18:44, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.
That's what I assumed.
Certainly there are situations where it can be helpful to have longer
atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.
These is no bus!
I think there's a typo or some missing words there?
The esmLOCKload causes the <translated> address to be 'monitored'
for interference, and to announce participation in the ATOMIC event.
The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
AND sets up a default control point (This instruction itself) so that
if interference is detected at esmLOCKstore control is transferred to
that control point.
So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.
If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
you have the associated loop built into the hardware?
I can see that potentially improving efficiency, but I also find it very difficult to
read or write C code that has hidden loops. And I worry about how it
would all work if another thread on the same core or a different core
was running similar code in the middle of these sequences. It also
reduces the flexibility - in some use-cases, you want to have software limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.
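For illustration, the kind of software-visible retry budget meant here,
written as a C11 CAS loop standing in for lr/sc (MAX_TRIES and the
fallback policy are invented for the sketch):

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TRIES 100

bool bounded_increment( _Atomic long *p )
{
    long old = atomic_load( p );
    for( int tries = 0; tries < MAX_TRIES; tries++ )
    {
        if( atomic_compare_exchange_weak( p, &old, old + 1 ) )
            return true;   /* done within the retry budget */
    }
    return false;          /* caller logs/handles suspected livelock */
}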
There is a branch-on-interference instruction that
a) does what it says,
b) sets up an alternate atomic control point.
It is not easy to have atomic or lock mechanisms on multi-core systems
that are convenient to use, efficient even in the worst cases, and don't
require additional hardware.
I am using the "Miss Buffer" as the point of monitoring for interference.
a) it already has to monitor "other hits" from outside accesses to deal
with the coherence mechanism.
b) the esm additions to the Miss Buffer are on the order of 2%
c) there are other means to strengthen guarantees of forward progress.
<snip>
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Of course. It is interesting to speculate about possible features of an
architecture like yours, but it is not likely to be available to anyone
else in practice (unless perhaps it can be implemented as an extension
for RISC-V).
Even with a single core system you can have pre-emptive multi-threading,
or at least interrupt routines that may need to cooperate with other
tasks on data, and I don't think that C with just volatile gives you
such guarantees.
- anton
<snip>
BOOLEAN RemoveElement( Element *fr )
{
fn = esmLOCKload( fr->next );
fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++
threading functionality.
I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.
I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.
It would seem that esmINTERFERENCE() would indicate that everybody with access out to the coherence point has agreed to the locked area? Does
that require that all devices respect the esmINTERFERENCE()?
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.
esmINTERFERENCE seems to require multiple of these exclusive blocks
to cover non-contiguous address ranges, which on first blush leads
me to worry both about deadlock situations and starvation issues.
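For reference, the exclusive-monitor loop as it is typically written on
ARMv8 - a minimal GCC/Clang inline-asm sketch of an atomic increment;
the explicit retry branch is exactly what esm hides:

/* LDXR marks the address exclusive; STXR writes 0 to its status
   register only if nothing touched the monitored range in between. */
static inline void xadd1( unsigned long *p )
{
    unsigned long tmp;
    unsigned int failed;
    __asm__ volatile(
        "1: ldxr  %0, [%2]      \n"   /* load-exclusive     */
        "   add   %0, %0, #1    \n"   /* modify             */
        "   stxr  %w1, %0, [%2] \n"   /* store-exclusive    */
        "   cbnz  %w1, 1b       \n"   /* interfered: retry  */
        : "=&r"(tmp), "=&r"(failed)
        : "r"(p)
        : "memory" );
}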
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
In the My 66000 case, Mem References can lock up to 8 cache lines.
What if two processors have intersecting (but not fully overlapping)
sets of those 8 cache lines?
Can you guarantee forward progress?
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes
as a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including
across buffers and bus bridges. It would have to go to the memory
coherence point. Otherwise, some other device using a bridge could
update the same address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
I am assuming the esmLockStore() just unlocks what was previously
locked and the stores have already happened by that time.
There is no "locking" in the sense of preventing any accesses.
On 08/12/2025 17:23, Stephen Fuld wrote:
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes
as a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including
across buffers and bus bridges. It would have to go to the memory
coherence point. Otherwise, some other device using a bridge could
update the same address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
Yes, that is correct (as far as I understand it now). The critical part
is the hidden hardware loop that was not mentioned or indicated in the
original code.
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in hardware, but there are no benefits in hiding it from the programmer!)
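The two approaches as minimal C sketches - irq_save()/irq_restore() are
hypothetical platform helpers, not a real API:

#include <stdatomic.h>

extern unsigned irq_save( void );       /* hypothetical */
extern void irq_restore( unsigned );    /* hypothetical */

long plain_counter;
_Atomic long shared_counter;

/* 1: prevent interference - enough on a single core. */
void inc_locked( void )
{
    unsigned s = irq_save();
    plain_counter++;        /* nothing can break this up */
    irq_restore( s );
}

/* 2: detect interference and retry - works across cores; note that
   the retry loop is explicit in the source, unlike esm's. */
void inc_retry( void )
{
    long old = atomic_load( &shared_counter );
    while( !atomic_compare_exchange_weak( &shared_counter, &old, old + 1 ) )
        ;                   /* someone interfered: try again */
}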
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra instructions.
On 09/12/2025 20:15, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra instructions.
The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.
Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)
So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
of better names, but that would at least give the reader a clue that
there's something odd going on.
Mostly esm detects interference but there are times when esm is allowed
to ignore interference.
Consider a server scale esm implementation. In such an implementation,
esm is enhanced with a system* arbiter.
After any successful ATOMIC event esm reverts to "Optimistic" mode. In
optimistic mode, esm races through the code as fast as possible expecting
no interference. When interference is detected, the event fails and a HW
counter is incremented. The failure diverts control to the ATOMIC control
point. We still have the property that all participating memory locations
become visible at the same instant.
At this point the core is in "careful" mode: the core becomes sequentially
consistent, and SW chooses to re-run the event. Here, cache misses leave
the core in program order, ... When interference is detected, the event
fails and that HW counter is incremented. Failure diverts control to the
ATOMIC control point; no participating memory is seen to have been modified.
If the core can determine that all writes to participating memory can be
performed (at the first participating store), the core is allowed to NaK
lower priority interfering accesses.
On 12/9/2025 11:15 AM, MitchAlsup wrote:
<snip>
Mostly esm detects interference but there are times when esm is allowed
to ignore interference.
Consider a server scale esm implementation. In such an implementation,
esm is enhanced with a system* arbiter.
After any successful ATOMIC event esm reverts to "Optimistic" mode. In
optimistic mode, esm races through the code as fast as possible expecting
no interference. When interference is detected, the event fails and a HW
counter is incremented. The failure diverts control to the ATOMIC control
point. We still have the property that all participating memory locations
become visible at the same instant.
At this point the core is in "careful" mode,
I am missing some understanding here, about this "counter". This
paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
just a mode bit. So assuming it is a counter and you need "n" failures
in a row to go into careful mode, is "n" hardwired or settable by
software? What are the tradeoffs for smaller or larger values of "n"?
core becomes sequentially
consistent, SW chooses to re-run the event. Here, cache misses leave
core in program order,... When interference is detected, the event fails and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
lower priority interfering accesses.
Again, after a single failure in careful mode or n failures? If n, is
it the same value of n as for the transition from optimistic to careful mode? Same questions as before about who sets the value and is it
software changeable?
David Brown <david.brown@hesbynett.no> posted:
On 09/12/2025 20:15, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra instructions.
The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.
Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)
So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
"load_and_set_retry_point()" and "store_or_retry()". Feel free to think
of better names, but that would at least give the reader a clue that
there's something odd going on.
This is a useful suggestion; thanks.
On the other hand, there are some non-vonNeumann actions lurking within
esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.
1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
has executed.
2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally), and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.
So, here we have non-participating STs having been written and older
participating STs have not.
3rd:: control transfer not under SW control--more like exceptions and
interrupts than Br-condition--except that the target of control transfer
is based on the code in the event.
4th:: one cannot test esm with a random code generator, since the
probability that the random code generator creates a legal esm event is
exceedingly low.
On 09/12/2025 22:28, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 09/12/2025 20:15, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra
instructions.
The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.
Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)
So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
"load_and_set_retry_point()" and "store_or_retry()". Feel free to think
of better names, but that would at least give the reader a clue that
there's something odd going on.
This is a useful suggestion; thanks.
I can certainly say they would help /me/ understand the code, so maybe
they would help other people understand it too.
On the other hand, there are some non-vonNeumann actions lurking within
esm. Where vonNeumann means: that every instruction is executed in its
entirety before the next instruction appears to start executing.
That's a rather different use of the term "vonNeumann" from anything I
have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
And are we thinking about the instructions purely from the viewpoint of
the cpu executing them?
IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors can have load/store multiple instructions that are interruptable - in
some cases, after returning from the interrupt (and any associated
thread context switches) the instructions are restarted, in other cases
they are continued.
But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
of it.
1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
has executed.
That is presumably a choice you made for the debugging features of the device.
2nd::the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally), and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.
So, here we have non-participating STs having been written and older
participating STs have not.
3rd:: control transfer not under SW control--more like exceptions and
interrupts than Br-condition--except that the target of control transfer
is based on the code in the event.
OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).
My main concern was
the disconnect between how the code was written and what it actually does.
4th:: one cannot test esm with a random code generator, since the
probability
that the random code generator creates a legal esm event is
exceedingly low.
Testing and debugging any kind of locking or atomic access solution is always very difficult.
You can rarely try out conflicts or potential
race conditions in the lab - they only ever turn up at customer demos!
David Brown <david.brown@hesbynett.no> posted:
OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).
esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.
My main concern was
the disconnect between how the code was written and what it actually does.
There is a 26 page specification the programmer needs to read and understand. This includes things we have not talked about--such as::
a) terminating an event without writing anything
b) proactively minimizing future interference
c) modifications to cache coherence model
at the architectural level.
The architectural specification allows for various scales of
µArchitecture to independently choose how to implement esm and provide
the architectural features at the SW level. For example, the kinds of
esm activities for a 1-wide In-Order µController are vastly different
from those suitable for a server scale rack of processor ensembles. What
we want is one SW model that covers the whole gamut.
4th:: one cannot test esm with a random code generator, since the probability
that the random code generator creates a legal esm event is exceedingly low.
Testing and debugging any kind of locking or atomic access solution is
always very difficult. You can rarely try out conflicts or potential
race conditions in the lab - they only ever turn up at customer demos!
Right at Christmas time !! {Ask me how I know}.
On 10/12/2025 21:10, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).
esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.
My main concern was
the disconnect between how the code was written and what it actually does.
Perhaps it would be better to think of these sequences in assembler
rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.
There is a 26-page specification the programmer needs to read and understand.
This includes things we have not talked about--such as::
a) terminating an event without writing anything
b) proactively minimizing future interference
c) modifications to cache coherence model
at the architectural level.
Fair enough. This is not a minor or simple feature!
The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at the SW level. For example, the kinds of esm activities for a 1-wide in-order µController are vastly different than those suitable for a server-scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.
4th:: one cannot test esm with a random code generator, since the probability
that the random code generator creates a legal esm event is exceedingly low.
Testing and debugging any kind of locking or atomic access solution is
always very difficult. You can rarely try out conflicts or potential
race conditions in the lab - they only ever turn up at customer demos!
Right at Christmas time !! {Ask me how I know}.
We can gather round the fire, and Grampa can settle in his rocking chair
to tell us war stories from the olden days :-)
A good story is always nice, so go for it!
(We once had a system where there was a bug that not only triggered only
at the customer's site, but did so only on the 30th of September. It
took years before we made the connection to the date and found the bug.)
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
We both made it home for Christmas, and in some part saved the
company...
On 12/11/2025 1:05 AM, David Brown wrote:
On 10/12/2025 21:10, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).
esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.
My main concern was
the disconnect between how the code was written and what it actually
does.
Perhaps it would be better to think of these sequences in assembler
rather than C - you want tighter control than C normally allows, and
you don't want optimisers re-arranging things too much.
Right. Way back before C/C++11 I would code all of my sensitive
lock-free/wait-free code in assembly.
[...]
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.
According to Thomas Koenig <tkoenig@netcologne.de>:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.
On machines with delayed branches I've seen assemblers that move
instructions into the delay slot. Can't think of any others off hand.
On 12/11/2025 5:41 PM, John Levine wrote:
According to Thomas Koenig <tkoenig@netcologne.de>:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.
On machines with delayed branches I've seen assemblers that move
instructions into the delay slot. Can't think of any others off hand.
That would suck! Back when I used to code in SPARC assembly language, I
had full control over my delay slots. Actually, IIRC, putting a MEMBAR
instruction in a delay slot is VERY bad.
On Thu, 11 Dec 2025 20:26:09 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
We both made it home for Christmas, and in some part saved the
company...
Not for long, though... Wasn't it dead anyway within 6-7 months?
In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
According to Thomas Koenig <tkoenig@netcologne.de>:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.
On machines with delayed branches I've seen assemblers that move
instructions into the delay slot. Can't think of any others off hand.
I've seen things like this, as well, particularly on machines
with multiple delay slots, where this detail was hidden from the
programmer. Or at least I have a vague memory of this; perhaps
I'm hallucinating.
More dangerous are linkers that do LTO and decide to elide code
that, no, really, I actually need for reasons that are not
apparent to the toolchain.
On 12/12/2025 14:05, Dan Cross wrote:
In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
According to Thomas Koenig <tkoenig@netcologne.de>:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.
On machines with delayed branches I've seen assemblers that move
instructions into the delay slot. Can't think of any others off hand.
I've seen things like this, as well, particularly on machines
with multiple delay slots, where this detail was hidden from the
programmer. Or at least I have a vague memory of this; perhaps
I'm hallucinating.
I've seen a few assemblers that do fancy things with jumps and branches
- giving you generic conditional branch pseudo-instructions that get
turned into different types of real instructions depending on the
distance needed for the jumps and the ranges supported by the
instructions. And there are plenty that have pseudo-instructions for
loading immediates into registers that generate whatever sequence of
load immediate, shift-and-or, etc., are needed.
More dangerous are linkers that do LTO and decide to elide code
that, no, really, I actually need for reasons that are not
apparent to the toolchain.
IME you have control over the details - either using directives in the
assembly, or in the linker control files. Of course that might mean
modifying code that you hoped to use untouched, and it's not hard to
forget to add a "keep" or "retain" directive.
I've found link-time dead code elimination quite useful when I have one
code base but different binary builds - sometimes all you need is a
different linker file.
According to Chris M. Thomasson <chris.m.thomasson.1@gmail.com>:
On 12/11/2025 5:41 PM, John Levine wrote:
According to Thomas Koenig <tkoenig@netcologne.de>:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.
On machines with delayed branches I've seen assemblers that move
instructions into the delay slot. Can't think of any others off hand.
That would suck! Back when I used to code in SPARC assembly language, I
had full control over my delay slots. Actually, IIRC, putting a MEMBAR
instruction in a delay slot is VERY bad.
I think they were smart enough only to move instructions that wouldn't cause problems.
In article <10hh8qe$2v9lm$1@dont-email.me>,
David Brown <david.brown@hesbynett.no> wrote:
On 12/12/2025 14:05, Dan Cross wrote:
In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
According to Thomas Koenig <tkoenig@netcologne.de>:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.
On machines with delayed branches I've seen assemblers that move
instructions into the delay slot. Can't think of any others off hand.
I've seen things like this, as well, particularly on machines
with multiple delay slots, where this detail was hidden from the
programmer. Or at least I have a vague memory of this; perhaps
I'm hallucinating.
I've seen a few assemblers that do fancy things with jumps and branches
- giving you generic conditional branch pseudo-instructions that get
turned into different types of real instructions depending on the
distance needed for the jumps and the ranges supported by the
instructions. And there are plenty that have pseudo-instructions for
loading immediates into registers that generate whatever sequence of
load immediate, shift-and-or, etc., are needed.
More dangerous are linkers that do LTO and decide to elide code
that, no, really, I actually need for reasons that are not
apparent to the toolchain.
IME you have control over the details - either using directives in the
assembly, or in the linker control files. Of course that might mean
modifying code that you hoped to use untouched, and it's not hard to
forget to add a "keep" or "retain" directive.
Provided, of course, that you have access to both the assembly
and the linker configuration for a given program. Sometimes you
don't (e.g., if the code in question is in some higher-level
language) or the linker configuration is just some default.
For example, the Plan 9 C compiler delegated actual instruction
selection to the linker; the compiler emitted a high(er)-level
representation of the operation. This made the linker free to
perform peephole optimization, potentially eliding important
instructions (like writes to MMIO regions). Fortunately, the
Plan 9 authors understood this so effectively all globals were
volatile, but when porting that code to standard C, one had to
exercise some care.
I've found link-time dead code elimination quite useful when I have one
code base but different binary builds - sometimes all you need is a
different linker file.
Agreed, it _is_ useful. But sometimes it's inappropriate.
Many early RISC assemblers were in charge of moving instructions around subject to not altering register dependencies and not altering control
flow dependencies. This allowed those assemblers to move code across
memory instructions, across long latency calculation instructions,
branch instructions, including delay slots; and redefine what "program order" now is. A bad side effect of exposing the pipeline to SW.
We mostly have gotten away from this due to "smart" instruction queueing.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Many early RISC assemblers were in charge of moving instructions around subject to not altering register dependencies and not altering control
flow dependencies. This allowed those assemblers to move code across
memory instructions, across long latency calculation instructions,
branch instructions, including delay slots; and redefine what "program order" now is. A bad side effect of exposing the pipeline to SW.
I never heard of that one.
Sounds like bad design - that should be done by the compiler,
not the assembler. It is fine for the compiler to have pipeline
descriptions in the cost model of the CPU under a specific -march
or -mtune flag.
(Yes, it is preferred that performance should be rather good for
code generated for a generic microarchitecture).
We mostly have gotten away from this due to "smart" instruction queueing.
What is that?
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Many early RISC assemblers were in charge of moving instructions around
subject to not altering register dependencies and not altering control
flow dependencies. This allowed those assemblers to move code across
memory instructions, across long latency calculation instructions,
branch instructions, including delay slots; and redefine what "program
order" now is. A bad side effect of exposing the pipeline to SW.
I never heard of that one.
Sounds like bad design - that should be done by the compiler,
not the assembler. It is fine for the compiler to have pipeline
descriptions in the cost model of the CPU under a specific -march
or -mtune flag.
(Yes, it is preferred that performance should be rather good for
code generated for a generic microarchitecture).
We mostly have gotten away from this due to "smart" instruction queueing.
What is that?
Reservation stations {value capturing and value free}, Scoreboards,
Dispatch stacks, and similar.
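A toy C model of what that queueing buys (structure and numbers purely
illustrative): the hardware tracks when each register will be ready and
holds instructions back itself, so program order no longer has to
encode the pipeline.

#include <stdbool.h>

#define NREGS 64
static int ready_at[NREGS];   /* cycle at which each register is ready */

typedef struct { int rd, rs1, rs2, latency; } insn;

/* Issue only when both sources are ready -- the dependency tracking an
   early-RISC assembler had to fake by statically rearranging code. */
bool can_issue(const insn *i, int cycle)
{
    return ready_at[i->rs1] <= cycle && ready_at[i->rs2] <= cycle;
}

void issue(const insn *i, int cycle)
{
    ready_at[i->rd] = cycle + i->latency;   /* reserve the result reg */
}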
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:Such as ????
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>>> guarantees for multi-threaded coding on multi-core systems.You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.
Basically,
it only works at the C abstract machine level - it does nothing that >>>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>>> but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>>
Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction. >>>> MM can MOV up to 8192 bytes as a single ATOMIC instruction. >>>>
The functions below rely on more than that - to make the work, as far as >>> I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
In the My 66000 case, Mem References can lock up to 8 cache lines.
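That x86 corner case is easy to demonstrate (hedged sketch, GCC-style
C; note that many recent cores trap or heavily penalize such "split
locks", so this is illustration only):

#include <stdint.h>

_Alignas(64) static volatile unsigned char buf[128];

int main(void)
{
    /* A 4-byte counter deliberately straddling the 64-byte line
       boundary: a LOCKed RMW here cannot be satisfied by holding one
       cache line, so the core falls back to a bus-level lock. */
    volatile uint32_t *split = (volatile uint32_t *)(buf + 62);
    __sync_fetch_and_add(split, 1);   /* compiles to LOCK XADD on x86-64 */
    return 0;
}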
On 12/8/2025 12:06 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
In the My 66000 case, Mem References can lock up to 8 cache lines.
Pretty flexible wrt implementing those exotic things back in the day,
experimental algos that need DCAS, KCSS, etc... A heck of a lot of
things can be accomplished with DWCAS, aka cmpxchg8b on a 32-bit system,
or cmpxchg16b on a 64-bit system.
People would bend over backwards to get a DCAS, or NCAS. It would be
infested with strange indirection ala "descriptors", and involved a shit
load of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.
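A hedged sketch of DWCAS on x86-64 via GCC's 128-bit atomics (build
with -mcx16 so it lowers to CMPXCHG16B instead of a library call); the
pointer+tag layout is the classic ABA-counter trick:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t ptr; uint64_t tag; } pair;   /* 16 bytes */

static bool dwcas(unsigned __int128 *dst, pair *expected, pair desired)
{
    unsigned __int128 e, d;
    memcpy(&e, expected, sizeof e);
    memcpy(&d, &desired, sizeof d);
    bool ok = __atomic_compare_exchange_n(dst, &e, d, false,
                  __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    memcpy(expected, &e, sizeof e);   /* observed value on failure */
    return ok;
}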
On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
On 12/8/2025 12:06 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
Pretty flexible wrt implementing those exotic things back in the day,
experimental algos that need DCAS, KCSS, etc... A heck of a lot of
things can be accomplished with DWCAS, aka cmpxchg8b on a 32-bit system,
or cmpxchg16b on a 64-bit system.
People would bend over backwards to get a DCAS, or NCAS. It would be
infested with strange indirection ala "descriptors", and involved a shit
load of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.
Have you ever read about KCSS?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs
https://patents.google.com/patent/US7293143
Most extant SMP processors provide a compare-and-swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.
On 12/8/2025 9:14 AM, Scott Lurndal wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing
executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including
across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the
same
address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
Sounds very similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.
Any mutation to the reservation granule?
What my solution entails is a modification to the cache coherence model
(NaK) that indicates "Yes, I have the line you referenced, but, no, you
can't have it right now" in order to strengthen the guarantees of
forward progress.
On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
Any mutation to the reservation granule?
I forget whether a load from the reservation granule would cause an
LL/SC to fail; I know a store would. False sharing in poorly written
programs would cause it to occur - LL/SC experiencing livelock. This
was back in my PPC days.
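The usual defence, sketched in C11 - the 128-byte granule size below is
an assumption (it is implementation-defined), so check the actual core:

#include <stdatomic.h>

#define GRANULE 128   /* assumed reservation-granule size */

/* One lock word per granule: stores to a neighbouring lock can no
   longer land in the same granule and kill this one's reservation. */
struct padded_lock {
    _Alignas(GRANULE) atomic_int word;
    unsigned char pad[GRANULE - sizeof(atomic_int)];
};

static struct padded_lock locks[8];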
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
What my solution entails is a modification to the cache coherence model
(NaK) that indicates "Yes, I have the line you referenced, but, no, you
can't have it right now" in order to strengthen the guarantees of
forward progress.
How does it strengthen the guarantees of forward progress?
My guess:
If the requester itself is in an atomic sequence B, it will cancel it.
This could help if the atomic sequence A that caused the NaK then
tries to get a cache line that would be kept by B.
There is still a chance of both sequences canceling each other by
sending NaKs at the same time, but it is smaller and with something
like exponential backoff eventual forward progress could be achieved.
- anton
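That contrived-but-workable policy, sketched in C11 (cpu_relax() is a
hypothetical stand-in for a PAUSE/YIELD hint):

#include <stdatomic.h>

extern void cpu_relax(void);   /* hypothetical PAUSE/YIELD wrapper */

void acquire(atomic_flag *f)
{
    unsigned delay = 1;
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire)) {
        for (unsigned i = 0; i < delay; i++)
            cpu_relax();
        if (delay < 1024)
            delay <<= 1;       /* exponential back-off, capped */
    }
}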
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
What my solution entails is a modification to the cache coherence model
(NaK) that indicates "Yes, I have the line you referenced, but, no, you
can't have it right now" in order to strengthen the guarantees of
forward progress.
How does it strengthen the guarantees of forward progress?
The allowance of a NaK is only available under somewhat special circumstances::
a) in Careful mode:: when the core can see that all STs have write permission
and the data is present, NaKs allow the Modification part to run to
completion.
b) in Slow and Methodical mode:: the core can NaK any access to any of its
cache lines--preventing interference.
My guess:
If the requester itself is in an atomic sequence B, it will cancel it.
Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
the event by the time the innocent request shows up again.
This could help if the atomic sequence A that caused the NaK then
tries to get a cache line that would be kept by B.
There is still a chance of both sequences canceling each other by
sending NaKs at the same time, but it is smaller and with something
like exponential backoff eventual forward progress could be achieved.
Instead of some contrived back-off policy--at the failure point one can
read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
So, if you are going after a unit of work, you march down the queue WHY
units and then YOU are guaranteed that YOU are the only one after that
unit of work.
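A hedged sketch of that idiom in C; esmWHY(), try_event(), and
take_unit() are hypothetical stand-ins for however the WHY register and
the work queue are actually exposed:

extern int esmWHY(void);       /* hypothetical: WHY value after the event */
extern void try_event(void);   /* hypothetical: run the atomic event      */
extern void take_unit(int n);  /* hypothetical: claim the n-th unit       */

void get_work(void)
{
    for (;;) {
        try_event();
        int why = esmWHY();
        if (why == 0)
            return;            /* success */
        if (why < 0)
            continue;          /* spurious failure: just retry */
        /* why > 0: we are why-th in line, so march that far down the
           queue -- past that point we are guaranteed to be alone on
           our unit of work. */
        take_unit(why);
    }
}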
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
I forget whether a load from the reservation granule would cause an
LL/SC to fail; I know a store would. False sharing in poorly written
programs would cause it to occur - LL/SC experiencing livelock. This
was back in my PPC days.
A LD to the granule would cause loss of write permission, causing a long
delay to perform SC and greatly increasing the probability of interference.
On 12/13/2025 11:12 AM, MitchAlsup wrote:
Instead of some contrived back-off policy--at the failure point one can read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be. So, if you are going after a unit of work, you march down the queue WHY units and then YOU are guaranteed that YOU are the only one after that
unit of work.
Step one: make sure that a failure means another thread made progress.
Strong CAS does this. Don't let it spuriously fail where nothing makes
progress... ;^o
Oh my, we got a load on the reservation granule - abort all LL/SC in
progress wrt that granule. Of course this assumes that the user that
created the program for it gets things right.
For LL/SC on the PPC it definitely helps when things are aligned and
padded up to a reservation granule, not just an L2 cache line. Helps
mitigate false sharing causing livelock.
Even with weak CAS, akin to LL/SC: well, how sensitive is that
reservation granule? Can a simple load cause a failure?
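In C11 terms that is the weak/strong distinction (a sketch; both forms
are widely available via <stdatomic.h>):

#include <stdatomic.h>

void inc(atomic_int *c)
{
    int old = atomic_load(c);
    /* Weak CAS may fail spuriously (LL/SC style: even an untouched
       granule can lose its reservation), so it only belongs in a loop. */
    while (!atomic_compare_exchange_weak(c, &old, old + 1))
        ;   /* old is reloaded on failure */
}

int try_claim(atomic_int *flag)
{
    int expected = 0;
    /* Strong CAS fails only if the value really differed -- a failure
       proves some other thread made progress. */
    return atomic_compare_exchange_strong(flag, &expected, 1);
}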