Skip Carter did not post in this thread, but given that he proposed
the change, he probably found 6 to be too few; or maybe it was just a
phenomenon that we also see elsewhere as range anxiety. In any case,
he made no such proposal to Forth-200x, so apparently the need was not
pressing.
In any case, I almost always use the default FP pack, and here
the VFX-5 and SwiftForth-4 approach is unbeatable in simplicity.
Instead of performing the sequence of commands shown above, I just
start the Forth system, and FP words are ready.
- anton
On 6/07/2025 9:30 pm, Anton Ertl wrote:
dxf <dxforth@gmail.com> writes:
On 5/07/2025 6:49 pm, Anton Ertl wrote:
dxf <dxforth@gmail.com> writes:
[8 stack items on the FP stack]
Puzzling because of a thread here not long ago in which scientific users
appear to suggest the opposite. Such concerns have apparently been around
a long time:
https://groups.google.com/g/comp.lang.forth/c/CApt6AiFkxo/m/wwZmc_Tr1PcJ
I have read through the thread. It's unclear to me which scientific
users you have in mind. My impression is that 8 stack items was
deemed sufficient by many, and preferable (on 387) for efficiency
reasons.
AFAICS both Skip Carter (proponent) and Julian Noble were suggesting the
6-level minimum was inadequate.
Skip Carter did not post in this thread, but given that he proposed
the change, he probably found 6 to be too few; or maybe it was just a
phenomenon that we also see elsewhere as range anxiety. In any case,
he made no such proposal to Forth-200x, so apparently the need was not
pressing.
Julian Noble ignored the FP stack size issue in his first posting in
this thread, unlike the separate FP stack size issue, which he
supported. So it seems that he did not care about a larger FP stack
size. In the other posting he endorsed moving FP stack items to the
data stack, but he did not write why; for all we know he might have
wanted that as a first step for getting the mantissa, exponent and
sign of the FP value as integer (and the other direction for
synthesizing FP numbers from these parts).
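For illustration, such a decomposition might look like this in Forth.
This is a minimal sketch, assuming 64-bit cells, '$'-prefixed hex
literals, and that DF! stores IEEE binary64; the names FBITS and
FP>PARTS are my own:

create fbits 1 dfloats allot
: fp>parts ( F: r -- ) ( -- mant exp sign )
  fbits df!  fbits @ >r
  r@ $000FFFFFFFFFFFFF and   \ 52-bit mantissa field
  r@ 52 rshift $7FF and      \ 11-bit biased exponent
  r> 63 rshift ;             \ sign bit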
He appears to dislike the idea of standard-imposed minimums (e.g. Carter's suggestion of 16) but suggested:
a) the user can offload to memory if necessary from
fpu hardware;
b) an ANS FLOATING and FLOATING EXT wordset includes
the necessary hooks to extend the fp stack.
On 2025-07-03 17:11, albert@spenarnc.xs4all.nl wrote:
In article <1043831$3ggg9$1@dont-email.me>,
Ruvim <ruvim.pinka@gmail.com> wrote:
On 2025-07-02 15:37, albert@spenarnc.xs4all.nl wrote:
In article <1042s2o$3d58h$1@dont-email.me>,
Ruvim <ruvim.pinka@gmail.com> wrote:
On 2025-06-24 01:03, minforth wrote:
[...]
For me, the small syntax extension is a convenience when working
with longer definitions. A bit contrived (:= synonym for TO):
: SOME-APP { a f: b c | temp == n: flag z: freq }
\ inputs: integer a, floats b c
\ uninitialized: float temp
\ outputs: integer flag, complex freq
   <: FUNC < ... calc function ... > ;>
BTW, why do you prefer the special syntax `<: ... ;>`
over an extension to the existing words `:` and `;`
   : SOME-APP
      [ : FUNC < ... calc function ... > ; ]
      < ... >
   ;
In this approach the word `:` knows that it's a nested definition and
behaves accordingly.
Or it need not even know it, if [ is smart enough to compile a jump to
after ].
This can be tricky because the following should work:
  create foo [ 123 , ] [ 456 ,
  : bar [ ' foo compile, 123 lit, ] ;
If this bothers you, rename it to [[ ]].
Once we enhance [ ] to do things prohibited by the standard
(adding nested definitions), I can't be bothered with this too much.
The standard does not prohibit a system from supporting nested
definitions in whichever way that does not violate the standard behavior.
Yes, something like "private[ ... ]private" is a possible approach, and
its implementation seems simpler than adding the smarts to `:` and `;`
(and other defining words, if any).
The advantage of this approach over "<: ... ;>" is that you can define
not only colon-definitions, but also constants, variables, immediate
words, one-time macros, etc.
 : foo ( F: r.coefficient -- r.result )
   private[
     variable cnt
     0e fvalue k
     : [x] ... ; immediate
   ]private
   to k  0 cnt !
   ...
 ;
It's also possible to associate the word list of private words with the containing word's xt for debugging purposes.
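At file scope, a minimal sketch of such a pair needs only the standard
SEARCH-ORDER wordset; using it inside a colon definition, as above,
additionally requires system support for nested definitions. The names
PRIVATE-WL and SAVED-CURRENT are my own:

wordlist constant private-wl
variable saved-current
: private[ ( -- )
  get-current saved-current !
  get-order private-wl swap 1+ set-order  \ search private words first
  private-wl set-current ;                \ new definitions go there
: ]private ( -- )
  previous                    \ drop private-wl from the search order
  saved-current @ set-current ;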
On 07-07-2025 05:48, dxf wrote:
...
He appears to dislike the idea of standard-imposed minimums (e.g. Carter's
suggestion of 16) but suggested:
  a) the user can offload to memory if necessary from
  fpu hardware;
  b) an ANS FLOATING and FLOATING EXT wordset includes
  the necessary hooks to extend the fp stack.
In 4tH, there are two (high-level) FP systems, with 6 predetermined configurations. Configurations 0-2 don't have an FP stack; they use the data stack. Configurations 3-5 have a separate FP stack and double the precision. The standard FP stack size is 16; you can extend it by defining a constant before including the FP libs.
As for SSE2 it wouldn't exist if industry didn't consider
double-precision adequate.
dxf <dxforth@gmail.com> writes:
As for SSE2 it wouldn't exist if industry didn't consider
double-precision adequate.
SSE2 is/was first and foremost a vectorizing extension, and it has been superseded quite a few times, indicating it was never all that
adequate. I don't know whether any of its successors support extended precision though.
W. Kahan was a big believer in extended precision (that's why the 8087
had it from the start). I believe IEEE specifies both 80 bit and 128
bit formats in addition to 64 bit. The RISC-V spec includes encodings
for 128 bit IEEE but I don't know if any RISC-V hardware actually
implements it. I think there are some IBM mainframe CPUs that have it.
You don't need 64-bit doubles for signal or image processing.
Most vector/matrix operations on streaming data don't require
them either. Whether SSE2 is adequate or not to handle such data
depends on the application.
"Industry" can manage well with 32-bit floats or even smaller with non-standard number formats.
I suspect IEEE simply standardized what had become common practice among implementers.
From what little I know of SSE2, it's not as well thought out or organized
as Intel's original effort. E.g. doing something as simple as changing
the sign of an fp number is a pain when NaNs are factored in.
minforth <minforth@gmx.net> writes:
You don't need 64-bit doubles for signal or image processing.
Most vector/matrix operations on streaming data don't require
them either. Whether SSE2 is adequate or not to handle such data
depends on the application.
Sure, and for that matter, AI inference uses 8 bit and even 4 bit
floating point.
Kahan on the other hand was interested in engineering
and scientific applications like PDE solvers (airfoils, fluid dynamics,
FEM, etc.). That's an area where roundoff error builds up over many iterations, hence the interest in extended precision.
dxf <dxforth@gmail.com> writes:
I suspect IEEE simply standardized what had become common practice among
implementers.
No, it was really new and interesting. https://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html
From what little I know of SSE2, it's not as well thought out or organized
as Intel's original effort. E.g. doing something as simple as changing
the sign of an fp number is a pain when NaNs are factored in.
I wonder if later SSE/AVX/whatever versions fixed this stuff.
I don't do parallelization, but I was still surprised by the good
results using FMA. In other words, increasing floating-point number
size is not always the way to go.
Anyhow, the first step is to select the best fp rounding method ...
dxf <dxforth@gmail.com> writes:
As for SSE2 it wouldn't exist if industry didn't consider
double-precision adequate.
SSE2 is/was first and foremost a vectorizing extension, and it has been
superseded quite a few times, indicating it was never all that
adequate.
I don't know whether any of its successors support extended
precision though.
W. Kahan was a big believer in extended precision (that's why the 8087
had it from the start). I believe IEEE specifies both 80 bit and 128
bit formats in addition to 64 bit.
I suspect IEEE simply standardized what had become common practice among
implementers.
By using 80 bits /internally/ Intel went a long way to
achieving IEEE's spec for double precision.
E.g. doing something as simple as changing
sign of an fp number is a pain when NANs are factored in.
The catch with SSE is there's nothing like FCHS or FABS
so depending on how one implements them, results vary across implementations.
"Industry" can manage well with 32-bit
floats or even smaller with non-standard number formats.
On 10 Jul 2025 at 02:18:50 CEST, "minforth" <minforth@gmx.net> wrote:
"Industry" can manage well with 32-bit
floats or even smaller with non-standard number formats.
My customers beg to differ and some use 128 bit numbers for
their work. In a construction estimate for one runway for the
new Hong Kong airport, the cost difference between a 64 bit FP
calculation and the integer calculation was US$10 million.
This was for pile capping which involves a large quantity of relatively
small differences.
dxf <dxforth@gmail.com> writes:
The catch with SSE is there's nothing like FCHS or FABS
so depending on how one implements them, results vary across implementations.
You can see in Gforth how to implement FNEGATE and FABS with SSE2:
see fnegate
Code fnegate
0x000055e6a78a8274: add $0x8,%rbx
0x000055e6a78a8278: xorpd 0x24d8f(%rip),%xmm15 # 0x55e6a78cd010
0x000055e6a78a8281: mov %r15,%r9
0x000055e6a78a8284: mov (%rbx),%rax
0x000055e6a78a8287: jmp *%rax
end-code
ok
0x55e6a78cd010 16 dump
55E6A78CD010: 00 00 00 00 00 00 00 80 - 00 00 00 00 00 00 00 00
ok
see fabs
Code fabs
0x000055e6a78a84fe: add $0x8,%rbx
0x000055e6a78a8502: andpd 0x24b15(%rip),%xmm15 # 0x55e6a78cd020
0x000055e6a78a850b: mov %r15,%r9
0x000055e6a78a850e: mov (%rbx),%rax
0x000055e6a78a8511: jmp *%rax
end-code
ok
0x55e6a78cd020 16 dump
55E6A78CD020: FF FF FF FF FF FF FF 7F - 00 00 00 00 00 00 00 00
The actual implementation is in the xorpd instruction for FNEGATE, and in
the andpd instruction for FABS. The memory locations contain masks:
for FNEGATE only the sign bit is set, for FABS everything but the sign
bit is set.
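The same masks work at high level: store the float, flip or clear the
sign bit, fetch it back. A minimal sketch, assuming 64-bit cells,
'$'-prefixed hex literals, and IEEE binary64 DF@/DF!; BUF and the
primed names are my own:

create buf 1 dfloats allot
: fnegate' ( F: r -- r' )
  buf df!  buf @ $8000000000000000 xor buf !  buf df@ ;  \ flip sign bit
: fabs' ( F: r -- r' )
  buf df!  buf @ $7FFFFFFFFFFFFFFF and buf !  buf df@ ;  \ clear sign bit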
Sure you can implement FNEGATE and FABS in more complicated ways, but
you can also implement them in more complicated ways if you use the
387 instruction set. Here's an example of more complicated
implementations:
see fnegate
FNEGATE
( 004C4010 4833C0 ) XOR RAX, RAX
( 004C4013 F34D0F7EC8 ) MOVQ XMM9, XMM8
( 004C4018 664C0F6EC0 ) MOVQ XMM8, RAX
( 004C401D F2450F5CC1 ) SUBSD XMM8, XMM9
( 004C4022 C3 ) RET/NEXT
( 19 bytes, 5 instructions )
ok
see fabs
FABS
( 004C40B0 E8FBEFFFFF ) CALL 004C30B0 FS@
( 004C40B5 4885DB ) TEST RBX, RBX
( 004C40B8 488B5D00 ) MOV RBX, [RBP]
( 004C40BC 488D6D08 ) LEA RBP, [RBP+08]
( 004C40C0 0F8D05000000 ) JNL/GE 004C40CB
( 004C40C6 E845FFFFFF ) CALL 004C4010 FNEGATE
( 004C40CB C3 ) RET/NEXT
( 28 bytes, 7 instructions )
I believe IEEE specifies both 80 bit and 128 bit formats in addition
to 64 bit.
Not 80-bit format. binary128 and binary256 are specified.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I believe IEEE specifies both 80 bit and 128 bit formats in addition
to 64 bit.
Not 80-bit format. binary128 and binary256 are specified.
I see, 80 bits is considered double-extended. "The x87 and Motorola
68881 80-bit formats meet the requirements of the IEEE 754-1985 double extended format,[12] as does the IEEE 754 128-bit binary format." (https://en.wikipedia.org/wiki/Extended_precision)
Interestingly, Kahan's 1997 report on IEEE 754's status does say 80 bit
is specified. But it sounds like that omits some nuance.
https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
On 10.07.2025 at 21:33, Paul Rubin wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I believe IEEE specifies both 80 bit and 128 bit formats in addition
to 64 bit.
Not 80-bit format. binary128 and binary256 are specified.
I see, 80 bits is considered double-extended. "The x87 and Motorola
68881 80-bit formats meet the requirements of the IEEE 754-1985 double
extended format,[12] as does the IEEE 754 128-bit binary format."
(https://en.wikipedia.org/wiki/Extended_precision)
Interestingly, Kahan's 1997 report on IEEE 754's status does say 80 bit
is specified. But it sounds like that omits some nuance.
https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
Kahan was also overly critical of dynamic Unum/Posit formats.
Time has shown that he was partially wrong: https://spectrum.ieee.org/floating-point-numbers-posits-processor
minforth <minforth@gmx.net> writes:
Kahan was also overly critical of dynamic Unum/Posit formats.
Time has shown that he was partially wrong:
https://spectrum.ieee.org/floating-point-numbers-posits-processor
I don't feel qualified to draw a conclusion from this. I wonder what
the numerics community thinks, if there is any consensus. I remember
being dubious of posits when I first heard of them, though Kahan
probably influenced that. I do know that IEEE 754 took a lot of trouble
to avoid undesirable behaviours that never would have occurred to most
of us. No idea how well posits do at that. I guess though, given the continued attention they get, they must be more interesting than I had thought.
I saw one of the posit articles criticizing IEEE 754 because IEEE 754 addition is not always associative. But that is inherent in how
floating point arithmetic works, and I don't see how posit addition can
avoid it. Let a = 1e100, b = -1e100, and c=1. So mathematically,
a+b+c=1. You should get that from (a+b)+c in your favorite floating
point format. But a+(b+c) will almost certainly be 0, without very high precision (300+ bits).
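That example is easy to reproduce on any Forth system with IEEE
binary64 floats:

1e100 fconstant a  -1e100 fconstant b  1e0 fconstant c
a b f+ c f+ f.   \ (a+b)+c prints 1.
a b c f+ f+ f.   \ a+(b+c) prints 0.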
When someone begins with a line like this, it rarely ends well:
"Twenty years ago anarchy threatened floating-point arithmetic."
One floating-point to rule them all.
dxf <dxforth@gmail.com> writes:
When someone begins with a line like this, it rarely ends well:
"Twenty years ago anarchy threatened floating-point arithmetic."
One floating-point to rule them all.
This gives a good perspective on posits:
https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf
Floating point arithmetic in the 1960s (before my time) was really in a terrible state. Kahan has written about it. Apparently IBM 360
floating point arithmetic had to be redesigned after the fact, because
the original version had such weird anomalies.
But was it the case by the mid/late 70's - or did certain individuals
see an opportunity to influence the burgeoning microprocessor market?
Notions of single and double precision already existed in software
floating point -
I have looked at a (IIRC) slide deck by Kahan where he shows examples
where the alternative by Gustafson (don't remember which one he
looked at in that slide deck) fails and traditional FP numbers work.
dxf <dxforth@gmail.com> writes:
But was it the case by the mid/late 70's - or did certain individuals
see an opportunity to influence the burgeoning microprocessor market?
Notions of single and double precision already existed in software
floating point -
Hardware floating point also had single and double precision. The
really awful 1960s systems were gone by the mid 70s. But there were a
lot of competing formats, ranging from bad to mostly-ok. VAX floating
point was mostly ok, DEC wanted IEEE to adopt it, Kahan was ok with
that, but Intel thought "go for the best possible". Kahan's
retrospectives on this stuff are good reading:
On 11/07/2025 1:17 pm, Paul Rubin wrote:
This gives a good perspective on posits:
https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf
Floating point arithmetic in the 1960s (before my time) was really in a
terrible state. Kahan has written about it. Apparently IBM 360
floating point arithmetic had to be redesigned after the fact, because
the original version had such weird anomalies.
But was it the case by the mid/late 70's - or did certain individuals
see an opportunity to influence the burgeoning microprocessor market?
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I have looked at a (IIRC) slide deck by Kahan where he shows examples
where the alternative by Gustafson (don't remember which one he
looked at in that slide deck) fails and traditional FP numbers work.
Maybe this: http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf
What is there not to like with the FPU? It provides 80 bits, which
is in itself a useful additional format, and should never have problems
with single and double-precision edge cases.
The only problem is that some languages and companies find it necessary
to boycott FPU use.
mhx@iae.nl (mhx) writes:
What is there not to like with the FPU? It provides 80 bits, which
is in itself a useful additional format, and should never have problems
with single and double-precision edge cases.
If you want to do double precision, using the 387 stack has the double-rounding problem <https://en.wikipedia.org/wiki/Rounding#Double_rounding>. Even if you
limit the mantissa to 53 bits, you still get double rounding when you
deal with numbers that are denormal numbers in binary64
representation. Java wanted to give the same results, bit for bit, on
all hardware, and ran afoul of this until they could switch to SSE2.
The only problem is that some languages and companies find it necessary
to boycott FPU use.
The rest of the industry has standardized on binary64 and binary32,
and they prefer bit-equivalent results for ease of testing. So as
soon as SSE2 gave that to them, they flocked to SSE2.
...
In any case, FP numbers are used in very diverse ways. Not everybody
needs all the features, and even fewer features are consciously
needed, but that's the usual case with things that are not
custom-tailored for your application.
On 11/07/2025 8:22 pm, Anton Ertl wrote:
The rest of the industry has standardized on binary64 and binary32,
and they prefer bit-equivalent results for ease of testing. So as
soon as SSE2 gave that to them, they flocked to SSE2.
...
I wonder how much of this is academic or trend inspired?
AFAICS Forth
clients haven't flocked to it else vendors would have SSE2 offerings at
the same level as their x387 packs.
...
For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
the only one with hardware FP for many years, so there probably was
little pressure from users for bit-identical results with, say, SPARC, because they did not have a Forth system that ran on SPARC.
...
And as long as customers did not ask for bit-identical results to
those on, say, a Raspi, there was little reason to reimplement FP with
SSE2. I wonder if the development of the SSE2 package for VFX was
influenced by the availability of VFX for the Raspi.
These Forth systems also don't do global register allocation or auto-vectorization, so two other reasons why, e.g., C compilers chose
to use SSE2 on AMD64 (where SSE2 was guaranteed to be available) don't
exist for them.
- anton
On 13/07/2025 7:01 pm, Anton Ertl wrote:
...
For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
the only one with hardware FP for many years, so there probably was
little pressure from users for bit-identical results with, say, SPARC,
because they did not have a Forth system that ran on SPARC.
What do you mean by "bit-identical results"? Since SSE2 comes without
transcendentals (or basics such as FABS and FNEGATE) and implementers
are expected to supply their own, if anything, I expect results across
platforms and compilers to vary.
dxf <dxforth@gmail.com> writes:
On 13/07/2025 7:01 pm, Anton Ertl wrote:
...
For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system
was the only one with hardware FP for many years, so there
probably was little pressure from users for bit-identical results
with, say, SPARC, because they did not have a Forth system that
ran on SPARC.
What do you mean by "bit-identical results"? Since SSE2 comes
without transcendentals (or basics such as FABS and FNEGATE) and
implementers are expected to supply their own, if anything, I expect
results across platforms and compilers to vary.
There are operations for which IEEE 754 specifies the result to the
last bit (except that AFAIK the representation of NaNs is not
specified exactly), among them F+ F- F* F/ FSQRT, probably also
FNEGATE and FABS. It does not specify the exact result for
transcendental functions, but if your implementation performs the same bit-exact operations for computing a transcendental function on two
IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental functions, and your results will be bit-identical; concerning the
NaNs, if you find a difference, check if the involved values are NaNs.
- anton
On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
[..] if your implementation performs the same
bit-exact operations for computing a transcendental function on two
IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
NaNs, if you find a difference, check if the involved values are NaNs.
When e.g. summing the elements of a DP vector, it is hard to see why
that couldn't be done on the FPU stack (with 80 bits) before (possibly)
storing the result to a DP variable in memory. I am not sure that Forth
users would be able to resist that approach.
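For reference, the straightforward sequential sum of such a vector is
shown below; NAIVE-SUM is my own name, and on a 387-based system the
running total indeed stays on the FPU stack at full 80-bit width:

: naive-sum ( addr u -- ) ( F: -- r )
  0e 0 ?do dup df@ f+ dfloat+ loop drop ;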
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
So just use the same implementations of transcendental functions, and
your results will be bit-identical
Same implementations = same FP operations in the exact same order?
That seems hard to ensure, if the functions are implemented in a
language that leaves anything up to a compiler.
Also, in the early implementations x87, 68881, NS320something(?),
transcendentals were included in the coprocessor and the workings
weren't visible.
mhx@iae.nl (mhx) writes:
[..]
On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
The question is: What properties do you want your computation to have?
[..]
2) A more accurate result? How much more accuracy?
3) More performance?
C) Perform tree addition
a) Using 80-bit addition. This will be faster than sequential
addition because in many cases several additions can run in
parallel. It will also be quite accurate because it uses 80-bit
addition, and because the addition chains are reduced to
ld(length(vector)).
So, as you can see, depending on your objectives there may be more
attractive ways to add a vector than what you suggested. Your
suggestion actually looks pretty unattractive, except if your
objectives are "ease of implementation" and "more accuracy than the
naive approach".
Now riscv is the future.
I don't know. From what I learned, RISC-V
is strongly compiler-oriented. They wrote,
for example, that it lacks any condition codes.
Only conditional branches are predicated on
examining the contents of registers at the time
of the branch. No "add with carry" nor "subtract
with carry". From an assembly point of view, the
lack of a carry flag is a PITA if you desire to
do multi-word mathematical manipulation of numbers.
So it seems that the RISC-V architecture is intended
to be used by compilers generating code from high level
languages.
On 15.07.2025 at 17:25, LIT wrote:
I read somewhere:
The standard is now managed by RISC-V International, which
has more than 3,000 members and which reported that more
than 10 billion chips containing RISC-V cores had shipped
by the end of 2022. Many implementations of RISC-V are
available, both as open-source cores and as commercial
IP products.
You call that compiler-oriented???
On 16/07/2025 12:09 pm, minforth wrote:
It depends on how many are being programmed by the likes of GCC.
When ATMEL hit the market the manufacturer claimed their chips
were designed with compilers in mind. Do Arduino users program
in hand-coded assembler? Do you? It's no longer just the chip's
features and theoretical performance one has to worry about but
the compilers too.
On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
C) Perform tree addition
a) Using 80-bit addition. This will be faster than sequential
addition because in many cases several additions can run in
parallel. It will also be quite accurate because it uses 80-bit
addition, and because the addition chains are reduced to
ld(length(vector)).
This looks very interesting. I can find Kahan and Neumaier, but
"tree addition" didn't turn up (There is a suspicious looking
reliability paper about the approach which surely is not what
you meant). Or is it pairwise addition that I should look for?
I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
gforth-fast     iforth          lxf             SwiftForth      VFX
 3_057_979_501   6_482_017_334   6_087_130_593   6_021_777_424   6_034_560_441  NAI
 6_601_284_920   6_452_716_125   7_001_806_497   6_606_674_147   6_713_703_069  UNR
 3_787_327_724   2_949_273_264   1_641_710_689   7_437_654_901   1_298_257_315  REC
 9_150_679_812  14_634_786_781                                                  SR

instructions:u
gforth-fast     iforth          lxf             SwiftForth      VFX
13_113_842_702   6_264_132_870   9_011_308_923  11_011_828_048   8_072_637_768  NAI
 6_802_702_884   2_553_418_501   4_238_099_417  11_277_658_203   3_244_590_981  UNR
 9_370_432_755   4_489_562_792   4_955_679_285  12_283_918_226   3_915_367_813  REC
51_113_853_111  29_264_267_850                                                  SR
I did not do any accuracy measurements, but I did performance
measurements
YMMV but "fast but wrong" would not be my goal. ;-)
But I decided to use a recursive approach (recursive-sum, REC) that
uses the largest 2^k<n as the left child and the rest as the right
child, and as base cases for the recursion use a straight-line
balanced-tree evaluation for 2^k with k<=7 (and combine these for n
that are not 2^k). For systems with tiny FP stacks, I added the
option to save intermediate results on a software stack in the
recursive word. Concerning the straight-line code, it turned out that
the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
FP stack items); it's not clear to me why; on lxf I can use k=7 (and
it uses the 387 stack, too).
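Stripped of the straight-line base cases and the software-stack option,
the recursive idea reduces to pairwise summation by halving. A minimal
sketch (PAIRWISE-SUM is my own name); note that the FP stack depth
grows with ld(u), which is why the base-case size matters on systems
with small FP stacks:

: pairwise-sum ( addr u -- ) ( F: -- r )
  dup 2 u< if
    if df@ else drop 0e then        \ one element, or none
  else
    2dup 2/ recurse                 \ sum of the first u/2 elements
    dup 2/ tuck - >r dfloats + r>   \ address and length of the rest
    recurse f+
  then ;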
On 16.07.2025 at 13:25, Anton Ertl wrote:
I did not do any accuracy measurements, but I did performance
measurements
YMMV but "fast but wrong" would not be my goal. ;-)
minforth <minforth@gmx.net> writes:
On 16.07.2025 at 13:25, Anton Ertl wrote:
I did not do any accuracy measurements, but I did performance
measurements
YMMV but "fast but wrong" would not be my goal. ;-)
I did test correctness with cases where roundoff errors do not play a
role.
As mentioned, the RECursive balanced-tree sum (which is also the
fastest on several systems, and the fastest overall) is expected to be more
accurate in those cases where roundoff errors do play a role. But if
you care about that, better design a test and test it yourself. It
will be interesting to see how you find out which result is more
accurate when they differ.
It depends on how many are being programmed by the likes of GCC.
When ATMEL hit the market the manufacturer claimed their chips
were designed with compilers in mind. Do Arduino users program
in hand-coded assembler? Do you? It's no longer just the chip's
features and theoretical performance one has to worry about but
the compilers too.
Regarding features, it's worth mentioning
that ATMELs are actually quite nice to
program in ML, even if they were
designed "with compilers in mind".
...
I have run this test now on my Ryzen 9950X for lxf, lxf64 and a
snapshot of gforth.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
gforth-fast     iforth          lxf             SwiftForth      VFX
 3_057_979_501   6_482_017_334   6_087_130_593   6_021_777_424   6_034_560_441  NAI
 6_601_284_920   6_452_716_125   7_001_806_497   6_606_674_147   6_713_703_069  UNR
 3_787_327_724   2_949_273_264   1_641_710_689   7_437_654_901   1_298_257_315  REC
 9_150_679_812  14_634_786_781                                                  SR

instructions:u
gforth-fast     iforth          lxf             SwiftForth      VFX
13_113_842_702   6_264_132_870   9_011_308_923  11_011_828_048   8_072_637_768  NAI
 6_802_702_884   2_553_418_501   4_238_099_417  11_277_658_203   3_244_590_981  UNR
 9_370_432_755   4_489_562_792   4_955_679_285  12_283_918_226   3_915_367_813  REC
51_113_853_111  29_264_267_850                                                  SR
- anton
Reminds me of the 6502 for some reason. But it's the 'skip next
instruction on bit in register' that throws me.
Didn't get that in the good old days as products were expected to
have a reasonable lifetime. Today CPU designs are as 'throw away'
as everything else. No reason to believe RISC-V will be different.
Only thing distinguishing it are the years of hype and promise.
Well, that is strange ...
Results with the current iForth are quite different:
FORTH> bench ( see file quoted above + usual iForth timing words )
\ 7963 times
\ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
\ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
\ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
\ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok
Ryzen 9950X
lxf64
5,010,566,495 NAI cycles:u
2,011,359,782 UNR cycles:u
646,926,001 REC cycles:u
3,589,863,082 SR cycles:u
lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u
gforth-fast 20250219
2,048,316,578 NAI cycles:u
7,157,520,448 UNR cycles:u
3,589,638,677 REC cycles:u
17,199,889,916 SR cycles:u
gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u
lxf
6,005,617,374 NAI cycles:u
6,004,157,635 UNR cycles:u
1,303,627,835 REC cycles:u
9,187,422,499 SR cycles:u
lxf
9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
26,018,680,499 SR instructions:u
lxf uses the x87 builtin fp stack, lxf64 uses sse4 and a large fp stack
Meanwhile, many years ago, comparative tests were carried out with a
couple of representative archived serial data sets (~50k samples).
Ultimately, Kahan summation was the winner. It is slow, but there were
no in-the-loop requirements, so for a background task Kahan was fast
enough.
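For reference, compensated (Kahan) summation of a vector of 64-bit
floats might look like the sketch below, assuming the standard
FLOATING-EXT wordset; KAHAN-SUM is my own name. It needs at most four
FP stack items, comfortably within the standard minimum of six:

: kahan-sum ( addr u -- ) ( F: -- sum )
  0e 0e                           \ F: sum c
  0 ?do
    dup df@ dfloat+               \ fetch x, advance addr  ; F: sum c x
    fswap f-                      \ y := x - c             ; F: sum y
    fover fover f+                \ t := sum + y           ; F: sum y t
    frot fover fswap f- frot f-   \ c := (t - sum) - y     ; F: t c
  loop
  drop fdrop ;                    \ discard addr and c     ; F: sum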
minforth <minforth@gmx.net> writes:
Meanwhile many years ago, comparative tests were carried out with a
couple of representative archived serial data (~50k samples)
Representative of what? Serial: what series?
peter <peter.noreply@tin.it> writes:
Ryzen 9950X
lxf64
5,010,566,495 NAI cycles:u
2,011,359,782 UNR cycles:u
646,926,001 REC cycles:u
3,589,863,082 SR cycles:u
lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u
gforth-fast 20250219
2,048,316,578 NAI cycles:u
7,157,520,448 UNR cycles:u
3,589,638,677 REC cycles:u
17,199,889,916 SR cycles:u
gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u
lxf
6,005,617,374 NAI cycles:u
6,004,157,635 UNR cycles:u
1,303,627,835 REC cycles:u
9,187,422,499 SR cycles:u
lxf
9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
26,018,680,499 SR instructions:u
lxf uses the x87 builtin fp stack, lxf64 uses sse4 and a large fp stack
Apparently the latency of ADDSD (SSE2) is down to 2 cycles on Zen5
(visible in lxf64 UNR and gforth-fast NAI) while the latency of FADD
(387) is still 6 cycles (lxf NAI and UNR). I have no explanation why
on lxf64 NAI performs so much worse than UNR, and in gforth-fast UNR
so much worse than NAI.
For REC the latency should not play a role. There lxf64 performs at
7.2 IPC and 1.55 F+/cycle, whereas lxf performs only at 3.8 IPC and 0.77 F+/cycle. My guess is that FADD can only be performed by one FPU, and
that's connected to one dispatch port, and other instructions also
need or are at least assigned to this dispatch port.
- anton
mhx@iae.nl (mhx) writes:
[..]
Well, that is strange ...
The output should be the approximate number of seconds. Here's what I
get from the cycles:u numbers for iForth 5.1-mini given in the earlier postings:
\ ------------ input ---------- | output
6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok
The resulting numbers are not very different from those you show. My measurements include iForth's startup overhead, which may be one
explanation why they are a little higher.
dxf <dxforth@gmail.com> writes:
On 13/07/2025 7:01 pm, Anton Ertl wrote:
...
For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
the only one with hardware FP for many years, so there probably was
little pressure from users for bit-identical results with, say, SPARC,
because they did not have a Forth system that ran on SPARC.
What do you mean by "bit-identical results"? Since SSE2 comes without
transcendentals (or basics such as FABS and FNEGATE) and implementers
are expected to supply their own, if anything, I expect results across
platforms and compilers to vary.
There are operations for which IEEE 754 specifies the result to the
last bit (except that AFAIK the representation of NaNs is not
specified exactly), among them F+ F- F* F/ FSQRT, probably also
FNEGATE and FABS. It does not specify the exact result for
transcendental functions, but if your implementation performs the same bit-exact operations for computing a transcendental function on two
IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental functions, and your results will be bit-identical; concerning the
NaNs, if you find a difference, check if the involved values are NaNs.
So in mandating bit-identical results, not only in calculations but also
input/output, IEEE 754 is all about giving the illusion of truth in
floating-point when, if anything, they should be warning users: don't be
fooled.
I did a test coding the sum128 as a code word with avx-512 instructions
and got the following results
285,584,376 cycles:u
941,856,077 instructions:u
timing was
timer-reset ' recursive-sum bench .elapsed 51 ms elapsed
so half the time of the original recursive.
with 32 zmm registers I could have done a sum256 also
; Horizontal sum of zmm0
vextractf64x4 ymm1, zmm0, 1
vaddpd ymm2, ymm1, ymm0
vextractf64x2 xmm3, ymm2, 1
vaddpd ymm4, ymm3, ymm2
vhaddpd xmm0, xmm4, xmm4
peter <peter.noreply@tin.it> writes:
I did a test coding the sum128 as a code word with avx-512 instructions
and got the following results
285,584,376 cycles:u
941,856,077 instructions:u
timing was
timer-reset ' recursive-sum bench .elapsed 51 ms elapsed
so half the time of the original recursive.
with 32 zmm registers I could have done a sum256 also
One could do sum128 with just 8 registers by performing the adds ASAP,
i.e., for sum32
vmovapd zmm0, [rbx]
vmovapd zmm1, [rbx+64]
vaddpd zmm0, zmm0, zmm1
vmovapd zmm1, [rbx+128]
vmovapd zmm2, [rbx+192]
vaddpd zmm1, zmm1, zmm2
vaddpd zmm0, zmm0, zmm1
; and then the Horizontal sum
And you can code this as:
vmovapd zmm0, [rbx]
vaddpd zmm0, zmm0, [rbx+64]
vmovapd zmm1, [rbx+128]
vaddpd zmm1, zmm1, [rbx+192]
vaddpd zmm0, zmm0, zmm1
; and then the Horizontal sum
; Horizontal sum of zmm0
vextractf64x4 ymm1, zmm0, 1
vaddpd ymm2, ymm1, ymm0
vextractf64x2 xmm3, ymm2, 1
vaddpd ymm4, ymm3, ymm2
vhaddpd xmm0, xmm4, xmm4
Instead of doing the horizontal sum once for every sum128, it might be
more efficient (assuming the whole thing is not
cache-bandwidth-limited) to have the result of sum128 be a full SIMD
width, and then add them up with vaddpd instead of addsd, and do the horizontal sum once in the end.
But if the recursive part is to be programmed in Forth, we would need
a way to represent a SIMD width of data in Forth, maybe with a SIMD
stack. I see a few problems there:
* What to do about the mask registers of AVX-512? In the RISC-V
vector extension masks are stored in regular SIMD registers.
* There is a trend visible in ARM SVE and the RISC-V Vector extension
to have support for dealing with loops across longer vectors. Do we
also need to support something like that?
For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>
One way to deal with all that would be to have a long-vector stack and
have something like my vector wordset
<https://github.com/AntonErtl/vectors>, where the sum of a vector
would be a word that is implemented in some lower-level way (e.g.,
assembly language); the sum of a vector is actually a planned, but not
yet existing feature of this wordset.
An advantage of having a (short) SIMD stack would be that one could
use SIMD operations for other uses where the long-vector wordset looks
too heavy-weight (or would need optimizations to get rid of the
long-vector overhead). The question is if enough such uses exist to
justify adding such a stack.
- anton
On Sat, 19 Jul 2025 10:18:15 GMT,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
[sum32]
vmovapd zmm0, [rbx]
vaddpd zmm0, zmm0, [rbx+64]
vmovapd zmm1, [rbx+128]
vaddpd zmm1, zmm1, [rbx+192]
vaddpd zmm0, zmm0, zmm1
; and then the Horizontal sum
; Horizontal sum of zmm0
vextractf64x4 ymm1, zmm0, 1
vaddpd ymm2, ymm1, ymm0
vextractf64x2 xmm3, ymm2, 1
vaddpd ymm4, ymm3, ymm2
vhaddpd xmm0, xmm4, xmm4
The SIMD instructions can also take a memory operand, so
I can do sum128 as:
code asum128b
movsd [r13-0x8], xmm0
lea r13, [r13-0x8]
vmovapd zmm0, [rbx]
vaddpd zmm0, zmm0, [rbx+64]
vaddpd zmm0, zmm0, [rbx+128]
vaddpd zmm0, zmm0, [rbx+192]
vaddpd zmm0, zmm0, [rbx+256]
vaddpd zmm0, zmm0, [rbx+320]
vaddpd zmm0, zmm0, [rbx+384]
vaddpd zmm0, zmm0, [rbx+448]
vaddpd zmm0, zmm0, [rbx+512]
vaddpd zmm0, zmm0, [rbx+576]
vaddpd zmm0, zmm0, [rbx+640]
vaddpd zmm0, zmm0, [rbx+704]
vaddpd zmm0, zmm0, [rbx+768]
vaddpd zmm0, zmm0, [rbx+832]
vaddpd zmm0, zmm0, [rbx+896]
vaddpd zmm0, zmm0, [rbx+960]
On 19.07.2025 at 12:18, Anton Ertl wrote:
One way to deal with all that would be to have a long-vector stack and
have something like my vector wordset
<https://github.com/AntonErtl/vectors>, where the sum of a vector
would be a word that is implemented in some lower-level way (e.g.,
assembly language); the sum of a vector is actually a planned, but not
yet existing feature of this wordset.
Not wanting to sound negative, but who in practice adds up long
vectors, apart from testing compilers and fp-arithmetic?
Dot products, on the other hand, are fundamental for many linear
algebra algorithms, eg. matrix multiplication and AI.
dxf <dxforth@gmail.com> writes:
So in mandating bit-identical results, not only in calculations but also
input/output
I don't think that IEEE 754 specifies I/O, but I could be wrong.
IEEE 754 is all about giving the illusion of truth in
floating-point when, if anything, they should be warning users don't be
fooled.
I don't think that IEEE 754 mentions truth. It does, however, specify
the inexact "exception" (actually a flag), which allows you to find
out if the results of the computations are exact or if some rounding
was involved.
minforth <minforth@gmx.net> writes:
On 19.07.2025 at 12:18, Anton Ertl wrote:
One way to deal with all that would be to have a long-vector stack and
have something like my vector wordset
<https://github.com/AntonErtl/vectors>, where the sum of a vector
would be a word that is implemented in some lower-level way (e.g.,
assembly language); the sum of a vector is actually a planned, but not
yet existing feature of this wordset.
Not wanting to sound negative, but who in practice adds up long
vectors, apart from testing compilers and fp-arithmetic?
Everyone who does dot-products.
Dot products, on the other hand, are fundamental for many linear
algebra algorithms, eg. matrix multiplication and AI.
If I add a vector-sum word
df+red ( dfv -- r )
\ r is the sum of the elements of dfv
to the vector wordset, then the dot-product is:
: dot-product ( dfv1 dfv2 -- r )
df*v df+red ;
Concerning matrix multiplication, while you can use the dot-product
for it, there are many other ways to do it, and some are more
efficient (although, admittedly, I have not used pairwise addition for
these ways).
AFAICS IEEE 754 offers nothing particularly useful for the end-user.
Either one's fp application works - or it doesn't. IEEE hasn't
changed that.
IEEE's relevance is that it spurred Intel into making an FPU which in
turn made implementing fp easy.
dxf <dxforth@gmail.com> writes:
AFAICS IEEE 754 offers nothing particularly useful for the end-user.
Either one's fp application works - or it doesn't. IEEE hasn't
changed that.
The purpose of IEEE FP was to improve the numerical accuracy of
applications that used it as opposed to other formats.
IEEE's relevance is that it spurred Intel into making an FPU which in
turn made implementing fp easy.
Exactly the opposite: Intel decided that it wanted to make an FPU and it wanted the FPU to have the best FP arithmetic possible. So it
commissioned Kahan (a renowned FP expert) to design the FP format.
Kahan said "Why not use the VAX format? It is pretty good". Intel said
it didn't want pretty good, it wanted the best, so Kahan said "ok" and designed the 8087 format.
The IEEE standardization process happened AFTER the 8087 was already in progress. Other manufacturers signed onto it, some of them overcoming initial resistance, after becoming convinced that it was the right
thing.
http://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html
mhx wrote:
On Sun, 6 Oct 2024 7:51:31 +0000, dxf wrote:
Is there an easier way of doing this? End goal is a double number representing centi-secs.
empty decimal
: SPLIT ( a u c -- a2 u2 a3 u3 ) >r 2dup r> scan 2swap 2 pick - ;
: >INT ( adr len -- u ) 0 0 2swap >number 2drop drop ;
: /T ( a u -- $hour $min $sec )
2 0 do [char] : split 2swap dup if 1 /string then loop
2 0 do dup 0= if 2rot 2rot then loop ;
: .T 2swap 2rot cr >int . ." hr " >int . ." min " >int . ." sec " ;
s" 1:2:3" /t .t
s" 02:03" /t .t
s" 03" /t .t
s" 23:59:59" /t .t
s" 0:00:03" /t .t
Why don't you use the fact that >NUMBER returns the given
string starting with the first unconverted character?
SPLIT should be redundant.
-marcel
: CHAR-NUMERIC? ( char -- flag ) 48 58 WITHIN ;  \ ASCII '0'..'9'
: SKIP-NON-NUMERIC ( adr u -- adr2 u2)
BEGIN
DUP IF OVER C@ CHAR-NUMERIC? NOT ELSE 0 THEN
WHILE
1 /STRING
REPEAT ;
: SCAN-NEXT-NUMBER ( n adr len -- n2 adr2 len2)
2>R 60 * 0. 2R> >NUMBER
2>R D>S + 2R> ;
: PARSE-TIME ( adr len -- seconds)
0 -ROT
BEGIN
SKIP-NON-NUMERIC
DUP
WHILE
SCAN-NEXT-NUMBER
REPEAT
2DROP ;
S" hello 1::36 world" PARSE-TIME CR .
96 ok
: get-number ( accum adr len -- accum' adr' len' )
{ adr len }
0. adr len >number { adr' len' }
len len' =
if
2drop adr len 1 /string
else
d>s swap 60 * +
adr' len'
then ;
: parse-time ( adr len -- seconds)
0 -rot
begin
dup
while
get-number
repeat
2drop ;
s" foo-bar" parse-time . 0
s" foo55bar" parse-time . 55
s" foo 1 bar 55 zoo" parse-time . 155
...
s" and9foo 1 bar 55 zoo" parse-time . 32515