• Re: Parsing timestamps?

    From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Mon Jul 7 11:30:10 2025

    In article <2025Jul6.133027@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>
    Skip Carter did not post in this thread, but given that he proposed
the change, he probably found 6 to be too few; or maybe it was just a
phenomenon that we also see elsewhere as range anxiety. In any case,
he made no such proposal to Forth-200x, so apparently the need was not
pressing.

Note that the vast experience Wagner has trumps the anxiety others
    may or may not have.

    <SNIP>
    In any case, in almost all cases I use the default FP pack, and here
    the VFX-5 and SwiftForth-4 approach is unbeatable in simplicity.
    Instead of performing the sequence of commands shown above, I just
    start the Forth system, and FP words are ready.

    And even
    WANT -fp-
    is not much of a hassle in ciforth.

    <SNIP>


    - anton

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Mon Jul 7 13:21:36 2025

    On 07-07-2025 05:48, dxf wrote:
    On 6/07/2025 9:30 pm, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    On 5/07/2025 6:49 pm, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    [8 stack items on the FP stack]
Puzzling because of a thread here not long ago in which scientific users
appear to suggest the opposite. Such concerns have apparently been around
a long time:

https://groups.google.com/g/comp.lang.forth/c/CApt6AiFkxo/m/wwZmc_Tr1PcJ
    I have read through the thread. It's unclear to me which scientific
    users you have in mind. My impression is that 8 stack items was
    deemed sufficient by many, and preferable (on 387) for efficiency
    reasons.

AFAICS both Skip Carter (proponent) and Julian Noble were suggesting the
6-level minimum was inadequate.

    Skip Carter did not post in this thread, but given that he proposed
    the change, he probably found 6 to be too few; or maybe it was just a
    phenomenon that we also see elsewhere as range anxiety. In any case,
    he made no such proposal to Forth-200x, so apparently the need was not
    pressing.

    Julian Noble ignored the FP stack size issue in his first posting in
this thread, unlike the separate FP stack issue, which he
    supported. So it seems that he did not care about a larger FP stack
    size. In the other posting he endorsed moving FP stack items to the
    data stack, but he did not write why; for all we know he might have
    wanted that as a first step for getting the mantissa, exponent and
    sign of the FP value as integer (and the other direction for
    synthesizing FP numbers from these parts).

    He appears to dislike the idea of standard-imposed minimums (e.g. Carter's suggestion of 16) but suggested:

    a) the user can offload to memory if necessary from
    fpu hardware;

    b) an ANS FLOATING and FLOATING EXT wordset includes
    the necessary hooks to extend the fp stack.

In 4tH, there are two (high-level) FP systems, with 6 predetermined
configurations. Configs 0-2 don't have an FP stack; they use the data
stack. Configs 3-5 have a separate FP stack and double the precision.
The default FP stack size is 16; you can extend it by defining a constant
before including the FP libs.
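
A minimal sketch of that pattern (the constant name FP-STACKSIZE and the
library path are guesses for illustration; the real names are in the 4tH
documentation):

FP-STACKSIZE ( hypothetical name )
32 constant FP-STACKSIZE  \ ask for a 32-deep FP stack
include lib/fp3.4th       \ then include the FP library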

    Hans Bezemer

  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Mon Jul 7 14:31:03 2025

    On 03-07-2025 18:47, Ruvim wrote:
    On 2025-07-03 17:11, albert@spenarnc.xs4all.nl wrote:
    In article <1043831$3ggg9$1@dont-email.me>,
    Ruvim  <ruvim.pinka@gmail.com> wrote:
    On 2025-07-02 15:37, albert@spenarnc.xs4all.nl wrote:
    In article <1042s2o$3d58h$1@dont-email.me>,
    Ruvim  <ruvim.pinka@gmail.com> wrote:
    On 2025-06-24 01:03, minforth wrote:
    [...]

    For me, the small syntax extension is a convenience when working
    with longer definitions. A bit contrived (:= synonym for TO):

    : SOME-APP { a f: b c | temp == n: flag z: freq }
    \ inputs: integer a, floats b c
    \ uninitialized: float temp
    \ outputs: integer flag, complex freq
        <: FUNC < ... calc function ... > ;>

    BTW, why do you prefer the special syntax `<: ... ;>`
over an extension to the existing words `:` and `;`?

        : SOME-APP
           [ : FUNC < ... calc function ... > ; ]
           < ... >
        ;

In this approach the word `:` knows that it's a nested definition and
behaves accordingly.

Or it need not even know it, if [ is smart enough to compile a jump to
    after ].

    This can be tricky because the following should work:

       create foo [ 123 , ] [ 456 ,

       : bar  [ ' foo compile, 123 lit, ] ;

If this bothers you, rename it to [[ ]].

Once we enhance [ ] to do things prohibited by the standard
(adding nested definitions), I can't be bothered with this too much.


    The standard does not prohibit a system from supporting nested
definitions in any way that does not violate the standard behavior.


    Yes, something like "private[ ... ]private" is a possible approach, and
    its implementation seems simpler than adding the smarts to `:` and `;`
    (and other defining words, if any).

    The advantage of this approach over "<: ... ;>" is that you can define
    not only colon-definitions, but also constants, variables, immediate
    words, one-time macros, etc.


      : foo ( F: r.coefficient -- r.result )
        private[
          variable cnt
          0e fvalue k
          : [x] ... ; immediate
        ]private
        to k   0 cnt !
        ...
      ;

It's also possible to associate the word list of private words with
the containing word's xt for debugging purposes.

    4tH has always allowed it, since it considered : and ; as branches -
like AHEAD. Since [: and ;] are just aliases for :NONAME and ;, they
work essentially the same.

    I never used it, because it would cause portability issues - and I
    considered it "bad style".

    The same goes for allocation (VARIABLE, VALUE, STRING, ARRAY). These are
    in 4tH basically just directives - NOTHING IS ACTUALLY ALLOCATED. That
    works just fine.

    So the whole shebang would practically work out of the box. But of
    course, to follow the complete example, I had to do the FP stuff as well
    - and I wanted to do a bit of protecting the words between the tags.

    In short, that boils down to this (for FOO):

    1018| branch 1036 foo
    1019| literal 1020
    1020| branch 1030 <== jump after opcode 1030
    1021| literal 0 <== mantissa 0
    1022| literal 0 <== exponent 0
    1023| variable 2 *k_f
    1024| call 0 2!
    1025| branch 1027 k <== DOES> definition
    1026| variable 2 *k_f
    1027| branch 7 2@ <== end DOES> def.
    1028| branch 1029 [x]
    1029| exit 0
    1030| exit 0 <== end of "private"
    1031| drop 0 <== drop the XT
    1032| variable 2 *k_f
    1033| call 0 2!
    1034| literal 0
    1035| to 1 cnt
    1036| exit 0

Which is the compiled form of this (preprocessor) code:

    :macro ... ;
    :macro private[ [: ;
    :macro ]private ;] drop ;
    :macro 0e 0 S>F ;

    include lib/fp1.4th
    include 4pp/lib/fvalue.4pp

    : foo ( F: r.coefficient -- r.result )
    private[
    variable cnt
    0e fvalue k
    : [x] ... ; immediate
    ]private
    fto k 0 cnt !
    ...
    ;

    100 s>f foo k f. cnt ? cr

The weird code generated from 1021-1027 is the result of this code
(generated by the preprocessor):

    0 S>F FLOAT ARRAY k
    AKA k *k_f LATEST F! :REDO k F@ ;

    1. FP "zero" is thrown on the stack;
    2. A variable with the capacity of a "float" is created (FVARIABLE);
    3. Copy that symbol in the symbol table;
    4. Initialize the FP variable;
    5. Create a "DOES>" definition that fetches that value.

But it *does* work as advertised.

    Hans Bezemer
  • From dxf@dxforth@gmail.com to comp.lang.forth on Tue Jul 8 13:17:28 2025

    On 7/07/2025 9:21 pm, Hans Bezemer wrote:
    On 07-07-2025 05:48, dxf wrote:
    ...
He appears to dislike the idea of standard-imposed minimums (e.g. Carter's
suggestion of 16) but suggested:

       a) the user can offload to memory if necessary from
       fpu hardware;

       b) an ANS FLOATING and FLOATING EXT wordset includes
       the necessary hooks to extend the fp stack.

In 4tH, there are two (high-level) FP systems, with 6 predetermined
configurations. Configs 0-2 don't have an FP stack; they use the data
stack. Configs 3-5 have a separate FP stack and double the precision.
The default FP stack size is 16; you can extend it by defining a constant
before including the FP libs.

Given the ANS minimum of 6 and recognizing that merely displaying an fp
number can consume several positions, I added another 5 as headroom. I
organized it such that if the interpreter or ?STACK was invoked, anything
more than 6 would report the overflow:

    DX-Forth 4.60 2025-06-25

    Software floating-point (separate stack)

    1e 2e 3e 4e 5e 6e .s 1. 2. 3. 4. 5. 6.000001 <f ok

    7e 7e f-stack?

    Whether it was worth it is hard to say. OTOH users are free to patch those limits
    to anything they like and do a COLD or SAVE-SYSTEM to enact the change.

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Jul 9 15:10:30 2025

    dxf <dxforth@gmail.com> writes:
    As for SSE2 it wouldn't exist if industry didn't consider
    double-precision adequate.

    SSE2 is/was first and foremost a vectorizing extension, and it has been superseded quite a few times, indicating it was never all that
    adequate. I don't know whether any of its successors support extended precision though.

    W. Kahan was a big believer in extended precision (that's why the 8087
had it from the start). I believe IEEE specifies both 80 bit and 128
    bit formats in addition to 64 bit. The RISC-V spec includes encodings
    for 128 bit IEEE but I don't know if any RISC-V hardware actually
    implements it. I think there are some IBM mainframe CPUs that have it.
  • From minforth@minforth@gmx.net to comp.lang.forth on Thu Jul 10 02:18:50 2025

On 10.07.2025 at 00:10, Paul Rubin wrote:
    dxf <dxforth@gmail.com> writes:
    As for SSE2 it wouldn't exist if industry didn't consider
    double-precision adequate.

    SSE2 is/was first and foremost a vectorizing extension, and it has been superseded quite a few times, indicating it was never all that
    adequate. I don't know whether any of its successors support extended precision though.

    You don't need 64-bit doubles for signal or image processing.
    Most vector/matrix operations on streaming data don't require
    them either. Whether SSE2 is adequate or not to handle such data
    depends on the application. "Industry" can manage well with 32-bit
    floats or even smaller with non-standard number formats.

    The AVX extension introduced YMM registers that can do simultaneous
    math on four 64-bit double-precision floating-point numbers.
    The intended application domain was scientific computing.

The determining factors are data throughput and storage space.
Today, with GPUs, it is speed and power consumption, driven by AI.


  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Jul 10 14:16:18 2025

    On 10/07/2025 8:10 am, Paul Rubin wrote:
    dxf <dxforth@gmail.com> writes:
    As for SSE2 it wouldn't exist if industry didn't consider
    double-precision adequate.

    SSE2 is/was first and foremost a vectorizing extension, and it has been superseded quite a few times, indicating it was never all that
    adequate. I don't know whether any of its successors support extended precision though.

    W. Kahan was a big believer in extended precision (that's why the 8087
had it from the start). I believe IEEE specifies both 80 bit and 128
    bit formats in addition to 64 bit. The RISC-V spec includes encodings
    for 128 bit IEEE but I don't know if any RISC-V hardware actually
    implements it. I think there are some IBM mainframe CPUs that have it.

    I suspect IEEE simply standardized what had become common practice among implementers. By using 80 bits /internally/ Intel went a long way to
    achieving IEEE's spec for double precision.

From what little I know of SSE2, it's not as well thought out or organized
    as Intel's original effort. E.g. doing something as simple as changing
    sign of an fp number is a pain when NANs are factored in. With the x87,
    Intel 'got it right the first time'. Except for the stack size and
    efforts to fix it.


  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Jul 9 21:32:42 2025

    minforth <minforth@gmx.net> writes:
    You don't need 64-bit doubles for signal or image processing.
    Most vector/matrix operations on streaming data don't require
    them either. Whether SSE2 is adequate or not to handle such data
    depends on the application.

    Sure, and for that matter, AI inference uses 8 bit and even 4 bit
    floating point. Kahan on the other hand was interested in engineering
    and scientific applications like PDE solvers (airfoils, fluid dynamics,
    FEM, etc.). That's an area where roundoff error builds up after many iterations, thus extended precision.

    "Industry" can manage well with 32-bit floats or even smaller with non-standard number formats.

    Depends on your notion of "industry".
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Jul 9 21:35:20 2025

    dxf <dxforth@gmail.com> writes:
    I suspect IEEE simply standardized what had become common practice among implementers.

    No, it was really new and interesting. https://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html

From what little I know of SSE2, it's not as well thought out or organized
    as Intel's original effort. E.g. doing something as simple as changing
    sign of an fp number is a pain when NANs are factored in.

    I wonder if later SSE/AVX/whatever versions fixed this stuff.
  • From minforth@minforth@gmx.net to comp.lang.forth on Thu Jul 10 07:37:02 2025

On 10.07.2025 at 06:32, Paul Rubin wrote:
    minforth <minforth@gmx.net> writes:
    You don't need 64-bit doubles for signal or image processing.
    Most vector/matrix operations on streaming data don't require
    them either. Whether SSE2 is adequate or not to handle such data
    depends on the application.

    Sure, and for that matter, AI inference uses 8 bit and even 4 bit
    floating point.

    Or fuzzy control for instance.

    Kahan on the other hand was interested in engineering
    and scientific applications like PDE solvers (airfoils, fluid dynamics,
    FEM, etc.). That's an area where roundoff error builds up after many iterations, thus extended precision.


    That's why I use Kahan summation for dot products. It is slow but
    rounding error accumulation remains small. A while ago I read an
    article about this issue in which the author(s) performed extensive tests
    of different dot product calculation algorithms on many serial
    data sets from finance, geology, oil industry, meteorology etc.
    Their target criterion was to find an acceptable balance between
    computational speed and minimal error.

    The 'winner' was a chained fused-multiply-add algorithm (many
    CPUs/GPUs can perform FMA in hardware) which makes for shorter code
    (good for caching). And it supports speed improvement by
    parallelization (recursive halving of the sets until manageable
    vector size followed by parallel computation).
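
A sketch of the FMA-chained variant under the same array assumption;
standard Forth has no FMA word, so the fusable multiply-add is spelled
F* F+ at the spot where a system-specific FMA primitive would go:

: fma-dot ( x-addr y-addr n -- ) ( F: -- dot )
  0e                      \ running sum
  0 ?do
    over i floats + f@    \ F: sum x[i]
    dup  i floats + f@    \ F: sum x[i] y[i]
    f* f+                 \ a hardware FMA word would fuse these two
  loop
  2drop ;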

    I don't do parallelization, but I was still surprised by the good
    results using FMA. In other words, increasing floating-point number
    size is not always the way to go. Anyhow, first step is to select
    the best fp rounding method ....
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Jul 10 15:56:30 2025

    On 10/07/2025 2:35 pm, Paul Rubin wrote:
    dxf <dxforth@gmail.com> writes:
    I suspect IEEE simply standardized what had become common practice among
    implementers.

    No, it was really new and interesting. https://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html

From what little I know of SSE2, it's not as well thought out or organized
    as Intel's original effort. E.g. doing something as simple as changing
    sign of an fp number is a pain when NANs are factored in.

    I wonder if later SSE/AVX/whatever versions fixed this stuff.

Actually I was wrong. x87 FCHS (aka FNEGATE) changes the sign bit of a
NAN. IEEE doesn't consider NANs to be signed even though x87
implementations may display them that way. The catch with SSE is there's
nothing like FCHS or FABS, so depending on how one implements them,
results vary across implementations.

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Jul 9 22:59:00 2025

    minforth <minforth@gmx.net> writes:
    I don't do parallelization, but I was still surprised by the good
    results using FMA. In other words, increasing floating-point number
    size is not always the way to go.

    Kahan was an expert in clever numerical algorithms that avoid roundoff
    errors, Kahan summation being one such algorithm. But he realized that
    most programmers don't have the numerics expertise to come up with
    schemes like that. A simpler and usually effective way to avoid
    roundoff error swamping the result is simply to use double or extended precision. So that is what he often suggested.

    Here's an example of a FEM calculation that works well with 80 bit but
    poorly with 64 bit FP:

    https://people.eecs.berkeley.edu/~wkahan/Cantilever.pdf

    Anyhow, first step is to select the best fp rounding method ....

    Kahan advised compiling the program three times, once for each IEEE
    rounding mode. Run all three programs and see if the outputs differ by
    enough to care about. If they do, you have some precision loss to deal
    with somehow, possibly by use of wider floats.

    https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 10 07:47:23 2025

    Paul Rubin <no.email@nospam.invalid> writes:
    dxf <dxforth@gmail.com> writes:
    As for SSE2 it wouldn't exist if industry didn't consider
    double-precision adequate.

SSE2 is/was first and foremost a vectorizing extension, and it has been
superseded quite a few times, indicating it was never all that
    adequate.

    But SSE2 was also the way to finally implement mainstream floating
    point: double precision instead of extended precision (with its
    double-rounding woes when trying to implement double precision) and
    registers (for which register allocation algorithms have been worked
    on for a long time) instead of the stack. So starting with AMD64
    (which was guaranteed to include SSE2) SSE2 became the preferred
    scalar floating point instruction set, which is also reflected in the
ABIs on AMD64. And in this role SSE2 has not been superseded.

    Concerning vectors, AVX allows 256 bits of width, eliminates the
    alignment brain damage of SSE/SSE2, and gives us three-address
    instructions. AVX2 gives us integer instructions. The various
    AVX-512 extensions are a mess of overlapping extensions (to be unified
    by AVX10) that generally provide up to 512 bits of width and control
    of individual lanes with mask registers.

    I don't know whether any of its successors support extended
    precision though.

    No.

    W. Kahan was a big believer in extended precision (that's why the 8087
had it from the start). I believe IEEE specifies both 80 bit and 128
    bit formats in addition to 64 bit.

    Not 80-bit format. binary128 and binary256 are specified.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 10 08:07:02 2025

    dxf <dxforth@gmail.com> writes:
I suspect IEEE simply standardized what had become common practice among
implementers.

    Not at all. There was no common practice at the time.

There was some sentiment to standardize the VAX FP stuff, and as far
as number formats are concerned, they almost did (IEEE binary32
uses the same format as the VAX F, IEEE binary64 uses the same format
as VAX G, and IEEE binary128 uses the same format as VAX H), if we
ignore the perverse byte order of the VAX formats. However, IEEE FP
uses a different bias for the exponent, and requires implementing
denormal numbers, infinities and NaNs.

    So actually none of the hardware manufacturers implemented IEEE FP at
    the time, not DEC, not IBM, and not Cray. And yet, industry accepted
    IEEE FP and within a few years all new architectures supported IEEE
    FP, and new models of existing hardware usually also implemented IEEE
    FP.

    By using 80 bits /internally/ Intel went a long way to
    achieving IEEE's spec for double precision.

    The 8087 did not just use 80 bits internally, it exposed them to
    programmers. When Intel released the 8087, IEEE 754 was not finished.
    But Kahan was both active in the standardization community and in the
    8087 development, so you can find his ideas in both. His and Intel's
    idea was that the 8087 would be IEEE standard-conforming, but given
    that the standard came out later, that was not quite the case.

    E.g. doing something as simple as changing
    sign of an fp number is a pain when NANs are factored in.

    I don't see that. When you change the sign of a NaN, it's still a
    NaN.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 10 08:35:49 2025

    dxf <dxforth@gmail.com> writes:
    The catch with SSE is there's nothing like FCHS or FABS
    so depending on how one implements them, results vary across implementations.

    You can see in Gforth how to implement FNEGATE and FABS with SSE2:

    see fnegate
    Code fnegate
    0x000055e6a78a8274: add $0x8,%rbx
    0x000055e6a78a8278: xorpd 0x24d8f(%rip),%xmm15 # 0x55e6a78cd010
    0x000055e6a78a8281: mov %r15,%r9
    0x000055e6a78a8284: mov (%rbx),%rax
    0x000055e6a78a8287: jmp *%rax
    end-code
    ok
    0x55e6a78cd010 16 dump
    55E6A78CD010: 00 00 00 00 00 00 00 80 - 00 00 00 00 00 00 00 00
    ok
    see fabs
    Code fabs
    0x000055e6a78a84fe: add $0x8,%rbx
    0x000055e6a78a8502: andpd 0x24b15(%rip),%xmm15 # 0x55e6a78cd020
    0x000055e6a78a850b: mov %r15,%r9
    0x000055e6a78a850e: mov (%rbx),%rax
    0x000055e6a78a8511: jmp *%rax
    end-code
    ok
    0x55e6a78cd020 16 dump
    55E6A78CD020: FF FF FF FF FF FF FF 7F - 00 00 00 00 00 00 00 00

The actual implementation is the xorpd instruction for FNEGATE, and
the andpd instruction for FABS. The memory locations contain masks:
    for FNEGATE only the sign bit is set, for FABS everything but the sign
    bit is set.
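
The same mask trick can be expressed in high-level Forth by manipulating
the stored bit pattern with integer operations; a sketch assuming
binary64 floats, 64-bit cells (as on Gforth/AMD64), and Forth-2012
$-prefixed hex literals:

fvariable fbits  \ scratch buffer for the bit pattern
: fnegate-bits ( F: r -- -r )
  fbits f!  fbits @  $8000000000000000 xor  fbits !  fbits f@ ;
: fabs-bits ( F: r -- |r| )
  fbits f!  fbits @  $7FFFFFFFFFFFFFFF and  fbits !  fbits f@ ;

Like xorpd/andpd, these touch only the sign bit, so -0e and NaNs are
handled uniformly.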

    Sure you can implement FNEGATE and FABS in more complicated ways, but
    you can also implement them in more complicated ways if you use the
    387 instruction set. Here's an example of more complicated
    implementations:

    see fnegate
    FNEGATE
    ( 004C4010 4833C0 ) XOR RAX, RAX
    ( 004C4013 F34D0F7EC8 ) MOVQ XMM9, XMM8
    ( 004C4018 664C0F6EC0 ) MOVQ XMM8, RAX
    ( 004C401D F2450F5CC1 ) SUBSD XMM8, XMM9
    ( 004C4022 C3 ) RET/NEXT
    ( 19 bytes, 5 instructions )
    ok
    see fabs
    FABS
    ( 004C40B0 E8FBEFFFFF ) CALL 004C30B0 FS@
    ( 004C40B5 4885DB ) TEST RBX, RBX
    ( 004C40B8 488B5D00 ) MOV RBX, [RBP]
    ( 004C40BC 488D6D08 ) LEA RBP, [RBP+08]
    ( 004C40C0 0F8D05000000 ) JNL/GE 004C40CB
    ( 004C40C6 E845FFFFFF ) CALL 004C4010 FNEGATE
    ( 004C40CB C3 ) RET/NEXT
    ( 28 bytes, 7 instructions )

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From Stephen Pelc@stephen@vfxforth.com to comp.lang.forth on Thu Jul 10 08:50:43 2025

    On 10 Jul 2025 at 02:18:50 CEST, "minforth" <minforth@gmx.net> wrote:

    "Industry" can manage well with 32-bit
    floats or even smaller with non-standard number formats.

    My customers beg to differ and some use 128 bit numbers for
    their work. In a construction estimate for one runway for the
    new Hong Kong airport, the cost difference between a 64 bit FP
    calculation and the integer calculation was US 10 million dollars.
    This was for pile capping which involves a large quantity of relatively
    small differences.

    Stephen
    --
    Stephen Pelc, stephen@vfxforth.com
    Wodni & Pelc GmbH
    Vienna, Austria
    Tel: +44 (0)7803 903612, +34 649 662 974 http://www.vfxforth.com/downloads/VfxCommunity/
    free VFX Forth downloads
  • From minforth@minforth@gmx.net to comp.lang.forth on Thu Jul 10 12:14:24 2025

On 10.07.2025 at 10:50, Stephen Pelc wrote:
    On 10 Jul 2025 at 02:18:50 CEST, "minforth" <minforth@gmx.net> wrote:

    "Industry" can manage well with 32-bit
    floats or even smaller with non-standard number formats.

    My customers beg to differ and some use 128 bit numbers for
    their work. In a construction estimate for one runway for the
    new Hong Kong airport, the cost difference between a 64 bit FP
    calculation and the integer calculation was US 10 million dollars.
    This was for pile capping which involves a large quantity of relatively
    small differences.

    You are right. "Industry" is one of those non-words that should be
    used with care, or avoided altogether, before it becomes a tautology.

    IIRC I only had one real application for 128-bit floats: simulation
    of heat propagation through thick-walled tubes. The simulation
    involved numerical integration which can be prone to error accumulation.
    One variant of MinForth's fp-number wordset can be built with gcc's
    libquadmath library. It is slower, but speed is not always important.
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Jul 10 21:09:21 2025

    On 10/07/2025 6:35 pm, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    The catch with SSE is there's nothing like FCHS or FABS
    so depending on how one implements them, results vary across implementations.

You can see in Gforth how to implement FNEGATE and FABS with SSE2:
<SNIP>
Sure you can implement FNEGATE and FABS in more complicated ways, but
you can also implement them in more complicated ways if you use the
387 instruction set. Here's an example of more complicated
implementations:
<SNIP>

The latter were basically what existed in the implementation. As they
don't handle -ve zero (or NANs) I swapped them out for the former ones
you mention.

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Thu Jul 10 12:33:52 2025

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I believe IEEE specifies both 80 bit and 128 bit formats in addition
    to 64 bit.
    Not 80-bit format. binary128 and binary256 are specified.

    I see, 80 bits is considered double-extended. "The x87 and Motorola
    68881 80-bit formats meet the requirements of the IEEE 754-1985 double
    extended format,[12] as does the IEEE 754 128-bit binary format." (https://en.wikipedia.org/wiki/Extended_precision)

    Interestingly, Kahan's 1997 report on IEEE 754's status does say 80 bit
    is specified. But it sounds like that omits some nuance.

    https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
  • From minforth@minforth@gmx.net to comp.lang.forth on Thu Jul 10 23:16:27 2025

On 10.07.2025 at 21:33, Paul Rubin wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I believe IEEE specifies both 80 bit and 128 bit formats in addition
    to 64 bit.
    Not 80-bit format. binary128 and binary256 are specified.

    I see, 80 bits is considered double-extended. "The x87 and Motorola
    68881 80-bit formats meet the requirements of the IEEE 754-1985 double extended format,[12] as does the IEEE 754 128-bit binary format." (https://en.wikipedia.org/wiki/Extended_precision)

    Interestingly, Kahan's 1997 report on IEEE 754's status does say 80 bit
    is specified. But it sounds like that omits some nuance.

    https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF

    Kahan was also overly critical of dynamic Unum/Posit formats.

    Time has shown that he was partially wrong: https://spectrum.ieee.org/floating-point-numbers-posits-processor
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Thu Jul 10 18:40:32 2025

    minforth <minforth@gmx.net> writes:
    Kahan was also overly critical of dynamic Unum/Posit formats.
    Time has shown that he was partially wrong: https://spectrum.ieee.org/floating-point-numbers-posits-processor

    I don't feel qualified to draw a conclusion from this. I wonder what
    the numerics community thinks, if there is any consensus. I remember
    being dubious of posits when I first heard of them, though Kahan
    probably influenced that. I do know that IEEE 754 took a lot of trouble
    to avoid undesirable behaviours that never would have occurred to most
    of us. No idea how well posits do at that. I guess though, given the continued attention they get, they must be more interesting than I had
    thought.

    I saw one of the posit articles criticizing IEEE 754 because IEEE 754
    addition is not always associative. But that is inherent in how
    floating point arithmetic works, and I don't see how posit addition can
    avoid it. Let a = 1e100, b = -1e100, and c=1. So mathematically,
    a+b+c=1. You should get that from (a+b)+c in your favorite floating
    point format. But a+(b+c) will almost certainly be 0, without very high precision (300+ bits).
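
That example is easy to check at the keyboard in any Forth with binary64
floats:

1e100 -1e100 f+ 1e f+ f.  \ (a+b)+c: prints 1.
1e100 -1e100 1e f+ f+ f.  \ a+(b+c): prints 0.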
  • From dxf@dxforth@gmail.com to comp.lang.forth on Fri Jul 11 13:13:51 2025

    On 11/07/2025 7:16 am, minforth wrote:
On 10.07.2025 at 21:33, Paul Rubin wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I believe IEEE specifies both 80 bit and 128 bit formats in addition
    to 64 bit.
    Not 80-bit format.  binary128 and binary256 are specified.

    I see, 80 bits is considered double-extended.  "The x87 and Motorola
    68881 80-bit formats meet the requirements of the IEEE 754-1985 double
    extended format,[12] as does the IEEE 754 128-bit binary format."
    (https://en.wikipedia.org/wiki/Extended_precision)

    Interestingly, Kahan's 1997 report on IEEE 754's status does say 80 bit
    is specified.  But it sounds like that omits some nuance.

    https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF

    Kahan was also overly critical of dynamic Unum/Posit formats.

    Time has shown that he was partially wrong: https://spectrum.ieee.org/floating-point-numbers-posits-processor

When someone begins with a line like this, it rarely ends well:

    "Twenty years ago anarchy threatened floating-point arithmetic."

    One floating-point to rule them all.

  • From minforth@minforth@gmx.net to comp.lang.forth on Fri Jul 11 05:15:49 2025

On 11.07.2025 at 03:40, Paul Rubin wrote:
    minforth <minforth@gmx.net> writes:
    Kahan was also overly critical of dynamic Unum/Posit formats.
    Time has shown that he was partially wrong:
    https://spectrum.ieee.org/floating-point-numbers-posits-processor

    I don't feel qualified to draw a conclusion from this. I wonder what
    the numerics community thinks, if there is any consensus. I remember
    being dubious of posits when I first heard of them, though Kahan
    probably influenced that. I do know that IEEE 754 took a lot of trouble
    to avoid undesirable behaviours that never would have occurred to most
    of us. No idea how well posits do at that. I guess though, given the continued attention they get, they must be more interesting than I had thought.

<SNIP>

AFAIK CUDA does not support posits (yet). BFLOAT16 etc. still win the
game, until the AI industry pours big money into the chip foundries
for posit math GPUs.

Even then, it is questionable whether or when it would seep into the
general-purpose CPU market.

    For Forthers to play with, of course. ;o)
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Thu Jul 10 20:17:55 2025

    dxf <dxforth@gmail.com> writes:
When someone begins with a line like this, it rarely ends well:
    "Twenty years ago anarchy threatened floating-point arithmetic."
    One floating-point to rule them all.

    This gives a good perspective on posits:

    https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf

    Floating point arithmetic in the 1960s (before my time) was really in a terrible state. Kahan has written about it. Apparently IBM 360
    floating point arithmetic had to be redesigned after the fact, because
    the original version had such weird anomalies.
  • From dxf@dxforth@gmail.com to comp.lang.forth on Fri Jul 11 15:34:49 2025

    On 11/07/2025 1:17 pm, Paul Rubin wrote:
    dxf <dxforth@gmail.com> writes:
When someone begins with a line like this, it rarely ends well:
    "Twenty years ago anarchy threatened floating-point arithmetic."
    One floating-point to rule them all.

    This gives a good perspective on posits:

    https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf

    Floating point arithmetic in the 1960s (before my time) was really in a terrible state. Kahan has written about it. Apparently IBM 360
    floating point arithmetic had to be redesigned after the fact, because
    the original version had such weird anomalies.

But was it the case by the mid/late 70's, or did certain individuals see
an opportunity to influence the burgeoning microprocessor market? Notions
of single and double precision already existed in software floating point -
    most notably in the Microsoft binary format. We're talking apps such as Microsoft's Fortran for CP/M. Back then MS was very serious about quashing
    any issues customers found.

  • From minforth@minforth@gmx.net to comp.lang.forth on Fri Jul 11 09:09:00 2025

On 11.07.2025 at 05:17, Paul Rubin wrote:
    dxf <dxforth@gmail.com> writes:
When someone begins with a line like this, it rarely ends well:
    "Twenty years ago anarchy threatened floating-point arithmetic."
    One floating-point to rule them all.

    This gives a good perspective on posits:

    https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf


    Quintessence:

    Overburdened or incompetent programmers +
    Posits are tricky beasts ==>
    Programmers _need_ AI co-workers to avoid pitfalls

    Modern times....
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 11 07:02:05 2025

    minforth <minforth@gmx.net> writes:
On 10.07.2025 at 21:33, Paul Rubin wrote:
    Kahan was also overly critical of dynamic Unum/Posit formats.

Time has shown that he was partially wrong:
https://spectrum.ieee.org/floating-point-numbers-posits-processor

    What is supposed to be partially wrong?

FP numbers have a number of not-so-nice properties, and John L.
    Gustafson uses that somewhat successfully to sell his alternatives to
    the gullible. The way to do that is to give some examples where
    traditional FP numbers fail and his alternative under consideration
    works. I have looked at a (IIRC) slide deck by Kahan where he shows
examples where the alternative by Gustafson (don't remember which
    one he looked at in that slide deck) fails and traditional FP numbers
    work.

    Where does that leave us? Kahan makes the good argument that
    numerical analysts have worked out techniques to deal with the
    shortcomings of traditional FP numbers for over 70 years. For
    Gustafson's number formats these techniques are not applicable; maybe
    one can find new ones for these number formats, but that's not clear.

    For Posits (Type III Unums), which are close to traditional FP in many respects, one can see how that would work out; while traditional FP
    has a fixed division between mantissa and exponents, in Posits the
    division depends on the size of the exponent. This means that
    reasoning about the accuracy of the computation would have to consider
    the size of the exponent, and is therefore more complex than for
    traditional FP; with a little luck you can produce a result that gives
    an error bound based on the smallest mantissa size, but that error
bound will be worse than for traditional FP.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri Jul 11 00:55:43 2025

    dxf <dxforth@gmail.com> writes:
But was it the case by the mid/late 70's, or did certain individuals see
an opportunity to influence the burgeoning microprocessor market? Notions
of single and double precision already existed in software floating point -

    Hardware floating point also had single and double precision. The
    really awful 1960s systems were gone by the mid 70s. But there were a
    lot of competing formats, ranging from bad to mostly-ok. VAX floating
    point was mostly ok, DEC wanted IEEE to adopt it, Kahan was ok with
    that, but Intel thought "go for the best possible". Kahan's
    retrospectives on this stuff are good reading:

    http://people.eecs.berkeley.edu/~wkahan/index.htm

    I've linked a few of them. I liked the quote

    It was remarkable that so many hardware people there, knowing how
    difficult p754 would be, agreed that it should benefit the community
    at large. If it encouraged the production of floating-point software
    and eased the development of reliable software, it would help create a
    larger market for everyone's hardware. This degree of altruism was so
    astonishing that MATLAB's creator Dr. Cleve Moler used to advise
    foreign visitors not to miss the country's two most awesome
    spectacles: the Grand Canyon, and meetings of IEEE p754.

    from http://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri Jul 11 01:15:00 2025

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I have looked at a (IIRC) slide deck by Kahan where he shows examples
where the alternative by Gustafson (don't remember which one he
    looked at in that slide deck) fails and traditional FP numbers work.

    Maybe this: http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 11 07:27:19 2025

    Paul Rubin <no.email@nospam.invalid> writes:
    I guess though, given the
continued attention they get, they must be more interesting than I had
thought.

    IMO it's the usual case of a somewhat complex topic where existing
    solutions have undesirable properties, and someone promises a solution
    that supposedly solves these problems. The attention comes from the
    problems, not from the merits of the promised solution.

    There has been attention given to research into the philosopher's
    stone for many centuries; I don't think that makes it interesting
    other than as an example of how people fall for promises.

I saw one of the posit articles criticizing IEEE 754 because IEEE 754
addition is not always associative. But that is inherent in how
    floating point arithmetic works, and I don't see how posit addition can
    avoid it.

    If you only added posits of a given width, you couldn't. Therefore
    the posit specification also defines quire<n> types, which are
    fixed-point numbers that can represent all the values of the posit<n>
    types plus additional bits such that a sequence of a lot of additions
    does not overflow. If you add the posits using a quire as
    accumulator, and only then convert back to a posit, the whole thing is associative.

    Of course you could also introduce a fixed-point accumulator for
    traditional FP numbers and get the same benefit without using posits
    for the rest.

    A problem is how these accumulator types are represented in
    programming languages. If somebody writes

    0e n 0 ?do a i th f@ f+ loop x f!

    should the 0e be stored in the accumulator, and F+ be translated to an
    addition to the accumulator, and should the F! then convert the
    accumulator to FP? What about

    0e x f! n 0 ?do x f@ a i th f@ f+ x f! loop

    In Forth I would make the accumulator explicit, with separate
    FP-to-accumulator addition operations and explicit accumulator-to-fp conversion, but I expect that many people (across programming
    languages) would prefer an automatic approach that works with existing
    source code. We see that with auto-vectorization.
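
A sketch of what such an explicit interface could look like; the three
accumulator words are hypothetical, named here only for illustration:

\ acc0  ( -- )        clear the quire-style accumulator
\ f+acc ( F: r -- )   add r into the fixed-point accumulator, exactly
\ acc>f ( F: -- r )   round the accumulator to a float, once
: acc-sum ( addr n -- ) ( F: -- r )
  acc0
  0 ?do dup i floats + f@ f+acc loop
  drop acc>f ;

Rounding then happens exactly once, in ACC>F, so the result no longer
depends on the order of the additions.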

    How big would the accumulator be? Looking at <https://en.wikipedia.org/wiki/Unum_(number_format)#Quire>, for
    posit32 (the largest format given on the page) the quire32 type would
    have 512 bits, and would allow adding up of 2^151 posit32 numbers.

    Let's see how big an accumulator for binary32 would have to be: There
    are exponents for finite numbers from -126..127, i.e., 254 finite
    exponent values, and 23 mantissa bits, plus the sign bit, so every
    binary32 number can be represented as a 278-bit fixed-point number
    (with scale factor 2^-149). If you want to also allow intermediate
    results of, say, 2^64 additions (good for 97 years of additions at 6G
    additions per second), that increases the accumulator to 342 bits; but
    note that the bigger numbers can only be represented as infinity in
    binary32.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Fri Jul 11 08:57:09 2025

    On Fri, 11 Jul 2025 7:55:43 +0000, Paul Rubin wrote:

    dxf <dxforth@gmail.com> writes:
But was it the case by the mid/late 70's, or did certain individuals see
an opportunity to influence the burgeoning microprocessor market? Notions
of single and double precision already existed in software floating point -

    Hardware floating point also had single and double precision. The
    really awful 1960s systems were gone by the mid 70s. But there were a
    lot of competing formats, ranging from bad to mostly-ok. VAX floating
    point was mostly ok, DEC wanted IEEE to adopt it, Kahan was ok with
    that, but Intel thought "go for the best possible". Kahan's
    retrospectives on this stuff are good reading:

    What is there not to like with the FPU? It provides 80 bits, which
    is in itself a useful additional format, and should never have problems
    with single and double-precision edge cases. Plus it does all the
    trigonometric and transcendental stuff with a reasonable precision
    out-of-box. The instruction set is very regular and quite Forth-like.
    The only problem is that some languages and companies find it necessary
    to boycott FPU use.

    -marcel

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 11 08:33:12 2025

    dxf <dxforth@gmail.com> writes:
    On 11/07/2025 1:17 pm, Paul Rubin wrote:
    This gives a good perspective on posits:

    https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf

    Yes, that looks ok. One thing I noticed is that they suggest
    implementing the smaller posit formats by intelligent table lookup.
    If we have small bit widths and table lookup, I wonder if we should go
    for any variant of FP (including posits) at all, or if an
    exponent-only (i.e., logarithmic) representation would not be better.
    E.g., for 8 bits, out of the 256 values, 2 would represent infinities,
    one would represent NaN, and one would represent 0, leaving 252
remaining values. If we use 2^(1/11) (~1.065) as base B, this would give
    a number range of B^-126=0.000356 to B^125=2635. You can vary B to
    either give a more fine-grained resolution at the expense of a smaller
    number range or a larger number range at the expense of a finer
    resolution. <https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/>
    presents E4M3 with +-448 range, and E5M2 with +-57344 range. But note
    that the next number after 1 is 1.125 for E4M3 and 1.25 for E5M2, both
    more coarse-grained than the 1.065 that an exponent-only format with
B=2^(1/11) gives you.

    Addition and subtraction would be performed by table lookup (and would
    almost always be approximate), for multiplication and division an
    integer adder can be used.
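
A sketch of the multiplicative arithmetic in such an exponent-only
format, with base B=2^(1/11) and an assumed bias of 127 (the codes for
0, the infinities and NaN are ignored here):

127 constant lbias
: l*  ( c1 c2 -- c3 )  + lbias - ;  \ multiply = add log-codes
: l/  ( c1 c2 -- c3 )  - lbias + ;  \ divide = subtract log-codes
: l>f ( c -- ) ( F: -- r )          \ decode: 2^((c-lbias)/11)
  lbias - s>f 11e f/ 2e fswap f** ;

Addition and subtraction would go through the lookup table instead.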

    Floating point arithmetic in the 1960s (before my time) was really in a
    terrible state. Kahan has written about it. Apparently IBM 360
    floating point arithmetic had to be redesigned after the fact, because
    the original version had such weird anomalies.

But was it the case by the mid/late 70's, or did certain individuals see
an opportunity to influence the burgeoning microprocessor market?

    Yes, that's the thing with FP. Some people just do their computations
    and who cares if the results might be an artifact of numerical
instability. For weather forecasts, there is no telling if a bad
    prediction is due to a numerical error, due to imperfect measurement
    data, or because of the butterfly effect (which is a convenient
    excuse).

    Other people care more about the results, and perform numerical
    analysis. There are only a few specialists for that, and they have
    asked for and gotten features in IEEE 754 and the hardware that the
    vast majority of programmers never consciously uses, e.g., rounding
    modes or the inexact "exception" (actually a flag, not a Forth
    exception), which allows them to tell if there was a rounding error in
    a computation. But when you use a library designed with the help of
    numerical analysis, you might benefit from the existence of these
    features.

    They have also asked for and gotten things like denormal numbers,
    infinities and NaNs that result in fewer numerical pitfalls for
    programmers who are not numerical analysts. These features may be
    irrelevant for those who do weather prediction, but I expect that
    those who found that binary64 provided by VFX's SSE2-based package was
    not good enough may benefit from such features.

    In any case, FP numbers are used in very diverse ways. Not everybody
    needs all the features, and even fewer features are consciously
    needed, but that's the usual case with things that are not
custom-tailored for your application.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 11 10:14:49 2025

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I have looked at a (IIRC) slide deck by Kahan where he shows examples
where the alternative by Gustafson (don't remember which one he
    looked at in that slide deck) fails and traditional FP numbers work.

    Maybe this: http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf

    Yes.

    Here's a quote:

    | These claims pander to Ignorance and Wishful Thinking.

    That's my impression, too, and not just for Type I unums.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 11 10:22:54 2025

    mhx@iae.nl (mhx) writes:
    What is there not to like with the FPU? It provides 80 bits, which
    is in itself a useful additional format, and should never have problems
    with single and double-precision edge cases.

    If you want to do double precision, using the 387 stack has the
    double-rounding problem <https://en.wikipedia.org/wiki/Rounding#Double_rounding>. Even if you
    limit the mantissa to 53 bits, you still get double rounding when you
    deal with numbers that are denormal numbers in binary64
    representation. Java wanted to give the same results, bit for bit, on
    all hardware, and ran afoul of this until they could switch to SSE2.

    The only problem is that some languages and companies find it necessary
    to boycott FPU use.

    The rest of the industry has standardized on binary64 and binary32,
    and they prefer bit-equivalent results for ease of testing. So as
    soon as SSE2 gave that to them, they flocked to SSE2.

    Another nudge towards binary64 (and binary32) is autovectorization.
    You don't want to get different results depending on whether the
    compiler manages to auto-vectorize a program (and use SSE2 parallel
    (rather than scalar) instructions, AVX, or AVX-512) or not. So you
    also use SSE2 when it fails to auto-vectorize.

OTOH, e.g., on gcc you can ask for -mfpmath=387, for -mfpmath=sse or
    for -mfpmath=both; or if you define a variable as "long double", it
    will store an 80-bit FP value, and computations involving this
    variable will be done on the 387.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From dxf@dxforth@gmail.com to comp.lang.forth on Fri Jul 11 22:35:30 2025

    On 11/07/2025 8:22 pm, Anton Ertl wrote:
    mhx@iae.nl (mhx) writes:
    What is there not to like with the FPU? It provides 80 bits, which
    is in itself a useful additional format, and should never have problems
    with single and double-precision edge cases.

    If you want to do double precision, using the 387 stack has the double-rounding problem <https://en.wikipedia.org/wiki/Rounding#Double_rounding>. Even if you
    limit the mantissa to 53 bits, you still get double rounding when you
    deal with numbers that are denormal numbers in binary64
    representation. Java wanted to give the same results, bit for bit, on
    all hardware, and ran afoul of this until they could switch to SSE2.

    The only problem is that some languages and companies find it necessary
    to boycott FPU use.

    The rest of the industry has standardized on binary64 and binary32,
    and they prefer bit-equivalent results for ease of testing. So as
    soon as SSE2 gave that to them, they flocked to SSE2.
    ...

I wonder how much of this is academic or trend-inspired? AFAICS Forth
clients haven't flocked to it, else vendors would have SSE2 offerings at
the same level as their x387 packs.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Sat Jul 12 06:53:22 2025
    From Newsgroup: comp.lang.forth

On 11.07.2025 at 10:33, Anton Ertl wrote:
    In any case, FP numbers are used in very diverse ways. Not everybody
    needs all the features, and even fewer features are consciously
    needed, but that's the usual case with things that are not
custom-tailored for your application.


The strongest application niche for Forth is embedded devices, e.g.
MCUs. The ADCs often used there have typical bit widths of 12 to 24 bits
(e.g. 24 bits for audio). So there are definitely areas of application
for Forth and small floats (after de-/normalization).

In some PLC/DCS systems, float24 is the usable width of a 32-bit word,
with the 8 free bits used as binary companions, e.g. a flag for a
measured value over its limit.
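
A minimal Forth sketch of unpacking such a word; the layout below
(value in the low 24 bits, flags in the high 8 bits, flag bit 0
meaning "over limit") and the names VAL-MASK, plc-value, plc-flags
and over-limit? are assumptions for illustration, since real PLC/DCS
layouts are vendor-specific:

\ Hypothetical layout: low 24 bits = float24 value, high 8 bits = flags.
$00FFFFFF constant VAL-MASK
: plc-value ( x -- u ) VAL-MASK and ;            \ raw 24-bit value, still to be de-/normalized
: plc-flags ( x -- u ) 24 rshift $FF and ;       \ the 8 binary companions
: over-limit? ( x -- f ) plc-flags 1 and 0<> ;   \ assumed: flag bit 0 = over limit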

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Jul 13 09:01:41 2025
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    On 11/07/2025 8:22 pm, Anton Ertl wrote:
    The rest of the industry has standardized on binary64 and binary32,
    and they prefer bit-equivalent results for ease of testing. So as
    soon as SSE2 gave that to them, they flocked to SSE2.
    ...

I wonder how much of this is academic or trend-inspired?

    Is ease of testing an academic concern or a trend?

    AFAICS Forth
clients haven't flocked to it, else vendors would have SSE2 offerings at
    the same level as their x387 packs.

    For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
    the only one with hardware FP for many years, so there probably was
    little pressure from users for bit-identical results with, say, SPARC,
    because they did not have a Forth system that ran on SPARC.

    And when they did their IA-32 systems, SSE2 did not exist, so of
    course they used the 387. Plus, 387 was guaranteed to be available
    with Intel's Pentium and AMD's K5, while SSE2 was only available on
    the Pentium 4 and the Athlon 64; so for many years there was a good
    reason to prefer 387 over SSE2 if you compiled for IA-32. And gcc
generates 387 code to this day if you ask it to produce code for
IA-32. Only with AMD64 was SSE2 guaranteed, and only there does gcc
default to it if you use float or double. And SwiftForth and VFX have
been ported to AMD64 only relatively recently.

    And as long as customers did not ask for bit-identical results to
    those on, say, a Raspi, there was little reason to reimplement FP with
    SSE2. I wonder if the development of the SSE2 package for VFX was
    influenced by the availability of VFX for the Raspi.

    These Forth systems also don't do global register allocation or auto-vectorization, so two other reasons why, e.g., C compilers chose
    to use SSE2 on AMD64 (where SSE2 was guaranteed to be available) don't
    exist for them.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Jul 13 21:28:43 2025
    From Newsgroup: comp.lang.forth

    On 13/07/2025 7:01 pm, Anton Ertl wrote:
    ...
    For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
    the only one with hardware FP for many years, so there probably was
    little pressure from users for bit-identical results with, say, SPARC, because they did not have a Forth system that ran on SPARC.

    What do you mean by "bit-identical results"? Since SSE2 comes without transcendentals (or basics such as FABS and FNEGATE) and implementers
    are expected to supply their own, if anything, I expect results across platforms and compilers to vary.

    ...
    And as long as customers did not ask for bit-identical results to
    those on, say, a Raspi, there was little reason to reimplement FP with
    SSE2. I wonder if the development of the SSE2 package for VFX was
    influenced by the availability of VFX for the Raspi.

    According to the change log it originally began as software floating
    point for embedded systems and circa 2020 was converted to SSE and x64.
    Perhaps Stephen can advise as to the reasons.


    These Forth systems also don't do global register allocation or auto-vectorization, so two other reasons why, e.g., C compilers chose
    to use SSE2 on AMD64 (where SSE2 was guaranteed to be available) don't
    exist for them.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Jul 14 06:04:13 2025
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    On 13/07/2025 7:01 pm, Anton Ertl wrote:
    ...
    For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
    the only one with hardware FP for many years, so there probably was
    little pressure from users for bit-identical results with, say, SPARC,
    because they did not have a Forth system that ran on SPARC.

What do you mean by "bit-identical results"? Since SSE2 comes without
transcendentals (or basics such as FABS and FNEGATE) and implementers
are expected to supply their own, if anything, I expect results across
platforms and compilers to vary.

    There are operations for which IEEE 754 specifies the result to the
    last bit (except that AFAIK the representation of NaNs is not
    specified exactly), among them F+ F- F* F/ FSQRT, probably also
    FNEGATE and FABS. It does not specify the exact result for
    transcendental functions, but if your implementation performs the same bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From peter@peter.noreply@tin.it to comp.lang.forth on Mon Jul 14 09:09:00 2025
    From Newsgroup: comp.lang.forth

    On Mon, 14 Jul 2025 06:04:13 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    dxf <dxforth@gmail.com> writes:
    On 13/07/2025 7:01 pm, Anton Ertl wrote:
    ...
    For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system
    was the only one with hardware FP for many years, so there
    probably was little pressure from users for bit-identical results
    with, say, SPARC, because they did not have a Forth system that
    ran on SPARC.

    What do you mean by "bit-identical results"? Since SSE2 comes
without transcendentals (or basics such as FABS and FNEGATE) and
implementers are expected to supply their own, if anything, I expect
results across platforms and compilers to vary.

    There are operations for which IEEE 754 specifies the result to the
    last bit (except that AFAIK the representation of NaNs is not
    specified exactly), among them F+ F- F* F/ FSQRT, probably also
    FNEGATE and FABS. It does not specify the exact result for
    transcendental functions, but if your implementation performs the same bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    - anton

    This of course excludes the use of libm or other math libraries provided
    by the distribution. They will change between releases.
I have successfully used fdlibm, which is the base for many others. It
gives max 1 ulp rounding error. I have now also tested the core-math
    project https://gitlab.inria.fr/core-math/core-math This gives
    correctly rounded functions at the cost of being 10 times the compiled
    size! A complete library with trig, log, pow etc comes in at 500k.

    Peter

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 14 07:21:45 2025
    From Newsgroup: comp.lang.forth

    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:

    [..] if your implementation performs the same
    bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    When e.g. summing the elements of a DP vector, it is hard to see why
    that couldn't be done on the FPU stack (with 80 bits) before (possibly)
    storing the result to a DP variable in memory. I am not sure that Forth
    users would be able to resist that approach.

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Jul 14 01:24:03 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
So just use the same implementations of transcendental functions, and
    your results will be bit-identical

    Same implementations = same FP operations in the exact same order? That
    seems hard to ensure, if the functions are implemented in a language
    that leaves anything up to a compiler.

Also, in the early implementations x87, 68881, NS320something(?),
transcendentals were included in the coprocessor and the workings
weren't visible. There is a proposal to add this to RISC-V
(https://libre-soc.org/ztrans_proposal/). It looks like there was an
AVX-512 ER subset that also does transcendentals, but it only appeared
    on some Xeon Phi processors now discontinued (per Wikipedia article on
    AVX). No idea about other processors.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Jul 14 07:50:04 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:

    [..] if your implementation performs the same
    bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
    functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    When e.g. summing the elements of a DP vector, it is hard to see why
that couldn't be done on the FPU stack (with 80 bits) before (possibly)
storing the result to a DP variable in memory. I am not sure that Forth
    users would be able to resist that approach.

    The question is: What properties do you want your computation to have?

    1) Bit-identical result to a naively-coded IEEE 754 DP computation?

    2) A more accurate result? How much more accuracy?

    3) More performance?

    If you want 1), there is little alternative to actually performing the operations sequentially, using scalar SSE2 operations.

    If you can live without 1), there's a wide range of options:

    A) Perform the naive summation, but using 80-bit addition. This will
    produce higher accuracy, but limit performance to typically 4
    cycles or so per addition (as does the naive SSE2 approach),
    because the latency of the floating-point addition is 4 cycles or
    so (depending on the actual processor).

    B) Perform vectorized summation using SIMD instructions (e.g.,
    AVX-512), with enough parallel additions (beyond the vector size)
    that either the load unit throughput, the FPU throughput, or the
    instruction issue rate will limit the performance. Reduce the n
    intermediate results to one intermediate result in the end. If I
    give the naive loop to gcc -O3 and allow it to pretend that
    floating-point addition is associative, it produces such a
    computation automatically. The result will typically be a little
    more accurate than the result of 1), because the length of the
addition chains is length(vector)/lanes + ld(lanes) rather than
    length(vector).

    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

b) Using DP addition. This allows using SIMD instructions for
    increased performance (except near the root of the tree), but the
    accuracy is not as good as with 80-bit addition. It is still
    good because the length of the addition chains is only
    ld(length(vector)).

    D) Use Kahan summation (you must not allow the compiler to pretend
    that FP addition is associative, or this will not work) or one of
    its enhancements. This provides a very high accuracy, but (in case
    of the original Kahan summation) requires four FP operations for
    each summand, and each operation depends on the previous one. So
    you get the latency of 4 FP additions per iteration for a version
    that goes across the array sequentially. You can apply
    vectorization to eliminate the effect of these latencies, but you
    will still see the increased resource consumption. If the vector
    resides in a distant cache or in main memory, the memory limit may
    limit performance more than lack of FPU resources, however.

    E) Sort the vector, then start with the element closest to 0. At
    every step, add the element of the sign other than the current
    intermediate sum that is closest to 0. If there is no such element
    left, add the remaining elements in order, starting with the one
    closest to 0. This is pretty accurate and slower than naive
    addition. At the current relative costs of sorting and FP
    operations, Kahan summation probably dominates over this approach.


    So, as you can see, depending on your objectives there may be more
    attractive ways to add a vector than what you suggested. Your
    suggestion actually looks pretty unattractive, except if your
    objectives are "ease of implementation" and "more accuracy than the
    naive approach".

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Jul 14 10:11:57 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
So just use the same implementations of transcendental functions, and
    your results will be bit-identical

    Same implementations = same FP operations in the exact same order?

    Same operations with the same data flow. Independent operations can
    be reordered.

    That
    seems hard to ensure, if the functions are implemented in a language
    that leaves anything up to a compiler.

Even gcc heeds the data flow of FP operations unless you tell it with
-ffast-math that anything goes.

Also, in the early implementations x87, 68881, NS320something(?),
transcendentals were included in the coprocessor and the workings
weren't visible.

The bigger problem with at least the x87 is that you don't always get
bit-identical results even for basic operations such as addition,
thanks to double rounding. So even if you implement transcendentals
yourself based on basic operations, you can see results that are not
bit-identical.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 14 18:13:34 2025
    From Newsgroup: comp.lang.forth

    On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
    [..]
    The question is: What properties do you want your computation to have?
    [..]
    2) A more accurate result? How much more accuracy?

    3) More performance?

    3) + 2). If the result is more accurate, the condition number of
matrices should be better, resulting in fewer LU decomposition
    iterations. However, solving the system matrix normally takes
    less than 20% of the total runtime.

    I've never seen *anybody* worry about the numerical accuracy of
    final simulation results.

    [..]
    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    This looks very interesting. I can find Kahan and Neumaier, but
    "tree addition" didn't turn up (There is a suspicious looking
    reliability paper about the approach which surely is not what
    you meant). Or is it pairwise addition what I should look for?

    So, as you can see, depending on your objectives there may be more
    attractive ways to add a vector than what you suggested. Your
    suggestion actually looks pretty unattractive, except if your
    objectives are "ease of implementation" and "more accuracy than the
    naive approach".

    Sure, "ease of implementation" is high on my list too. Life is too
    short.

    Thank you for your wonderful and very useful suggestions.

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Jul 14 11:31:24 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    This looks very interesting. I can find Kahan and Neumaier, but
    "tree addition" didn't turn up (There is a suspicious looking
    reliability paper about the approach which surely is not what
    you meant). Or is it pairwise addition what I should look for?

I think the idea is to split (say) a 1024-element sum into two
512-element sums that you compute separately, then add the results
together. You do the 512-element sums the same way, recursively.
    Sometimes you can parallelize the computations, and depending on the CPU
    you might be able to use vector or SIMD instructions once the chunks are
    small enough.
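
A minimal recursive sketch of that idea in Forth (an editorial
illustration, not the REC code from Anton's file; base cases for zero
and one elements, binary64 values assumed):

: pairwise-sum ( f-addr u -- ) ( F: -- sum )
  dup 0= if 2drop 0e exit then    \ empty array sums to 0e
  dup 1 = if drop f@ exit then    \ single element
  dup 2/ >r                       \ h = u/2
  over r@ recurse                 \ F: sum of x[0] .. x[h-1]
  swap r@ floats + swap r> -      \ second half: f-addr + h floats, u-h
  recurse f+ ;                    \ add the two partial sums

Note that the FP stack grows to about ld(u) items during the
recursion, which is exactly where the small FP stacks discussed
earlier in this thread start to hurt.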
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Jul 15 15:25:51 2025
    From Newsgroup: comp.lang.forth

    Now riscv is the future.

    I don't know. From what I learned, RISC-V
    is strongly compiler-oriented. They wrote,
    for example, that it lacks any condition codes.
    Only conditional branches are predicated on
    examining the contents of registers at the time
    of the branch. No "add with carry" nor "subtract
    with carry". From an assembly point of view, the
    lack of a carry flag is a PITA if you desire to
    do multi-word mathematical manipulation of numbers.

So it seems that the RISC-V architecture is intended
to be used by compilers generating code from high-level
languages. Therefore I still prefer the "closed"
ARM arch. Besides, it's more ubiquitous and cheaper.

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Wed Jul 16 04:09:09 2025
    From Newsgroup: comp.lang.forth

On 15.07.2025 at 17:25, LIT wrote:
    Now riscv is the future.

    I don't know. From what I learned, RISC-V
    is strongly compiler-oriented. They wrote,
    for example, that it lacks any condition codes.
    Only conditional branches are predicated on
    examining the contents of registers at the time
    of the branch. No "add with carry" nor "subtract
    with carry". From an assembly point of view, the
    lack of a carry flag is a PITA if you desire to
    do multi-word mathematical manipulation of numbers.

So it seems that the RISC-V architecture is intended
to be used by compilers generating code from high-level
languages.

    I read somewhere:
    The standard is now managed by RISC-V International, which
    has more than 3,000 members and which reported that more
    than 10 billion chips containing RISC-V cores had shipped
    by the end of 2022. Many implementations of RISC-V are
    available, both as open-source cores and as commercial
    IP products.

    You call that compiler-oriented???


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Jul 16 15:21:32 2025
    From Newsgroup: comp.lang.forth

    On 16/07/2025 12:09 pm, minforth wrote:
On 15.07.2025 at 17:25, LIT wrote:
    Now riscv is the future.

    I don't know. From what I learned, RISC-V
    is strongly compiler-oriented. They wrote,
    for example, that it lacks any condition codes.
    Only conditional branches are predicated on
    examining the contents of registers at the time
    of the branch. No "add with carry" nor "subtract
    with carry". From an assembly point of view, the
    lack of a carry flag is a PITA if you desire to
    do multi-word mathematical manipulation of numbers.

So it seems that the RISC-V architecture is intended
to be used by compilers generating code from high-level
languages.

    I read somewhere:
    The standard is now managed by RISC-V International, which
    has more than 3,000 members and which reported that more
    than 10 billion chips containing RISC-V cores had shipped
    by the end of 2022. Many implementations of RISC-V are
    available, both as open-source cores and as commercial
    IP products.

    You call that compiler-oriented???

    It depends on how many are being programmed by the likes of GCC.
    When ATMEL hit the market the manufacturer claimed their chips
    were designed with compilers in mind. Do Arduino users program
    in hand-coded assembler? Do you? It's no longer just the chip's
    features and theoretical performance one has to worry about but
    the compilers too.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Wed Jul 16 07:41:26 2025
    From Newsgroup: comp.lang.forth

On 16.07.2025 at 07:21, dxf wrote:
    On 16/07/2025 12:09 pm, minforth wrote:
On 15.07.2025 at 17:25, LIT wrote:
    Now riscv is the future.

    I don't know. From what I learned, RISC-V
    is strongly compiler-oriented. They wrote,
    for example, that it lacks any condition codes.
    Only conditional branches are predicated on
    examining the contents of registers at the time
    of the branch. No "add with carry" nor "subtract
    with carry". From an assembly point of view, the
    lack of a carry flag is a PITA if you desire to
    do multi-word mathematical manipulation of numbers.

So it seems that the RISC-V architecture is intended
to be used by compilers generating code from high-level
languages.

    I read somewhere:
    The standard is now managed by RISC-V International, which
    has more than 3,000 members and which reported that more
    than 10 billion chips containing RISC-V cores had shipped
    by the end of 2022. Many implementations of RISC-V are
    available, both as open-source cores and as commercial
    IP products.

    You call that compiler-oriented???

    It depends on how many are being programmed by the likes of GCC.
    When ATMEL hit the market the manufacturer claimed their chips
    were designed with compilers in mind. Do Arduino users program
    in hand-coded assembler? Do you? It's no longer just the chip's
    features and theoretical performance one has to worry about but
    the compilers too.


    Don't worry, be happy, visit https://riscv.org/

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Jul 16 08:20:04 2025
    From Newsgroup: comp.lang.forth

    Now riscv is the future.

    I don't know. From what I learned, RISC-V
    is strongly compiler-oriented. They wrote,
    for example, that it lacks any condition codes.
    Only conditional branches are predicated on
    examining the contents of registers at the time
    of the branch. No "add with carry" nor "subtract
    with carry". From an assembly point of view, the
    lack of a carry flag is a PITA if you desire to
    do multi-word mathematical manipulation of numbers.

So it seems that the RISC-V architecture is intended
to be used by compilers generating code from high-level
languages.

    I read somewhere:
    The standard is now managed by RISC-V International, which
    has more than 3,000 members and which reported that more
    than 10 billion chips containing RISC-V cores had shipped
    by the end of 2022. Many implementations of RISC-V are
    available, both as open-source cores and as commercial
    IP products.

    You call that compiler-oriented???

I think it doesn't depend on the RISC-V member count,
but rather on the technical specs/abilities of the CPU.
Like the ones I listed, for instance.

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Jul 16 08:25:06 2025
    From Newsgroup: comp.lang.forth

    It depends on how many are being programmed by the likes of GCC.
    When ATMEL hit the market the manufacturer claimed their chips
    were designed with compilers in mind. Do Arduino users program
    in hand-coded assembler? Do you? It's no longer just the chip's
    features and theoretical performance one has to worry about but
    the compilers too.

Regarding features, it's worth mentioning
that ATMELs are actually quite nice to
program in ML, even if they were
designed "with compilers in mind".

But when a CPU is stripped of SBC/ADC and
similar... I don't know.

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 11:25:04 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    This looks very interesting. I can find Kahan and Neumaier, but
    "tree addition" didn't turn up (There is a suspicious looking
    reliability paper about the approach which surely is not what
    you meant). Or is it pairwise addition what I should look for?

    Yes, "tree addition" is not a common term, and Wikipedia calls it
pairwise addition. Except that unlike suggested in
<https://en.wikipedia.org/wiki/Pairwise_summation> I would not switch to
    a sequential approach for small n, for both accuracy and performance.
    In any case the idea is to turn the evaluation tree from a degenerate
    tree into a balanced tree. E.g., if you add up a, b, c, and d, then
    the naive evaluation

    a b f+ c f+ d f+

    has the evaluation tree

  a   b
   \ /
    f+  c
     \  /
      f+  d
       \  /
        f+

    with the three F+ each depending on the previous one, and also
    increasing the rounding errors. If you balance the tree

  a   b   c   d
   \ /     \ /
    f+      f+
      \    /
       \  /
        f+

    corresponding to

    a b f+ c d f+ f+

    the first two f+ can run in parallel (increasing performance), and the
    rounding errors tend to be less.

    So how to implement this for an arbitrary N? We had an extensive
    discussion of a similar problem in the thread on the subject "balanced
    REDUCE: a challenge for the brave", and you can find that discussion
    at <https://comp.lang.forth.narkive.com/GIg9V9HK/balanced-reduce-a-challenge-for-the-brave>

    But I decided to use a recursive approach (recursive-sum, REC) that
    uses the largest 2^k<n as the left child and the rest as the right
    child, and as base cases for the recursion use a straight-line
    balanced-tree evaluation for 2^k with k<=7 (and combine these for n
    that are not 2^k). For systems with tiny FP stacks, I added the
    option to save intermediate results on a software stack in the
    recursive word. Concerning the straight-line code, it turned out that
    the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
    FP stack items); it's not clear to me why; on lxf I can use k=7 (and
    it uses the 387 stack, too).

    I also coded the shift-reduce-sum algorithm (shift-reduce-sum, SR)
    described in <https://en.wikipedia.org/wiki/Pairwise_summation> in
    Forth, because it can make use of Forth's features (such as the FP
    stack) where the C code has to hand-code it. It uses the FP stack
    beyond 8 elements if there are more than 128 elements in the array, so
    it does not work for the benchmark (with 100_000 elements in the
    array) on lxf, sf64, and vfx64. As you will see, this is no loss.

    I also coded the naive, sequential approach (naive-sum, NAI).

    One might argue that the straight-line stuff in REC puts REC at an
advantage, so I also produced an unrolled version of the naive code
(unrolled-sum, UNR) that uses straight-line sequences for adding up to
    2^7 elements to the intermediate result.

    You can find a file containing all these versions, compatibility
    configurations for various Forth systems, and testing and benchmarking
    code and data, on

    https://www.complang.tuwien.ac.at/forth/programs/pairwise-sum.4th

    I did not do any accuracy measurements, but I did performance
    measurements on a Ryzen 5800X:

cycles:u
   gforth-fast         iforth            lxf     SwiftForth            VFX
 3_057_979_501  6_482_017_334  6_087_130_593  6_021_777_424  6_034_560_441  NAI
 6_601_284_920  6_452_716_125  7_001_806_497  6_606_674_147  6_713_703_069  UNR
 3_787_327_724  2_949_273_264  1_641_710_689  7_437_654_901  1_298_257_315  REC
 9_150_679_812 14_634_786_781                                               SR

cycles:u
   gforth-fast         iforth            lxf     SwiftForth            VFX
13_113_842_702  6_264_132_870  9_011_308_923 11_011_828_048  8_072_637_768  NAI
 6_802_702_884  2_553_418_501  4_238_099_417 11_277_658_203  3_244_590_981  UNR
 9_370_432_755  4_489_562_792  4_955_679_285 12_283_918_226  3_915_367_813  REC
51_113_853_111 29_264_267_850                                               SR

    The versions used are:
    Gforth 0.7.9_20250625
    iForth 5.1-mini
    lxf 1.7-172-983
    SwiftForth x64-Linux 4.0.0-RC89
    VFX Forth 64 5.43 [build 0199] 2023-11-09

    The ":u" means that I measured what happened at the user-level, not at
    the kernel-level.

    Each benchmark run performs 1G f@ and f+ operations, and the naive
    approach performs 1G iterations of the loop.

    The NAIve and UNRolled results show that performance in both is
    limited by the latency of the F+: 3 cycles for the DP SSE2 operation
    in Gforth-fast, 6 cycles for the 80-bit 387 fadd on the other systems.
    It's unclear to me why UNR is much slower on gforth-fast compared to
    NAI.

    The RECursive balanced-tree sum is faster on iForth, lxf and VFX than
    the NAIve and UNRolled versions. It is slower on Gforth: My guess is
    that, despite all hardware advances, the lack of multi-state stack
    caching in Gforth means that the hardware of the Ryzen 5800X does not
    just see the real data flow, but a lot of additional dependences; or
    it may be related to whatever causes the slowdown for UNRolled.

The SR (shift-reduce) sum looks cute, but performs so many additional
instructions, even on iForth, that it is uncompetitive. It's unclear
    to me what slows it down so much on iForth, however.

    I expect that vectorized implementations using AVX will be several
    times faster than the fastest scalar stuff we see here.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 15:39:26 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I did not do any accuracy measurements, but I did performance
    measurements on a Ryzen 5800X:

cycles:u
   gforth-fast         iforth            lxf     SwiftForth            VFX
 3_057_979_501  6_482_017_334  6_087_130_593  6_021_777_424  6_034_560_441  NAI
 6_601_284_920  6_452_716_125  7_001_806_497  6_606_674_147  6_713_703_069  UNR
 3_787_327_724  2_949_273_264  1_641_710_689  7_437_654_901  1_298_257_315  REC
 9_150_679_812 14_634_786_781                                               SR

    cycles:u

    This second table is about instructions:u

   gforth-fast         iforth            lxf     SwiftForth            VFX
13_113_842_702  6_264_132_870  9_011_308_923 11_011_828_048  8_072_637_768  NAI
 6_802_702_884  2_553_418_501  4_238_099_417 11_277_658_203  3_244_590_981  UNR
 9_370_432_755  4_489_562_792  4_955_679_285 12_283_918_226  3_915_367_813  REC
51_113_853_111 29_264_267_850                                               SR

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Wed Jul 16 18:15:08 2025
    From Newsgroup: comp.lang.forth

On 16.07.2025 at 13:25, Anton Ertl wrote:
    I did not do any accuracy measurements, but I did performance
    measurements
    YMMV but "fast but wrong" would not be my goal. ;-)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 16:02:41 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    But I decided to use a recursive approach (recursive-sum, REC) that
    uses the largest 2^k<n as the left child and the rest as the right
    child, and as base cases for the recursion use a straight-line
    balanced-tree evaluation for 2^k with k<=7 (and combine these for n
    that are not 2^k). For systems with tiny FP stacks, I added the
    option to save intermediate results on a software stack in the
    recursive word. Concerning the straight-line code, it turned out that
    the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
    FP stack items); it's not clear to me why; on lxf I can use k=7 (and
    it uses the 387 stack, too).

    Actually, after writing that, I found out the reasons for the FP stack overflows, and in the published versions and the results I use k=7 on
    all systems. It's really easy to leave an FP stack item on the FP
    stack while calling another word, and that's not so good if you do it
while calling sum128 :-).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 16:23:03 2025
    From Newsgroup: comp.lang.forth

    minforth <minforth@gmx.net> writes:
On 16.07.2025 at 13:25, Anton Ertl wrote:
    I did not do any accuracy measurements, but I did performance
    measurements
    YMMV but "fast but wrong" would not be my goal. ;-)

    I did test correctness with cases where roundoff errors do not play a
    role.

    As mentioned, the RECursive balanced-tree sum (which is also the
    fastest on several systems and absolutely) is expected to be more
    accurate in those cases where roundoff errors do play a role. But if
    you care about that, better design a test and test it yourself. It
    will be interesting to see how you find out which result is more
    accurate when they differ.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Wed Jul 16 19:17:16 2025
    From Newsgroup: comp.lang.forth

On 16.07.2025 at 18:23, Anton Ertl wrote:
    minforth <minforth@gmx.net> writes:
On 16.07.2025 at 13:25, Anton Ertl wrote:
    I did not do any accuracy measurements, but I did performance
    measurements
    YMMV but "fast but wrong" would not be my goal. ;-)

    I did test correctness with cases where roundoff errors do not play a
    role.

    As mentioned, the RECursive balanced-tree sum (which is also the
    fastest on several systems and absolutely) is expected to be more
    accurate in those cases where roundoff errors do play a role. But if
    you care about that, better design a test and test it yourself. It
    will be interesting to see how you find out which result is more
    accurate when they differ.

    Meanwhile many years ago, comparative tests were carried out with a
    couple of representative archived serial data (~50k samples) by
    using a Java 128-bit quadruple fp-math class to perform summations
    and calculate dot-product results.

    The results were compared with those of naive linear summation and multiplication and pairwise divide&conquer summation at different
    rounding modes, for float32 and float64. Ultimately, Kahan summation
    was the winner. It is slow, but there were no in-the-loop
    requirements, so for a background task, Kahan was fast enough.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Jul 16 21:12:13 2025
    From Newsgroup: comp.lang.forth

    Well, that is strange ...

    Results with the current iForth are quite different:

    FORTH> bench ( see file quoted above + usual iForth timing words )
    \ 7963 times
    \ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
    \ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
    \ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
    \ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok

    So here recursive-sum is by far the fastest, and shift-reduce-sum
    is not horribly slow. The slowdown in srs is because the 2nd loop
    is using the external stack.

    -marcel

    PS: Because of recent user requests a development snapshot was
    made available at the usual place.

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Jul 17 15:55:41 2025
    From Newsgroup: comp.lang.forth

    On 16/07/2025 6:25 pm, LIT wrote:
    It depends on how many are being programmed by the likes of GCC.
    When ATMEL hit the market the manufacturer claimed their chips
    were designed with compilers in mind.  Do Arduino users program
    in hand-coded assembler?  Do you?  It's no longer just the chip's
    features and theoretical performance one has to worry about but
    the compilers too.

Regarding features, it's worth mentioning
that ATMELs are actually quite nice to
program in ML, even if they were
designed "with compilers in mind".
    ...

    Reminds me of the 6502 for some reason. But it's the 'skip next
    instruction on bit in register' that throws me. Not to mention
    companies that release chips that don't do what the spec says.
    Their solution? Amend the documentation to exclude that feature!

    Didn't get that in the good old days as products were expected to
    have a reasonable lifetime. Today CPU designs are as 'throw away'
    as everything else. No reason to believe RISC-V will be different.
    Only thing distinguishing it are the years of hype and promise.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From peter@peter.noreply@tin.it to comp.lang.forth on Thu Jul 17 10:14:00 2025
    From Newsgroup: comp.lang.forth

    On Wed, 16 Jul 2025 15:39:26 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I did not do any accuracy measurements, but I did performance
    measurements on a Ryzen 5800X:

cycles:u
   gforth-fast         iforth            lxf     SwiftForth            VFX
 3_057_979_501  6_482_017_334  6_087_130_593  6_021_777_424  6_034_560_441  NAI
 6_601_284_920  6_452_716_125  7_001_806_497  6_606_674_147  6_713_703_069  UNR
 3_787_327_724  2_949_273_264  1_641_710_689  7_437_654_901  1_298_257_315  REC
 9_150_679_812 14_634_786_781                                               SR

    cycles:u

    This second table is about instructions:u

   gforth-fast         iforth            lxf     SwiftForth            VFX
13_113_842_702  6_264_132_870  9_011_308_923 11_011_828_048  8_072_637_768  NAI
 6_802_702_884  2_553_418_501  4_238_099_417 11_277_658_203  3_244_590_981  UNR
 9_370_432_755  4_489_562_792  4_955_679_285 12_283_918_226  3_915_367_813  REC
51_113_853_111 29_264_267_850                                               SR

    - anton
I have run this test now on my Ryzen 9950X for lxf, lxf64 and a snapshot of gforth.
Here are the results:
    Ryzen 9950X
    lxf64
    5,010,566,495 NAI cycles:u
    2,011,359,782 UNR cycles:u
    646,926,001 REC cycles:u
    3,589,863,082 SR cycles:u
    lxf64
    7,019,247,519 NAI instructions:u
    4,128,689,843 UNR instructions:u
    4,643,499,656 REC instructions:u
    25,019,182,759 SR instructions:u
    gforth-fast 20250219
    2,048,316,578 NAI cycles:u
    7,157,520,448 UNR cycles:u
    3,589,638,677 REC cycles:u
    17,199,889,916 SR cycles:u
    gforth-fast 20250219
    13,107,999,739 NAI instructions:u
    6,789,041,049 UNR instructions:u
    9,348,969,966 REC instructions:u
    50,108,032,223 SR instructions:u
    lxf
    6,005,617,374 NAI cycles:u
    6,004,157,635 UNR cycles:u
    1,303,627,835 REC cycles:u
    9,187,422,499 SR cycles:u
    lxf
    9,010,888,196 NAI instructions:u
    4,237,679,129 UNR instructions:u
    4,955,258,040 REC instructions:u
    26,018,680,499 SR instructions:u
    Doing the milliseconds timing gives
    lxf64 native code
    timer-reset ' naive-sum bench .elapsed 889 ms elapsed ok
    timer-reset ' unrolled-sum bench .elapsed 360 ms elapsed ok
    timer-reset ' recursive-sum bench .elapsed 114 ms elapsed ok
    timer-reset ' shift-reduce-sum bench .elapsed 647 ms elapsed ok
    lxf64 token code
    timer-reset ' naive-sum bench .elapsed 2´284 ms elapsed ok
    timer-reset ' unrolled-sum bench .elapsed 2´723 ms elapsed ok
    timer-reset ' recursive-sum bench .elapsed 3´474 ms elapsed ok
    timer-reset ' shift-reduce-sum bench .elapsed 6´842 ms elapsed ok
    lxf
timer-reset ' naive-sum bench .elapsed 1073 milli-seconds ok
timer-reset ' unrolled-sum bench .elapsed 1103 milli-seconds ok
timer-reset ' recursive-sum bench .elapsed 234 milli-seconds ok
timer-reset ' shift-reduce-sum bench .elapsed 1632 milli-seconds ok
It is interesting to note how the "best algorithm" changes depending
on the underlying system implementation.
lxf uses the x87 builtin fp stack, lxf64 uses sse4 and a large fp stack.
Thanks for these tests, they uncovered a problem with the lxf64 code
generator. It could only handle 114 immediate values in a basic block!
Both sum128 and nsum128 compile gigantic functions of over 2k of compiled code.
Best Regards
Peter
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Jul 17 09:35:25 2025
    From Newsgroup: comp.lang.forth

    Reminds me of the 6502 for some reason. But it's the 'skip next
    instruction on bit in register' that throws me.

    Nothing too unusual. It's actually just an abbreviation
    for something like, for example:

    CMP AX, BX
    JZ SHORT skip
    CALL something
    skip: ...

So instead of separate CMP and JZ we've got
"CMP?JZ" as a single instruction. If not for the
variable instruction size of x86, one could
devise a macro. On second thought: it will
probably be possible to devise such a macro in
A86, because its macro facility treats macro
parameters character-wise. So a macro like
"CMP?JZ reg1,reg2 next_instruction" should
be possible (I'll try that later).

    PIC features similar instructions (INCFSZ/DECFSZ).
    PIC is actually more 6502-like, with its spartan
    instruction set (when compared to ATMEL).

    Didn't get that in the good old days as products were expected to
    have a reasonable lifetime. Today CPU designs are as 'throw away'
    as everything else. No reason to believe RISC-V will be different.
    Only thing distinguishing it are the years of hype and promise.

    Well, at least x86 and ARM seem to be more 'persistent'.
    Actually they already proved to be.

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 17 12:41:45 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    Well, that is strange ...

    Results with the current iForth are quite different:

    FORTH> bench ( see file quoted above + usual iForth timing words )
    \ 7963 times
    \ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
    \ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
    \ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
    \ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok

    Assuming that you were also using a Ryzen 5800X (or other Zen3-based
    CPU), running at 4.8GHz, accounting for the different number of
    iteratons, and that basically all the elapsed time is due to user
    cycles of our benchmark, I defined:

    : scale s>f 4.8e9 f/ 10000e f/ 7963e f* ;

    The output should be the approximate number of seconds. Here's what I
    get from the cycle:u numbers for iForth 5.1-mini given in the earlier
    postings:

    \ ------------ input ---------- | output
    6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
    6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
    2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
    14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok

    The resulting numbers are not very different from those you show. My measurements include iForth's startup overhead, which may be one
    explanation why they are a little higher.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 17 12:54:29 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    Ryzen 9950X

    lxf64
    5,010,566,495 NAI cycles:u
    2,011,359,782 UNR cycles:u
    646,926,001 REC cycles:u
    3,589,863,082 SR cycles:u

lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u


    gforth-fast 20250219
    2,048,316,578 NAI cycles:u
    7,157,520,448 UNR cycles:u
    3,589,638,677 REC cycles:u
    17,199,889,916 SR cycles:u

    gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u

    lxf
    6,005,617,374 NAI cycles:u
    6,004,157,635 UNR cycles:u
    1,303,627,835 REC cycles:u
    9,187,422,499 SR cycles:u

    lxf
    9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
    26,018,680,499 SR instructions:u

lxf uses the x87 builtin fp stack, lxf64 uses sse4 and a large fp stack

    Apparently the latency of ADDSD (SSE2) is down to 2 cycles on Zen5
    (visible in lxf64 UNR and gforth-fast NAI) while the latency of FADD
    (387) is still 6 cycles (lxf NAI and UNR). I have no explanation why
    on lxf64 NAI performs so much worse than UNR, and in gforth-fast UNR
    so much worse than NAI.

    For REC the latency should not play a role. There lxf64 performs at
    7.2IPC and 1.55 F+/cycle, whereas lxf performs only at 3.8IPC and 0.77 F+/cycle. My guess is that FADD can only be performed by one FPU, and
    that's connected to one dispatch port, and other instructions also
    need or are at least assigned to this dispatch port.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 17 13:56:36 2025
    From Newsgroup: comp.lang.forth

    minforth <minforth@gmx.net> writes:
    Meanwhile many years ago, comparative tests were carried out with a
    couple of representative archived serial data (~50k samples)

    Representative of what? Serial: what series?

    Anyway, since I don't have these data, I won't repeat this experiment
    with the routines I have written.

    Ultimately, Kahan summation
    was the winner. It is slow, but there were no in-the-loop
    requirements, so for a background task, Kahan was fast enough.

    I wanted to see how slow, so I added KAHAN-SUM to

    https://www.complang.tuwien.ac.at/forth/programs/pairwise-sum.4th

    and on the Ryzen 5800X I got (data for the other routines from the
    earlier posting):

cycles:u
   gforth-fast         iforth            lxf     SwiftForth            VFX
 3_057_979_501  6_482_017_334  6_087_130_593  6_021_777_424  6_034_560_441  NAI
 6_601_284_920  6_452_716_125  7_001_806_497  6_606_674_147  6_713_703_069  UNR
 3_787_327_724  2_949_273_264  1_641_710_689  7_437_654_901  1_298_257_315  REC
 9_150_679_812 14_634_786_781                                               SR
57_819_112_550 28_621_991_440 28_431_247_791 28_409_857_650 28_462_276_524  KAH

instructions:u
   gforth-fast         iforth            lxf     SwiftForth            VFX
13_113_842_702  6_264_132_870  9_011_308_923 11_011_828_048  8_072_637_768  NAI
 6_802_702_884  2_553_418_501  4_238_099_417 11_277_658_203  3_244_590_981  UNR
 9_370_432_755  4_489_562_792  4_955_679_285 12_283_918_226  3_915_367_813  REC
51_113_853_111 29_264_267_850                                               SR
54_114_197_272 18_264_494_804 21_011_621_955 27_012_178_800 20_072_845_336  KAH

    The versions used are still:
    Gforth 0.7.9_20250625
    iForth 5.1-mini
    lxf 1.7-172-983
    SwiftForth x64-Linux 4.0.0-RC89
    VFX Forth 64 5.43 [build 0199] 2023-11-09

KAHan-sum is more than 20 times slower than REC on VFX64. The
    particular slowness of gforth-fast is probably due to the weaknesses
    of FP stack caching in Gforth.

    One can do something like Kahan summation also for pairwise addition.
    The base step (half of the additions) becomes simpler (no compensation
    in any input), but more complicated in the inner additions (one
    compensation each). The main benefit would be that several additions
    can be done in parallel, and the expected error is even smaller.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Thu Jul 17 18:02:56 2025
    From Newsgroup: comp.lang.forth

On 17.07.2025 at 15:56, Anton Ertl wrote:
    minforth <minforth@gmx.net> writes:
    Meanwhile many years ago, comparative tests were carried out with a
    couple of representative archived serial data (~50k samples)

    Representative of what? Serial: what series?

    Measured process signals and machine vibrations.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From peter@peter.noreply@tin.it to comp.lang.forth on Thu Jul 17 22:48:25 2025
    From Newsgroup: comp.lang.forth

    On Thu, 17 Jul 2025 12:54:29 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    Ryzen 9950X

    lxf64
    5,010,566,495 NAI cycles:u
    2,011,359,782 UNR cycles:u
    646,926,001 REC cycles:u
    3,589,863,082 SR cycles:u

lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u


    gforth-fast 20250219
    2,048,316,578 NAI cycles:u
    7,157,520,448 UNR cycles:u
    3,589,638,677 REC cycles:u
    17,199,889,916 SR cycles:u

    gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u

    lxf
    6,005,617,374 NAI cycles:u
    6,004,157,635 UNR cycles:u
    1,303,627,835 REC cycles:u
    9,187,422,499 SR cycles:u

    lxf
    9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
    26,018,680,499 SR instructions:u

lxf uses the x87 builtin fp stack, lxf64 uses sse4 and a large fp stack

    Apparently the latency of ADDSD (SSE2) is down to 2 cycles on Zen5
    (visible in lxf64 UNR and gforth-fast NAI) while the latency of FADD
    (387) is still 6 cycles (lxf NAI and UNR). I have no explanation why
    on lxf64 NAI performs so much worse than UNR, and in gforth-fast UNR
    so much worse than NAI.

    For REC the latency should not play a role. There lxf64 performs at
    7.2 IPC and 1.55 F+/cycle, whereas lxf performs only at 3.8 IPC and
    0.77 F+/cycle. My guess is that FADD can only be performed by one FPU,
    which is connected to one dispatch port, and that other instructions
    also need, or are at least assigned to, this dispatch port.

    - anton

    I did a test coding the sum128 as a code word with AVX-512 instructions
    and got the following results:

    285,584,376 cycles:u
    941,856,077 instructions:u

    timing was
    timer-reset ' recursive-sum bench .elapsed 51 ms elapsed

    so half the time of the original recursive version.
    With 32 zmm registers I could have done a sum256 as well.

    The code is below for reference:
    r13  is the fp stack pointer
    rbx  top of stack
    xmm0 top of fp stack

    code asum128

    ; push the old FP top-of-stack (xmm0) onto the memory FP stack
    movsd [r13-0x8], xmm0
    lea r13, [r13-0x8]

    ; load 128 contiguous doubles (16 zmm registers x 8 doubles each)
    vmovapd zmm0, [rbx]
    vmovapd zmm1, [rbx+64]
    vmovapd zmm2, [rbx+128]
    vmovapd zmm3, [rbx+192]
    vmovapd zmm4, [rbx+256]
    vmovapd zmm5, [rbx+320]
    vmovapd zmm6, [rbx+384]
    vmovapd zmm7, [rbx+448]
    vmovapd zmm8, [rbx+512]
    vmovapd zmm9, [rbx+576]
    vmovapd zmm10, [rbx+640]
    vmovapd zmm11, [rbx+704]
    vmovapd zmm12, [rbx+768]
    vmovapd zmm13, [rbx+832]
    vmovapd zmm14, [rbx+896]
    vmovapd zmm15, [rbx+960]

    ; pairwise reduction tree: 16 -> 8 vectors
    vaddpd zmm0, zmm0, zmm1
    vaddpd zmm2, zmm2, zmm3
    vaddpd zmm4, zmm4, zmm5
    vaddpd zmm6, zmm6, zmm7
    vaddpd zmm8, zmm8, zmm9
    vaddpd zmm10, zmm10, zmm11
    vaddpd zmm12, zmm12, zmm13
    vaddpd zmm14, zmm14, zmm15

    ; 8 -> 4 vectors
    vaddpd zmm0, zmm0, zmm2
    vaddpd zmm4, zmm4, zmm6
    vaddpd zmm8, zmm8, zmm10
    vaddpd zmm12, zmm12, zmm14

    ; 4 -> 2 vectors
    vaddpd zmm0, zmm0, zmm4
    vaddpd zmm8, zmm8, zmm12

    ; 2 -> 1 vector
    vaddpd zmm0, zmm0, zmm8

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    ret
    end-code

    lxf64 uses a modified fasm as the backend assembler, so there is full
    support for all instructions.

    BR
    Peter


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Fri Jul 18 05:25:21 2025
    From Newsgroup: comp.lang.forth

    On Thu, 17 Jul 2025 12:41:45 +0000, Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    Well, that is strange ...
    [..]
    The output should be the approximate number of seconds. Here's what I
    get from the cycle:u numbers for iForth 5.1-mini given in the earlier postings:

    \ ------------ input ---------- | output
    6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
    6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
    2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
    14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok

    The resulting numbers are not very different from those you show. My measurements include iForth's startup overhead, which may be one
    explanation why they are a little higher.

    You are right, of course. I was confused by the original posting's
    second table (which showed #instructions but was labeled #cycles).

    ( For the record, I used #7963 iterations of the code to get
    approximately 1 second runtime for the naive implementation. )

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Fri Jul 18 17:44:28 2025
    From Newsgroup: comp.lang.forth

    On 14/07/2025 4:04 pm, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    On 13/07/2025 7:01 pm, Anton Ertl wrote:
    ...
    For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
    the only one with hardware FP for many years, so there probably was
    little pressure from users for bit-identical results with, say, SPARC,
    because they did not have a Forth system that ran on SPARC.

    What do you mean by "bit-identical results"? Since SSE2 comes without
    transcendentals (or basics such as FABS and FNEGATE) and implementers
    are expected to supply their own, if anything, I expect results across
    platforms and compilers to vary.

    There are operations for which IEEE 754 specifies the result to the
    last bit (except that AFAIK the representation of NaNs is not
    specified exactly), among them F+ F- F* F/ FSQRT, probably also
    FNEGATE and FABS. It does not specify the exact result for
    transcendental functions, but if your implementation performs the same bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
    is a number). So just use the same implementations of transcendental functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    So in mandating bit-identical results, not only in calculations but also
    input/output, IEEE 754 is all about giving the illusion of truth in
    floating-point when, if anything, it should be warning users not to be
    fooled.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 18 15:34:05 2025
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    So in mandating bit-identical results, not only in calculations but also
    input/output

    I don't think that IEEE 754 specifies I/O, but I could be wrong.

    IEEE 754 is all about giving the illusion of truth in
    floating-point when, if anything, it should be warning users not to be
    fooled.

    I don't think that IEEE 754 mentions truth. It does, however, specify
    the inexact "exception" (actually a flag), which allows you to find
    out if the results of the computations are exact or if some rounding
    was involved.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 10:18:15 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    I did a test coding the sum128 as a code word with avx-512 instructions
    and got the following results

    285,584,376 cycles:u
    941,856,077 instructions:u

    timing was
    timer-reset ' recursive-sum bench .elapsed 51 ms elapsed

    so half the time of the original recursive.
    with 32 zmm registers I could have done a sum256 also

    One could do sum128 with just 8 registers by performing the adds ASAP,
    i.e., for sum32

    vmovapd zmm0, [rbx]
    vmovapd zmm1, [rbx+64]
    vaddpd zmm0, zmm0, zmm1
    vmovapd zmm1, [rbx+128]
    vmovapd zmm2, [rbx+192]
    vaddpd zmm1, zmm1, zmm2
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    And you can code this as:

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    Instead of doing the horizontal sum once for every sum128, it might be
    more efficient (assuming the whole thing is not
    cache-bandwidth-limited) to have the result of sum128 be a full SIMD
    width, and then add them up with vaddpd instead of addsd, and do the
    horizontal sum once in the end.

    But if the recursive part is to be programmed in Forth, we would need
    a way to represent a SIMD width of data in Forth, maybe with a SIMD
    stack. I see a few problems there:

    * What to do about the mask registers of AVX-512? In the RISC-V
    vector extension masks are stored in regular SIMD registers.

    * There is a trend visible in ARM SVE and the RISC-V Vector extension
    to have support for dealing with loops across longer vectors. Do we
    also need to support something like that?

    For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.

    An advantage of having a (short) SIMD stack would be that one could
    use SIMD operations for other uses where the long-vector wordset looks
    too heavy-weight (or would need optimizations to get rid of the
    long-vector overhead). The question is if enough such uses exist to
    justify adding such a stack.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Sat Jul 19 13:53:20 2025
    From Newsgroup: comp.lang.forth

    On 19.07.2025 at 12:18, Anton Ertl wrote:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From peter@peter.noreply@tin.it to comp.lang.forth on Sat Jul 19 15:24:48 2025
    From Newsgroup: comp.lang.forth

    On Sat, 19 Jul 2025 10:18:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    I did a test coding the sum128 as a code word with avx-512 instructions
    and got the following results

    285,584,376 cycles:u
    941,856,077 instructions:u

    timing was
    timer-reset ' recursive-sum bench .elapsed 51 ms elapsed

    so half the time of the original recursive.
    with 32 zmm registers I could have done a sum256 also

    One could do sum128 with just 8 registers by performing the adds ASAP,
    i.e., for sum32

    vmovapd zmm0, [rbx]
    vmovapd zmm1, [rbx+64]
    vaddpd zmm0, zmm0, zmm1
    vmovapd zmm1, [rbx+128]
    vmovapd zmm2, [rbx+192]
    vaddpd zmm1, zmm1, zmm2
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    And you can code this as:

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    The SIMD instructions also take a memory operand, so I can do sum128
    as:

    code asum128b

    ; push the old FP top-of-stack (xmm0) onto the memory FP stack
    movsd [r13-0x8], xmm0
    lea r13, [r13-0x8]

    ; serial accumulation: zmm0 += each 64-byte block (memory operands)
    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vaddpd zmm0, zmm0, [rbx+128]
    vaddpd zmm0, zmm0, [rbx+192]
    vaddpd zmm0, zmm0, [rbx+256]
    vaddpd zmm0, zmm0, [rbx+320]
    vaddpd zmm0, zmm0, [rbx+384]
    vaddpd zmm0, zmm0, [rbx+448]
    vaddpd zmm0, zmm0, [rbx+512]
    vaddpd zmm0, zmm0, [rbx+576]
    vaddpd zmm0, zmm0, [rbx+640]
    vaddpd zmm0, zmm0, [rbx+704]
    vaddpd zmm0, zmm0, [rbx+768]
    vaddpd zmm0, zmm0, [rbx+832]
    vaddpd zmm0, zmm0, [rbx+896]
    vaddpd zmm0, zmm0, [rbx+960]

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    ; swap the two halves of xmm4 and add them for the scalar result
    vpermilpd xmm5, xmm4, 1
    vaddsd xmm0, xmm4, xmm5

    ret
    end-code

    This compiles to 154 bytes and 25 instructions.
    The original sum128 is 2157 bytes and 513 instructions!

    Yes, the horizontal sum should just be done once.
    I have only replaced sum128 with SIMD as a test.
    Later I will do a complete example.

    This asum128b does not change the timing but reduces
    the number of instructions:

    277,333,790 cycles:u
    834,846,183 instructions:u # 3.01 insn per cycle



    Instead of doing the horizontal sum once for every sum128, it might be
    more efficient (assuming the whole thing is not
    cache-bandwidth-limited) to have the result of sum128 be a full SIMD
    width, and then add them up with vaddpd instead of addsd, and do the horizontal sum once in the end.

    But if the recursive part is to be programmed in Forth, we would need
    a way to represent a SIMD width of data in Forth, maybe with a SIMD
    stack. I see a few problems there:

    * What to do about the mask registers of AVX-512? In the RISC-V
    vector extension masks are stored in regular SIMD registers.

    * There is a trend visible in ARM SVE and the RISC-V Vector extension
    to have support for dealing with loops across longer vectors. Do we
    also need to support something like that?

    For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.

    An advantage of having a (short) SIMD stack would be that one could
    use SIMD operations for other uses where the long-vector wordset looks
    too heavy-weight (or would need optimizations to get rid of the
    long-vector overhead). The question is if enough such uses exist to
    justify adding such a stack.

    - anton

    I will take a look at your vector implementation and see if it can be used
    in lxf64

    BR
    Peter

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 14:39:42 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    On Sat, 19 Jul 2025 10:18:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [sum32]
    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    The SIMD instructions also take a memory operand, so I can do sum128
    as:

    code asum128b

    movsd [r13-0x8], xmm0
    lea r13, [r13-0x8]

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vaddpd zmm0, zmm0, [rbx+128]
    vaddpd zmm0, zmm0, [rbx+192]
    vaddpd zmm0, zmm0, [rbx+256]
    vaddpd zmm0, zmm0, [rbx+320]
    vaddpd zmm0, zmm0, [rbx+384]
    vaddpd zmm0, zmm0, [rbx+448]
    vaddpd zmm0, zmm0, [rbx+512]
    vaddpd zmm0, zmm0, [rbx+576]
    vaddpd zmm0, zmm0, [rbx+640]
    vaddpd zmm0, zmm0, [rbx+704]
    vaddpd zmm0, zmm0, [rbx+768]
    vaddpd zmm0, zmm0, [rbx+832]
    vaddpd zmm0, zmm0, [rbx+896]
    vaddpd zmm0, zmm0, [rbx+960]

    Yes, but that's not pairwise addition, so for these 16 adds you get
    worse average accuracy; if the CPU has limited OoO buffering (maybe
    one of the Xeon Phis, but not anything modern that has AVX or
    AVX-512), you may also see some of the addition latency. You still
    get pairwise addition and its accuracy benefit for the horizontal sum
    and the recursive parts.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 14:51:00 2025
    From Newsgroup: comp.lang.forth

    minforth <minforth@gmx.net> writes:
    On 19.07.2025 at 12:18, Anton Ertl wrote:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Everyone who does dot-products.

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    If I add a vector-sum word

    df+red ( dfv -- r )
    \ r is the sum of the elements of dfv

    to the vector wordset, then the dot-product is:

    : dot-product ( dfv1 dfv2 -- r )
    df*v df+red ;

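    For comparison, the scalar loop that df*v df+red would replace might
    look like this minimal sketch (DOT-SCALAR is a hypothetical name; the
    inputs are two arrays of u FLOATs, summed left to right, i.e., without
    the accuracy benefit of pairwise addition):

    : dot-scalar ( addr1 addr2 u -- ) ( F: -- r )
      0e
      0 ?do
        over i floats + f@     \ F: sum x1[i]
        dup  i floats + f@     \ F: sum x1[i] x2[i]
        f* f+
      loop
      2drop ;
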
    Concerning matrix multiplication, while you can use the dot-product
    for it, there are many other ways to do it, and some are more
    efficient (although, admittedly, I have not used pairwise addition for
    these ways).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Jul 20 13:16:17 2025
    From Newsgroup: comp.lang.forth

    On 19/07/2025 1:34 am, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    So in mandating bit-identical results, not only in calculations but also
    input/output

    I don't think that IEEE 754 specifies I/O, but I could be wrong.

    They specify converting to/from external representation (aka ASCII).
    If the hardware does it for me - fine - but as an fp implementer no
    way am I going to jump through hoops for IEEE.

    IEEE 754 is all about giving the illusion of truth in
    floating-point when, if anything, they should be warning users don't be
    fooled.

    I don't think that IEEE 754 mentions truth. It does, however, specify
    the inexact "exception" (actually a flag), which allows you to find
    out if the results of the computations are exact or if some rounding
    was involved.

    AFAICS IEEE 754 offers nothing particularly useful for the end-user.
    Either one's fp application works - or it doesn't. IEEE hasn't changed
    that. IEEE's relevance is that it spurred Intel into making an FPU
    which in turn made implementing fp easy. Had Intel not integrated their
    FPU into the CPU, effectively reducing the cost to the end-user to zero,
    IEEE would have remained a novelty.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Sun Jul 20 06:27:53 2025
    From Newsgroup: comp.lang.forth

    On 19.07.2025 at 16:51, Anton Ertl wrote:
    minforth <minforth@gmx.net> writes:
    On 19.07.2025 at 12:18, Anton Ertl wrote:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Everyone who does dot-products.

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    If I add a vector-sum word

    df+red ( dfv -- r )
    \ r is the sum of the elements of dfv

    to the vector wordset, then the dot-product is:

    : dot-product ( dfv1 dfv2 -- r )
    df*v df+red ;

    Sure, slow hand is not just for guitar players.
    With FMA, one could traverse the vectors in one go.

    https://docs.nvidia.com/cuda/floating-point/index.html

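    A one-pass FMA traversal might look like the following sketch, assuming
    a hypothetical (non-standard) word F*+ ( F: r1 r2 r3 -- r ) that
    computes r1*r2+r3 with a single rounding:

    : dot-fma ( addr1 addr2 u -- ) ( F: -- r )
      0e                       \ accumulator
      0 ?do
        over i floats + f@     \ F: acc x1[i]
        dup  i floats + f@     \ F: acc x1[i] x2[i]
        frot f*+               \ F: x1[i]*x2[i]+acc, one rounding
      loop
      2drop ;
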
    Concerning matrix multiplication, while you can use the dot-product
    for it, there are many other ways to do it, and some are more
    efficient (although, admittedly, I have not used pairwise addition for
    these ways).

    There are tons of algorithms depending on various matrix properties.

    Then, given a desktop and a fat CPU, LAPACK et al. are your friends.
    Embedded or special CPUs are a different story.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Jul 21 13:28:11 2025
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    AFAICS IEEE 754 offers nothing particularly useful for the end-user.
    Either one's fp application works - or it doesn't. IEEE hasn't
    changed that.

    The purpose of IEEE FP was to improve the numerical accuracy of
    applications that used it as opposed to other formats.

    IEEE's relevance is that it spurred Intel into making an FPU which in
    turn made implementing fp easy.

    Exactly the opposite: Intel decided that it wanted to make an FPU and it
    wanted the FPU to have the best FP arithmetic possible. So it
    commissioned Kahan (a renowned FP expert) to design the FP format.
    Kahan said "Why not use the VAX format? It is pretty good". Intel said
    it didn't want pretty good, it wanted the best, so Kahan said "ok" and
    designed the 8087 format.

    The IEEE standardization process happened AFTER the 8087 was already in progress. Other manufacturers signed onto it, some of them overcoming
    initial resistance, after becoming convinced that it was the right
    thing.

    http://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Tue Jul 22 11:52:04 2025
    From Newsgroup: comp.lang.forth

    On 22/07/2025 6:28 am, Paul Rubin wrote:
    dxf <dxforth@gmail.com> writes:
    AFAICS IEEE 754 offers nothing particularly useful for the end-user.
    Either one's fp application works - or it doesn't. IEEE hasn't
    changed that.

    The purpose of IEEE FP was to improve the numerical accuracy of
    applications that used it as opposed to other formats.

    IEEE's relevance is that it spurred Intel into making an FPU which in
    turn made implementing fp easy.

    Exactly the opposite, Intel decided that it wanted to make an FPU and it wanted the FPU to have the best FP arithmetic possible. So it
    commissioned Kahan (a renowned FP expert) to design the FP format.
    Kahan said "Why not use the VAX format? It is pretty good". Intel said
    it didn't want pretty good, it wanted the best, so Kahan said "ok" and designed the 8087 format.

    The IEEE standardization process happened AFTER the 8087 was already in progress. Other manufacturers signed onto it, some of them overcoming initial resistance, after becoming convinced that it was the right
    thing.

    http://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html

    There's nothing intrinsically "best" in IEEE's format. The best product
    on the market is what Intel wanted. It had been selling AMD's 9511
    single-precision FPU under licence. As Kahan says, wind of what Intel
    was doing got out and industry's response was to create a standard
    that even Intel couldn't ignore.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From B. Pym@Nobody447095@here-nor-there.org to comp.lang.forth on Tue Jul 29 15:07:23 2025
    From Newsgroup: comp.lang.forth

    B. Pym wrote:

    mhx wrote:

    On Sun, 6 Oct 2024 7:51:31 +0000, dxf wrote:

    Is there an easier way of doing this? End goal is a double number representing centi-secs.


    empty decimal

    : SPLIT ( a u c -- a2 u2 a3 u3 ) >r 2dup r> scan 2swap 2 pick - ;
    : >INT ( adr len -- u ) 0 0 2swap >number 2drop drop ;

    : /T ( a u -- $hour $min $sec )
    2 0 do [char] : split 2swap dup if 1 /string then loop
    2 0 do dup 0= if 2rot 2rot then loop ;

    : .T 2swap 2rot cr >int . ." hr " >int . ." min " >int . ." sec " ;

    s" 1:2:3" /t .t
    s" 02:03" /t .t
    s" 03" /t .t
    s" 23:59:59" /t .t
    s" 0:00:03" /t .t

    Why don't you use the fact that >NUMBER returns the given
    string starting with the first unconverted character?
    SPLIT should be redundant.

    -marcel

    : CHAR-NUMERIC? ( char -- flag ) 48 58 WITHIN ;
    : SKIP-NON-NUMERIC ( adr u -- adr2 u2)
    BEGIN
    DUP IF OVER C@ CHAR-NUMERIC? NOT ELSE 0 THEN
    WHILE
    1 /STRING
    REPEAT ;

    : SCAN-NEXT-NUMBER ( n adr len -- n2 adr2 len2)
    2>R 60 * 0. 2R> >NUMBER
    2>R D>S + 2R> ;

    : PARSE-TIME ( adr len -- seconds)
    0 -ROT
    BEGIN
    SKIP-NON-NUMERIC
    DUP
    WHILE
    SCAN-NEXT-NUMBER
    REPEAT
    2DROP ;

    S" hello 1::36 world" PARSE-TIME CR .
    96 ok

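    (SCAN-NEXT-NUMBER multiplies the accumulator by 60 before adding each
    converted number, so "hello 1::36 world" yields 1*60 + 36 = 96.)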

    : get-number ( accum adr len -- accum' adr' len' )
    { adr len }
    0. adr len >number { adr' len' }
    len len' =
    if
    2drop adr len 1 /string
    else
    d>s swap 60 * +
    adr' len'
    then ;

    : parse-time ( adr len -- seconds)
    0 -rot
    begin
    dup
    while
    get-number
    repeat
    2drop ;

    s" foo-bar" parse-time . 0
    s" foo55bar" parse-time . 55
    s" foo 1 bar 55 zoo" parse-time . 155
    s" and9foo 1 bar 55 zoo" parse-time . 32515

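    (With the same positional scheme, the last test converts 9, 1 and 55,
    giving (9*60 + 1)*60 + 55 = 32515.)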

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From B. Pym@Nobody447095@here-nor-there.org to comp.lang.forth on Tue Jul 29 15:22:17 2025
    From Newsgroup: comp.lang.forth

    B. Pym wrote:


    : get-number ( accum adr len -- accum' adr' len' )
    { adr len }
    0. adr len >number { adr' len' }
    len len' =
    if
    2drop adr len 1 /string
    else
    d>s swap 60 * +
    adr' len'
    then ;

    : parse-time ( adr len -- seconds)
    0 -rot
    begin
    dup
    while
    get-number
    repeat
    2drop ;

    s" foo-bar" parse-time . 0
    s" foo55bar" parse-time . 55
    s" foo 1 bar 55 zoo" parse-time . 155

    Actually prints 115.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Jul 30 03:35:09 2025
    From Newsgroup: comp.lang.forth

    On 30/07/2025 1:07 am, B. Pym wrote:
    ...
    : get-number ( accum adr len -- accum' adr' len' )
    { adr len }
    0. adr len >number { adr' len' }
    len len' =
    if
    2drop adr len 1 /string
    else
    d>s swap 60 * +
    adr' len'
    then ;

    : parse-time ( adr len -- seconds)
    0 -rot
    begin
    dup
    while
    get-number
    repeat
    2drop ;

    s" foo-bar" parse-time . 0
    s" foo55bar" parse-time . 55
    s" foo 1 bar 55 zoo" parse-time . 155
    s" and9foo 1 bar 55 zoo" parse-time . 32515

    : digit? ( c -- f ) 48 58 within ;

    : scan-digit ( a u -- a' u' )
    begin dup while
    over c@ digit? 0= while 1 /string
    repeat then ;

    : /number ( a u -- a' u' u2 )
    0. 2swap >number 2swap drop ;

    : parse-time ( adr len -- seconds)
    0 begin >r scan-digit dup while
    /number r> 60 * +
    repeat 2drop r> ;

    s" foo-bar" parse-time . 0 ok
    s" foo55bar" parse-time . 55 ok
    s" foo 1 bar 55 zoo" parse-time . 115 ok
    s" and9foo 1 bar 55 zoo" parse-time . 32515 ok

    --- Synchronet 3.21a-Linux NewsLink 1.2