• Actually... why not?

    From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Jun 11 17:50:22 2025
    From Newsgroup: comp.lang.forth

    Now there's something that bothers me:
    using the PFA address of a variable seems to
    have only advantages; I don't
    see any disadvantages.
    So why don't the compilers work this
    way? I mean: when a variable is found,
    instead of its CFA, 'LIT PFA' should
    be compiled directly. When a constant
    is found — 'LIT <value>' should be compiled,
    instead of the constant's CFA.

    Am I missing anything, any potential problem?
    At the moment I don't see any. It should
    work properly in every situation. Yes,
    compilation will be a tad slower - but
    execution speed can be significantly
    increased.
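    The tradeoff can be sketched with a toy indirect-threaded model (a hypothetical Python sketch, not any particular Forth; `VAR_PFA`, `dovar`, `docon` are made-up stand-ins): compiling a variable as its CFA costs one cell but an extra dispatch at run time, while compiling 'LIT <pfa>' costs two cells and is pushed directly by the inner interpreter.

```python
# Toy model of indirect-threaded code (hypothetical, for illustration only).
# A word is a list of cells; the inner interpreter dispatches on each cell.

def inner(code, stack):
    ip = 0
    while ip < len(code):
        op = code[ip]; ip += 1
        if op == "lit":                  # inline literal: push the next cell
            stack.append(code[ip]); ip += 1
        else:                            # otherwise the cell is an execution token
            op(stack)

VAR_PFA = 1000                           # pretend data-field address of VARIABLE V
def dovar(stack): stack.append(VAR_PFA)  # run-time action behind a variable's CFA
def docon(stack): stack.append(5)        # run-time action behind 5 CONSTANT C

# Traditional compilation: one cell per use, but an extra dispatch at run time.
foo_cfa = [dovar, docon]
# LIT-style compilation: two cells per use, no dispatch through the CFA.
foo_lit = ["lit", VAR_PFA, "lit", 5]

for code in (foo_cfa, foo_lit):
    s = []
    inner(code, s)
    assert s == [VAR_PFA, 5]             # both strategies compute the same stack

print(len(foo_cfa), len(foo_lit))        # 2 vs 4 cells: the space cost at issue
```

    Both versions leave the same two items on the stack; the difference is purely cells-per-use versus dispatches-per-use.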

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jun 11 21:16:06 2025

    zbigniew2011@gmail.com (LIT) writes:
    > I mean: when a variable is found,
    > instead of its CFA, 'LIT PFA' should
    > be compiled directly. When a constant
    > is found — 'LIT <value>' should be compiled,
    > instead of the constant's CFA.

    That's what you get in Gforth since before 0.7:

    variable v ok
    5 constant c ok
    : foo v c ; ok
    simple-see foo
    $7FCBB58A0DF8 lit 0->0
    $7FCBB58A0E00 v
    $7FCBB58A0E08 lit 0->0
    $7FCBB58A0E10 #5
    $7FCBB58A0E18 ;s 0->0 ok

    The disassembly varies a little by version.

    > Am I missing anything, any potential problem?

    It requires more work in COMPILE, than just doing a ",". But having a user-extensible intelligent COMPILE, (like Gforth) offers a number of advantages, especially for native-code compilers.
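    The shape of such an intelligent COMPILE, can be sketched as follows (a hypothetical Python model of the dispatch, not Gforth's actual implementation; the `(kind, payload)` tuples stand in for dictionary entries):

```python
# Hypothetical model of an "intelligent COMPILE,": instead of always
# appending a word's execution token (plain ","), dispatch on what kind
# of word is being compiled.

code = []                            # the threaded code being laid down

def comma(x):                        # "," : append one cell
    code.append(x)

def compile_word(word):
    kind, payload = word             # stand-in for a dictionary-header lookup
    if kind == "constant":           # CONSTANT: inline the value
        comma("lit"); comma(payload)
    elif kind == "variable":         # VARIABLE: inline the data-field address
        comma("lit"); comma(payload)
    else:                            # everything else: fall back to plain ","
        comma(payload)

compile_word(("variable", 1000))
compile_word(("constant", 5))
compile_word(("primitive", "+"))
print(code)                          # ['lit', 1000, 'lit', 5, '+']
```

    A user-extensible version would let a word carry its own compilation action instead of hard-coding the kinds in one function.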

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Jun 12 09:52:15 2025

    > It requires more work in COMPILE, than just doing a ",". But having a user-extensible intelligent COMPILE, (like Gforth) offers a number of advantages, especially for native-code compilers.

    Indeed I didn't check Gforth yesterday, focusing
    on compilers for MS-DOS.

    It's actually unbelievable! All it takes is a rather
    minor modification in INTERPRET. So throughout
    all these years since the 70s FORTH could have executed
    programs significantly faster - but all the
    time they were selling/giving away listings
    that DIDN'T feature such an advantageous change?
    And even today compiler creators don't apply
    it, for no particular reason?

    Somehow it's beyond me; I don't understand why
    NOT to do it immediately and everywhere. "More
    work for a compiler"? It'll take it, I believe. :D
    I just modified the fig-Forth I'm tinkering with
    - it works like a charm.

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jun 12 10:08:02 2025

    zbigniew2011@gmail.com (LIT) writes:
    >> It requires more work in COMPILE, than just doing a ",". But having a
    >> user-extensible intelligent COMPILE, (like Gforth) offers a number of
    >> advantages, especially for native-code compilers.
    ...
    > It's actually unbelievable! All it takes is a rather
    > minor modification in INTERPRET.

    It only requires a change to COMPILE,. No change in INTERPRET.

    > So throughout
    > all these years since the 70s FORTH could have executed
    > programs significantly faster - but all the
    > time they were selling/giving away listings
    > that DIDN'T feature such an advantageous change?

    In the 1970s and early 1980s the bigger problem was code size rather
    than code performance. And if you compile a variable or constant into
    the CFA of the variable, this costs one cell, whereas compiling it
    into LIT followed by the address or value costs two cells. You can
    try this out in Gforth, which includes a traditional-style ITC engine
    (that compiles with "," in nearly all cases) and an engine that uses
    an intelligent COMPILE,. When you do

    variable v
    5 constant c
    : foo v c ;
    simple-see foo

    the output is:

    gforth (intelligent COMPILE,):
    $7F728BAA0DF8 lit 0->0
    $7F728BAA0E00 v
    $7F728BAA0E08 lit 0->0
    $7F728BAA0E10 #5
    $7F728BAA0E18 ;s 0->0

    gforth-itc (compiling with ","):
    $7F23366A0E10 v
    $7F23366A0E18 c
    $7F23366A0E20 ;s

    As for performance, here is what I measure on gforth-itc:

    sieve bubble matrix fib fft compile,
    0.173 0.187 0.142 0.253 0.085 ,
    0.164 0.191 0.134 0.242 0.088 opt-compile,

    There is quite a bit of variation between the runs on the Zen4 machine
    where I measured this.

    Invocation with

    gforth-itc onebench.fs # for compiling with ,
    gforth-itc -e "' opt-compile, is compile," onebench.fs

    > And even today compiler creators don't apply
    > it, for no particular reason?

    Which compiler creators do you have in mind? Those that compile for
    MS-DOS? With 64KB segments, they may prefer to be stingy with the
    code size.

    - anton
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Jun 12 11:20:31 2025

    > It only requires a change to COMPILE,. No change in INTERPRET.

    I modified the relevant branch of INTERPRET.

    >> So throughout
    >> all these years since the 70s FORTH could have executed
    >> programs significantly faster - but all the
    >> time they were selling/giving away listings
    >> that DIDN'T feature such an advantageous change?

    > In the 1970s and early 1980s the bigger problem was code size rather
    > than code performance. And if you compile a variable or constant into
    > the CFA of the variable, this costs one cell, whereas compiling it
    > into LIT followed by the address or value costs two cells.

    Please, have mercy... :D it's A SINGLE cell you're
    talking about. Even if (let's assume) the two bytes
    may(?) have had some meaning during the 70s, still in the 80s -
    in the era when 16 KB of RAM, and soon afterwards 64 KB,
    became the de facto standard - it wasn't a sane decision(?)
    to cripple the compiler(s) by "saving" (literally)
    a few bytes.

    >> And even today compiler creators don't apply
    >> it, for no particular reason?

    > Which compiler creators do you have in mind? Those that compile for
    > MS-DOS? With 64KB segments, they may prefer to be stingy with the
    > code size.

    64 KB is a whole lot compared to "savings"
    of (literally) two bytes per VARIABLE/CONSTANT
    definition. Say we've got 200 of them together
    in the program — so 400 bytes have been "saved"
    at the cost of significantly degraded performance.

  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Jun 12 11:23:34 2025

    Correction: not "per definition" but "per
    VARIABLE/CONSTANT use".

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jun 12 16:21:16 2025

    zbigniew2011@gmail.com (LIT) writes:
    > [Someone wrote:]
    >> In the 1970s and early 1980s the bigger problem was code size rather
    >> than code performance. And if you compile a variable or constant into
    >> the CFA of the variable, this costs one cell, whereas compiling it
    >> into LIT followed by the address or value costs two cells.

    > Please, have mercy... :D it's A SINGLE cell you're
    > talking about.

    It's a doubling of the threaded code for every use of a variable,
    constant, colon definition (call <addr>), etc.

    > Even if (let's assume) the two bytes
    > may(?) have had some meaning during the 70s, still in the 80s -
    > in the era when 16 KB of RAM, and soon afterwards 64 KB,
    > became the de facto standard - it wasn't a sane decision(?)
    > to cripple the compiler(s) by "saving" (literally)
    > a few bytes.

    If it's only a few bytes, that would mean that variables, constants,
    colon definitions etc. are not invoked much. In that case the savings
    in run-time are also likely to be small.

    As for the relevance of the difference in 64KB, I know of Forth
    programs that use more than 64KB, so it's not as if 64KB make it
    unnecessary to be economical with space.

    I measured loading brew (without running a benchmark) into gforth-itc.
    When compiling with "," (default in gforth-itc), the size on a 64-bit
    system grew by 748560 bytes, while with "opt-compile,", it grew by
    939448 bytes. That's a factor of 1.255 difference.

    > 64 KB is a whole lot compared to "savings"
    > of (literally) two bytes per VARIABLE/CONSTANT
    > definition. Say we've got 200 of them together
    > in the program — so 400 bytes have been "saved"
    > at the cost of significantly degraded performance.

    If we take the factor 1.255, a program that fits into 64KB when
    compiling with "," needs more than 80KB with "OPT-COMPILE,". Now trim
    this factor from the program in order to make it fit. The easiest way
    is to go back to compiling with ",".
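    Checking the arithmetic behind the size argument (using the brew measurement above):

```python
# The measured size factor, and what it implies for a program that just
# fits into a 64KB segment.
factor = 939448 / 748560          # opt-compile, size / "," size (brew load)
print(round(factor, 3))           # ~ 1.255
print(round(64 * factor, 1))      # a 64KB image would grow to roughly 80.3KB
```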

    As for "significantly degraded performance", as long as you stick with
    ITC, my results don't show that.

    - anton
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Jun 12 17:49:05 2025

    > As for "significantly degraded performance", as long as you stick with
    > ITC, my results don't show that.

    Indeed I have to admit (after doing a few
    more benchmarks) that on ITC the difference
    is actually negligible. Maybe this is the reason
    the fig-Forth people didn't bother.
    But I'm going to test a few other DTC/STC Forths;
    I'm curious about the results.

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jun 12 21:01:46 2025

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    > As for performance, here is what I measure on gforth-itc:
    >
    > sieve bubble matrix fib fft compile,
    > 0.173 0.187 0.142 0.253 0.085 ,
    > 0.164 0.191 0.134 0.242 0.088 opt-compile,
    >
    > There is quite a bit of variation between the runs on the Zen4 machine
    > where I measured this.

    That's not particularly impressive, but this primitive-centric code is
    a stepping stone for a number of further changes which overall produce
    a very good speedup. I demonstrate this with the following sequence
    of invocations:

    gforth-itc onebench.fs
    #let's add primitive-centric code
    gforth-itc -e "' opt-compile, is compile," onebench.fs
    #now switch to direct-threaded code:
    gforth --no-dynamic --ss-number=0 onebench.fs
    #now allow dynamic superinstructions with replication:
    gforth --ss-number=0 --opt-ip-updates=0 onebench.fs
    #switch to benchmarking engine (less precision in error reporting):
    gforth-fast --ss-number=0 --ss-states=1 --opt-ip-updates=0 onebench.fs
    #switch on static stack caching with three registers:
    gforth-fast --ss-number=0 --opt-ip-updates=0 onebench.fs
    #optimize away most IP updates:
    gforth-fast --ss-number=0 onebench.fs
    #enable static superinstructions:
    gforth-fast onebench.fs

    The results on a 5GHz Zen4 are (smaller is better):

    sieve bubble matrix fib fft
    0.173 0.184 0.142 0.247 0.085 gforth-itc
    0.163 0.190 0.134 0.238 0.089 let's add primitive-centric code
    0.164 0.187 0.130 0.246 0.085 now switch to direct-threaded code
    0.084 0.128 0.051 0.105 0.030 +dynamic superinstructions with replication
    0.053 0.061 0.032 0.049 0.018 switch to benchmarking engine
    0.053 0.059 0.031 0.042 0.015 +static stack caching with three registers
    0.020 0.021 0.011 0.027 0.013 +optimize away most IP updates
    0.020 0.021 0.011 0.027 0.012 +enable static superinstructions

    As you can see, the overall effect of these changes is quite big.
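    The cumulative effect can be computed from the first and last rows of the table above:

```python
# Speedup from gforth-itc (first row) to gforth-fast with everything
# enabled (last row), per benchmark; times taken from the table above.
itc  = {"sieve": 0.173, "bubble": 0.184, "matrix": 0.142, "fib": 0.247, "fft": 0.085}
fast = {"sieve": 0.020, "bubble": 0.021, "matrix": 0.011, "fib": 0.027, "fft": 0.012}
for b in itc:
    print(b, round(itc[b] / fast[b], 1))   # speedup factors, roughly 7x-13x
```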

    You may wonder what these funny words all mean. Here's a list of
    papers about these topics:

    primitive-centric code:
    https://www.complang.tuwien.ac.at/papers/ertl02.ps.gz

    dynamic superinstructions with replication: https://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz

    static stack caching: https://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz

    IP update optimization: https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2024.14

    Static superinstructions: https://www.complang.tuwien.ac.at/papers/ertl+02.ps.gz

    - anton
  • From minforth@gmx.net (minforth) to comp.lang.forth on Fri Jun 13 00:12:38 2025

    It looks like the biggest improvement came from switching
    to the benchmark engine. What does that mean?

  • From Paul Rubin <no.email@nospam.invalid> to comp.lang.forth on Thu Jun 12 17:32:53 2025

    minforth@gmx.net (minforth) writes:
    > It looks like the biggest improvement came from switching
    > to the benchmark engine. What does that mean?

    It means switching from the ITC interpreter to a faster one
    (gforth-fast) that uses a mixture of DTC and native code generation, if
    I have it right. The speedup is unsurprising.
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jun 13 05:09:39 2025

    minforth@gmx.net (minforth) writes:
    > It looks like the biggest improvement came from switching
    > to the benchmark engine. What does that mean?

    The speedup from switching to the benchmark engine means that the
    debugging features of the debugging engine have a cost. See below.

    However, the speedup factors from adding dynamic superinstructions
    with replication and from optimizing away IP updates are higher for
    several benchmarks.

    As for the cost of debugging features, let's look at the code for

    : squared dup * ;

    for the two engines compared in this step, and for default gforth-fast
    (all optimizations enabled):

    debugging                 benchmarking              benchmarking
    (with ip updates,         (with ip updates,         (all optimizations)
     no multi-state            no multi-state
     stack caching)            stack caching)

    dup 0->0                  dup 1->1                  dup 1->2
    mov $50[r13],r15          add rbx,$08               mov r15,r13
    add r15,$08               mov [r10],r13
    mov rax,[r14]             sub r10,$08
    sub r14,$08
    mov [r14],rax

    * 0->0                    * 1->1                    * 2->1
    mov $50[r13],r15          add rbx,$08               imul r13,r15
    add r15,$08               imul r13,$08[r10]
    mov rax,$08[r14]          add r10,$08
    imul rax,[r14]
    add r14,$08
    mov [r14],rax

    ;s 0->0                   ;s 1->1                   ;s 1->1
    mov $50[r13],r15          mov rbx,[r14]             mov rbx,[r14]
    mov rax,$58[r13]          add r14,$08               add r14,$08
    mov r10,[rax]             mov rax,[rbx]             mov rax,[rbx]
    add rax,$08               jmp eax                   jmp eax
    mov $58[r13],rax
    mov r15,r10
    mov rcx,[r15]
    jmp ecx
    In the debugging engine, you see, at the start of each primitive, the instruction

    mov $50[r13],r15

    This saves the current instruction pointer. If there is a signal,
    e.g., because a stack underflow produces a segmentation violation, the
    signal handler can then save the instruction pointer in the backtrace
    and cause a Forth-level THROW, and the system CATCH handler can then
    report exactly where the stack underflow happened.
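    The IP-saving idea can be sketched with a toy interpreter (a hypothetical Python stand-in for the C-level mechanism; the exception plays the role of the signal):

```python
# Hypothetical sketch: a toy inner interpreter that saves the current
# instruction pointer before dispatching each primitive, so an error
# raised inside a primitive can be reported with the exact position.

saved_ip = None                          # plays the role of "mov $50[r13],r15"

def dup(stack):
    if not stack:
        raise IndexError("stack underflow")
    stack.append(stack[-1])

def star(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

def inner(code, stack):
    global saved_ip
    for ip, prim in enumerate(code):
        saved_ip = (ip, prim.__name__)   # save ip before each primitive
        prim(stack)

squared = [dup, star]                    # : squared dup * ;
try:
    inner(squared, [])                   # run with an empty stack
except IndexError as e:
    # the "signal handler": saved_ip pinpoints where the underflow happened
    print("error:", e, "at", saved_ip)   # error: stack underflow at (0, 'dup')
```

    Without the per-primitive save, the handler would only know that something failed somewhere inside the word, which is the situation in the benchmarking engine.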

    In order for that to work, the signal handler also needs to know the
    return stack pointer, so in the debugging engine we don't keep the
    return stack pointer in a local variable (which ideally is kept in a
    register), but keep it in a struct, and we see the accesses to this
    struct in ";S":

    mov rax,$58[r13]
    ...
    mov $58[r13],rax

    The benchmarking engine does not have all these memory accesses.

    Moreover, in order to report stack underflows even in cases like DUP,
    the debugging engine keeps no stack item in a register across
    primitive boundaries, while the benchmarking engine keeps one stack
    item in a register in the second column and 0-3 in the third column.
    So we see all these accesses through [r14] (the data stack pointer)
    in the debugging engine, while we see fewer accesses through [r10]
    (the data stack pointer in this engine) in the second column, and no
    data stack memory access in the third column.

    Moreover, the debugging engine keeps the item below the stack bottom
    in inaccessible memory, so that every stack underflow produces a
    signal. This does not cost additional performance.

    The bottom line is that in the debugging engine every stack underflow
    causes a SIGSEGV, and we get a backtrace that includes the primitive
    that caused the stack underflow:

    : squared dup * ; ok
    .s <0> ok
    squared
    *the terminal*:3:1: error: Stack underflow
    squared<<<
    Backtrace:
    *terminal*:1:11: 0 $7FFB668A0DA0 dup

    Gforth also keeps information about the
    source-code-to-instruction-pointer mapping, and reports the location
    of the source code ("*terminal*:1:11:") in addition to decompiling the
    involved word ("dup"). The "0" is the index of the backtrace entry
    (if you want to look at the code for this backtrace entry), and the "$7FFB668A0DA0" is the actual value of the return stack item in the
    backtrace.

    By contrast, the benchmarking engine does not notice the stack
    underflow in this case, and even in cases where a primitive causes a
    signal (e.g., when @ tries to access inaccessible memory), neither the instruction pointer nor the return stack pointer are available to the
    signal handler, so you get no backtrace from THROWs due to signals
    caused by primitives.

    These are the differences between the first and second column, i.e.,
    between the debugging and the benchmarking engine at otherwise the
    same optimization level (without optimizing IP updates away, and
    without multi-state stack caching). Let's look at the differences
    between the second and third column (all optimizations).

    The first difference is that the threaded-code instruction pointer
    updates are optimized away in the third column. In the second column,
    they are still present:

    add rbx,$08

    at the start of DUP and *. At the start of ;S there is no such
    update, because the instruction pointer rbx is overwritten by the load
    of the instruction pointer from the return stack.

    The other difference is that the second column always has one stack
    item in a register, whereas the third column supports different stack representations. In particular, the "dup 1->2" means that DUP starts
    with one stack item in a register, and finishes with two stack items
    in registers; "* 2->1" means that * starts with two stack items in
    registers and ends with one stack item in a register. This
    multi-state stack caching eliminates the overhead of storing stack
    items to memory, loading stack items from memory, and updating the
    data stack pointer (r10 in column 2).
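    The memory-traffic effect can be reproduced with a rough model (hypothetical Python; it counts only data-stack loads and stores for DUP and *, ignoring IP updates and instruction fetch, and the `Machine` class and its clean/dirty bookkeeping are inventions for this sketch):

```python
# Rough model of stack caching: up to max_regs top-of-stack items live in
# "registers" between primitives; only data-stack memory traffic is counted.

class Machine:
    def __init__(self, mem, cached, max_regs):
        self.mem = list(mem)                       # memory-resident stack
        self.regs = [(v, False) for v in cached]   # (value, clean?) pairs, top last
        self.max_regs = max_regs
        self.loads = self.stores = 0

    def _fill(self):                               # load memory TOS into a register
        self.regs.insert(0, (self.mem[-1], True))  # clean: memory still holds it
        self.loads += 1

    def pop(self):
        if not self.regs:
            self._fill()
        v, clean = self.regs.pop()
        if clean:
            self.mem.pop()                         # drop the mirror (pointer bump, free)
        return v

    def peek(self):
        if not self.regs:
            self._fill()
        return self.regs[-1][0]

    def push(self, v):
        self.regs.append((v, False))               # dirty: not yet in memory

    def end_primitive(self):                       # enforce the caching state
        while len(self.regs) > self.max_regs:
            v, clean = self.regs.pop(0)
            if not clean:
                self.mem.append(v)
                self.stores += 1                   # spilling a dirty item costs a store

def run_squared(max_regs):
    # "squared" = DUP * applied to a stack holding one item (7)
    m = Machine([7], [], 0) if max_regs == 0 else Machine([], [7], max_regs)
    x = m.peek(); m.push(x); m.end_primitive()               # DUP
    b = m.pop(); a = m.pop(); m.push(a * b); m.end_primitive()  # *
    return m.loads + m.stores

# 5, 2, 0: matching the data-stack accesses of DUP and * in the three columns
print(run_squared(0), run_squared(1), run_squared(3))
```

    The three counts correspond to the three columns above: 5 data-stack accesses for DUP plus * in the no-caching engine, 2 with one cached item, and none with multi-state caching.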

    - anton
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jun 13 06:49:18 2025

    Paul Rubin <no.email@nospam.invalid> writes:
    > minforth@gmx.net (minforth) writes:
    >> It looks like the biggest improvement came from switching
    >> to the benchmark engine. What does that mean?
    >
    > It means switching from the ITC interpreter to a faster one
    > (gforth-fast) that uses a mixture of DTC and native code generation, if
    > I have it right.

    Already the second step "now switch to direct-threaded code" switches
    to DTC (but stays with the debugging engine), and "+dynamic
    superinstructions with replication" introduces the mixture of DTC and
    native code generation. The switch to the benchmarking engine is the
    step after that, and I explained it in another posting.

    - anton