• Rust, Forth and performance

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 08:46:35 2025
    From Newsgroup: comp.lang.forth

    I explored the performance of threads and futures (aka async/await, a
    variant of cooperative multitasking) in Rust, and as an example
    program, I have one thread/future that generates numbers from 0 to <n (n=10_000_000 in my measurements), one that takes numbers and converts
    them into strings, one that takes strings and outputs lines (up to
    linelength chars), and finally one that takes lines and prints them.
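
    The code of the thread and future variants is not included in this
    posting.  For illustration only, here is a minimal sketch of such a
    four-stage pipeline using std::thread and mpsc channels; this sketch
    is not the code I measured, just an outline of the structure described
    above:
    ---------------------------------------
    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let n: usize = 10_000_000;
        let linelength = 70;

        // One channel between each pair of neighbouring stages.
        let (num_tx, num_rx) = mpsc::channel::<usize>();
        let (str_tx, str_rx) = mpsc::channel::<String>();
        let (line_tx, line_rx) = mpsc::channel::<String>();

        // Stage 1: generate the numbers 0..n.
        let producer = thread::spawn(move || {
            for i in 0..n {
                num_tx.send(i).unwrap();
            }
        });

        // Stage 2: convert each number into a string.
        let converter = thread::spawn(move || {
            for i in num_rx {
                str_tx.send(i.to_string()).unwrap();
            }
        });

        // Stage 3: pack the strings into lines of up to linelength chars.
        let packer = thread::spawn(move || {
            let mut line = String::new();
            for s in str_rx {
                if !line.is_empty() && line.len() + 1 + s.len() > linelength {
                    line_tx.send(std::mem::take(&mut line)).unwrap();
                }
                if !line.is_empty() {
                    line.push(' ');
                }
                line.push_str(&s);
            }
            if !line.is_empty() {
                line_tx.send(line).unwrap();
            }
        });

        // Stage 4: print the lines.
        let printer = thread::spawn(move || {
            for line in line_rx {
                println!("{line}");
            }
        });

        producer.join().unwrap();
        converter.join().unwrap();
        packer.join().unwrap();
        printer.join().unwrap();
    }
    ---------------------------------------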

    The results are (with release builds, on a 4.7GHz Rocketlake, output
    piped to wc):

    3.59user 2.09system 0:02.64elapsed 215%CPU fillthread
    2.55user 0.51system 0:03.06elapsed 100%CPU fillasync (futures)

    The output consists of just 79MB, and I expected that to take less
    time, so I also made a version that has the whole logic in one piece
    instead of separate parts. That's easy in this case, because all but one
    part (constructing lines) is trivial. The result is:

    0.66user 0.42system 0:01.08elapsed 99%CPU fillseq (all-in-one)

    That's still more than I expected from such a program, and I find
    the high proportion of system time especially surprising, so I wanted
    to check out how Forth systems compare. So I wrote a Forth program
    for doing the same thing, and here are the results:

    0.90user 0.00system 0:00.91elapsed 99%CPU gforth-fast
    2.07user 8.59system 0:10.67elapsed 99%CPU lxf
    1.80user 5.05system 0:06.87elapsed 99%CPU sf64
    1.49user 5.06system 0:06.56elapsed 99%CPU vfx64

    The large amount of system time in systems other than Gforth is due to
    these systems performing a system call every time TYPE, SPACE, or CR
    is called, whereas Gforth uses buffered I/O and performs a system
    call only when the buffer is full. The large (but not as large as for
    some Forth systems) system time of the Rust binary indicates that
    there is lots of system calling going on there, too, probably one
    system call per line (in the Rust program, lines are built, and only
    then output).
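
    For illustration, the two I/O patterns look like this in Rust terms (a
    sketch, assuming the usual line-buffered Stdout; this is not code from
    any of the measured systems):
    ---------------------------------------
    use std::io::{self, BufWriter, Write};

    // One write() system call per line: Rust's Stdout is line-buffered, so
    // every newline flushes what has accumulated so far.
    fn per_line(lines: &[String]) {
        let mut out = io::stdout().lock();
        for l in lines {
            writeln!(out, "{l}").unwrap();
        }
    }

    // A write() system call only when the 8 KiB buffer fills up (or on
    // flush), roughly the behaviour described above for Gforth.
    fn block_buffered(lines: &[String]) {
        let mut out = BufWriter::new(io::stdout());
        for l in lines {
            writeln!(out, "{l}").unwrap();
        }
        out.flush().unwrap();
    }

    fn main() {
        let lines: Vec<String> = (0..5).map(|i| i.to_string()).collect();
        per_line(&lines);
        block_buffered(&lines);
    }
    ---------------------------------------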

    In order to check out how the Forth systems do without this system-call
    overhead, I defined synonyms for TYPE, SPACE and CR that work as
    noops, apart from the stack effect. So when the programs do not
    output their results, the run times are:

    0.68user 0.01system 0:00.70elapsed 98%CPU gforth
    0.20user 0.00system 0:00.20elapsed 98%CPU lxf
    0.43user 0.00system 0:00.43elapsed 98%CPU sf64
    0.31user 0.00system 0:00.32elapsed 99%CPU vfx64

    I would have expected the programs to spend a lot of time in division
    (inside #S) and to see smaller performance differences between systems,
    so seeing a 3.5 times speedup of lxf over gforth-fast is surprising to
    me; apparently the rest costs quite a bit more than the divisions.
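
    For those not fluent in pictured numeric output: #S produces one digit
    per division by BASE, roughly like the following Rust sketch (an
    illustration only; #S actually works on a double-cell number, and BASE
    is a run-time value, so the compiler cannot strength-reduce the
    division into a multiplication):
    ---------------------------------------
    // One division per digit; HOLD prepends each digit to the string under
    // construction, so the digits come out least-significant first.
    fn digits(mut u: u64, base: u64, hold: &mut Vec<u8>) {
        const CHARS: &[u8] = b"0123456789abcdefghijklmnopqrstuvwxyz";
        loop {
            let digit = (u % base) as usize; // remainder: the next digit
            u /= base;                       // quotient: the rest of the number
            hold.push(CHARS[digit]);
            if u == 0 {
                break;
            }
        }
    }

    fn main() {
        let mut hold = Vec::new();
        digits(12345, 10, &mut hold);
        // In Forth the digits are held right to left; here we reverse instead.
        hold.reverse();
        println!("{}", String::from_utf8(hold).unwrap());
    }
    ---------------------------------------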

    Some performance counter results:

    gforth-fast         lxf
      no output   no output
     3245981222   945394481  cycles
    11679661274  2648084410  instructions
     1391034028   447127429  branches
        1521428     1243916  branch-misses
            0.4        18.2  % tma_backend_bound
            3.9         6.1  % tma_bad_speculation
           24.6         4.3  % tma_frontend_bound
           71.1        71.4  % tma_retiring

    So, a lot of difference in the number of executed instructions. When
    drilling down into the tma_retiring result (perf stat -M tma_retiring_group
    ...), only 1% of the slots in the gforth-fast result are
    tma_heavy_operations, whereas 10.6% are for lxf. I assume that the
    divisions are heavy operations; but even 10.6% means that lxf spends
    much of its time elsewhere; drilling further down gives a result of
    ~17% tma_few_uops_instructions, which is probably mostly division.

    What does this say about the performance of the program with output
    enabled? If lxf used buffered I/O from glibc, it would probably need
    another 0.2s (the difference between gforth-fast without and with
    output), but that would still be quite fast.

    I have not tried to disable the output in Rust, because I expect that
    the compiler would then eliminate parts of the rest (Forth compilers
    are not that sophisticated yet, at least not for stuff like "<# #s #>"
    if the result is eventually 2DROPped), so it's hard to tell how Rust
    would perform with buffered I/O (buffered I/O is doable in Rust, but
    takes a little more work than I want to do at the moment). One
    problem is that it has to convert the number to a string, and it has
    to heap-allocate (and later free) that string. In my program, the
    concatenation of the strings is also a costly operation.

    Here are the programs:

    Forth:
    ---------------------------------------
    70 constant linelength

    \ synonym type 2drop
    \ synonym cr noop
    \ synonym space noop

    : main ( u1 -- )
        dup 0= if exit then
        ." 0"
        1 swap 1 ?do ( len )
            i 0 <# #s #> rot 2dup + 1+ linelength > if ( c-addr u len )
                if ( c-addr u )
                    cr then
                tuck
            else
                space 1+ over + -rot then
            type loop
        if cr then ;
    ---------------------------------------------

    Rust:
    --------------------------------------------
    use std::env;

    fn main() {
        let linelength = 2;
        let args: Vec<String> = env::args().collect();
        let arg1 = args.get(1).expect("Usage: <executable> <n>").parse::<usize>().unwrap();
        if arg1>0 {
            let mut line = "0".to_string();
            for i in 1..arg1 {
                let received = i.to_string();
                if line.len()+1+received.len() > linelength {
                    if line.len()>0 {
                        println!("{line}");
                    }
                    line = received;
                } else {
                    line = format!("{line} {received}");
                }
            }
            if line.len() > 0 {
                println!("{line}");
            }
        }
    }
    -----------------------------------------
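
    As an aside, the costly concatenation line = format!("{line} {received}")
    could be replaced by appending in place; a sketch of that (not one of
    the measured variants):
    ---------------------------------------
    // Build a line by appending in place.  The program above instead does
    // line = format!("{line} {received}"), which allocates a new String and
    // copies the old contents on every append.
    fn append_word(line: &mut String, word: &str) {
        if !line.is_empty() {
            line.push(' ');
        }
        line.push_str(word); // reuses (and occasionally grows) the existing buffer
    }

    fn main() {
        let mut line = String::new();
        for i in 0..10 {
            append_word(&mut line, &i.to_string());
        }
        println!("{line}");
    }
    ---------------------------------------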

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sat Nov 22 12:30:04 2025
    From Newsgroup: comp.lang.forth

    In article <2025Nov22.094635@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>
    I have not tried to disable the output in Rust, because I expect that
    the compiler would then eliminate parts of the rest (Forth compilers
    are not that sophisticated yet, at least not for stuff like "<# #s #>"
    if the result is eventually 2DROPped),

    It is hard to see that <# #S #> could be optimised away because of side
    effects in the PAD area in normal implementations.
    1.0 <# #S #> TYPE SPACE HLD @ PAD OVER - TYPE

    10 10 OK
    1.0 <# #S #> 2DROP HLD @ PAD OVER - TYPE
    10 OK

    My optimiser detects side-effect-free sequences and eliminates
    them if the result is dropped, but that is not the case here.
    Only a lispified[1] Forth that eliminates HLD can pull this off.
    Even detecting on the spot that an allocated buffer can be freed,
    such that it is possible to eliminate the allocation of that buffer
    (and the subsequent freeing), would be quite a feat.

    [1] lispified: all temporary buffers are allocated.

    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 11:38:15 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    0.66user 0.42system 0:01.08elapsed 99%CPU fillseq (all-in-one)

    That's still more than I expected from such a program, and I find
    the high proportion of system time especially surprising, so I wanted
    to check out how Forth systems compare. So I wrote a Forth program
    for doing the same thing, and here are the results:

    0.90user 0.00system 0:00.91elapsed 99%CPU gforth-fast
    ...
    In order to check out how the Forth systems do without this system-call
    overhead, I defined synonyms for TYPE, SPACE and CR that work as
    noops, apart from the stack effect. So when the programs do not
    output their results, the run times are:

    0.68user 0.01system 0:00.70elapsed 98%CPU gforth
    0.20user 0.00system 0:00.20elapsed 98%CPU lxf
    0.43user 0.00system 0:00.43elapsed 98%CPU sf64
    0.31user 0.00system 0:00.32elapsed 99%CPU vfx64

    I would have expected the programs to spend a lot of time in division
    (inside #S) and to see smaller performance differences between systems,
    so seeing a 3.5 times speedup of lxf over gforth-fast is surprising to
    me; apparently the rest costs quite a bit more than the divisions.

    Some performance counter results:

    gforth-fast         lxf
      no output   no output
     3245981222   945394481  cycles
    11679661274  2648084410  instructions
     1391034028   447127429  branches
        1521428     1243916  branch-misses
            0.4        18.2  % tma_backend_bound
            3.9         6.1  % tma_bad_speculation
           24.6         4.3  % tma_frontend_bound
           71.1        71.4  % tma_retiring
    ...
    What does this say about the performance of the program with output
    enabled? If lxf used buffered I/O from glibc, it would probably need
    another 0.2s (the difference between gforth-fast without and with
    output), but that would still be quite fast.
    ...
    (buffered I/O is doable in Rust, but takes a little more work than I
    want to do at the moment).

    I did that, in two ways:

    1) Construct the lines, and then output them unbuffered.
    2) Output each string buffered, and compute the line length explicitly.

    The timing results are:

    0.61user 0.00system 0:00.62elapsed 99%CPU 1 (output lines unbuffered)
    0.20user 0.02system 0:00.23elapsed 98%CPU 2 (output each string buffered)

    So the unbuffered line output costs ~0.5s in the Rust program, and
    constructing the lines costs ~0.4s. Unbuffered output of the shorter
    strings would probably be more expensive than constructing the lines
    and outputting that; the Rust buffered I/O obviously uses a more
    efficient way to construct the buffer than I use for the lines.

    Performance counter results:

     gforth-fast          lxf         Rust
       no output    no output   buffered 2
     3245_981222   945_394481  1062_756213  cycles
    11679_661274  2648_084410  4471_844679  instructions
     1391_034028   447_127429   888_494927  branches
        1_521428     1_243916     1_329412  branch-misses
             0.4         18.2          3.3  % tma_backend_bound
             3.9          6.1          3.8  % tma_bad_speculation
            24.6          4.3         10.9  % tma_frontend_bound
            71.1         71.4         82.0  % tma_retiring

    One
    problem is that it has to convert the number to a string, and it has
    to heap-allocate (and later free) that string.

    Maybe the compiler manages to optimize that away.

    Forth:
    ---------------------------------------
    70 constant linelength

    \ synonym type 2drop
    \ synonym cr noop
    \ synonym space noop

    : main ( u1 -- )
        dup 0= if exit then
        ." 0"
        1 swap 1 ?do ( len )
            i 0 <# #s #> rot 2dup + 1+ linelength > if ( c-addr u len )
                if ( c-addr u )
                    cr then
                tuck
            else
                space 1+ over + -rot then
            type loop
        if cr then ;
    ---------------------------------------------

    Rust, constructing lines and outputting them unbuffered, this time
    with the correct linelength:
    --------------------------------------------
    use std::env;

    fn main() {
        let linelength = 70;
        let args: Vec<String> = env::args().collect();
        let arg1 = args.get(1).expect("Usage: <executable> <n>").parse::<usize>().unwrap();
        if arg1>0 {
            let mut line = "0".to_string();
            for i in 1..arg1 {
                let received = i.to_string();
                if line.len()+1+received.len() > linelength {
                    if line.len()>0 {
                        println!("{line}");
                    }
                    line = received;
                } else {
                    line = format!("{line} {received}");
                }
            }
            if line.len() > 0 {
                println!("{line}");
            }
        }
    }
    -----------------------------------------

    Rust with buffered output of each string (2 above):
    ------------------------------------------------
    use std::env;
    use std::io::{BufWriter, Write};

    fn main() {
        let linelength = 70;
        let args: Vec<String> = env::args().collect();
        let arg1 = args.get(1).expect("Usage: <executable> <n>").parse::<usize>().unwrap();
        let mut out = BufWriter::new(std::io::stdout());
        if arg1>0 {
            write!(out,"0").unwrap();
            let mut len=1;
            for i in 1..arg1 {
                let received = i.to_string();
                if len+1+received.len() > linelength {
                    if len>0 {
                        writeln!(out,"").unwrap();
                    }
                    write!(out,"{received}").unwrap();
                    len = received.len();
                } else {
                    write!(out," {received}").unwrap();
                    len += received.len()+1;
                }
            }
            if len > 0 {
                writeln!(out,"").unwrap();
            }
        }
        out.flush().unwrap();
    }
    --------------------------------------------------

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 16:43:32 2025
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    In article <2025Nov22.094635@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>
    I have not tried to disable the output in Rust, because I expect that
    the compiler would then eliminate parts of the rest (Forth compilers
    are not that sophisticated yet, at least not for stuff like "<# #s #>"
    if the result is eventually 2DROPped),

    It is hard to see that <# #S #> could be optimised away because of side
    effects in the PAD area in normal implementations.

    Yes, if you don't know that the program terminates without accessing
    the hold area, you would have to perform at least the last <# #S #>. The
    earlier ones can be eliminated in this program, where the last one is
    guaranteed to be at least as long as the preceding ones, but that
    kind of reasoning would be pretty extreme for a compiler optimizer.

    If MAIN ends with BYE, however, the compiler can clearly see that the
    hold area is never used, so <# #S #> could be optimized into 2DROP. I
    don't think it makes much sense to teach that to Forth compilers, but
    for the Rust compiler, I would not be surprised.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 17:54:30 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    In order to check out how the Forth systems do without this system-call
    overhead, I defined synonyms for TYPE, SPACE and CR that work as
    noops, apart from the stack effect. So when the programs do not
    output their results, the run times are:

    I achieved a small speedup by optimizing #S. The new one looks as
    follows:

    : #s ( ud -- 0 0 ) \ core number-sign-s
        dup if   \ while the high cell is non-zero, use double-cell division
            begin
                #
            dup 0= until
        then
        drop begin   \ the high cell is now 0: single-cell division suffices
            base @ u/mod swap digit hold
        dup 0= until
        0 ;

    0.68user 0.01system 0:00.70elapsed 98%CPU gforth-fast old #S
    0.57user 0.00system 0:00.59elapsed 97%CPU gforth-fast new #S

    Performance counter results:

          old #S       new #S
     gforth-fast  gforth-fast          lxf         Rust
       no output    no output    no output   buffered 2
     3245_981222  2690_088360   945_394481  1062_756213  cycles
    11679_661274  9813_132978  2648_084410  4471_844679  instructions
     1391_034028  1204_585688   447_127429   888_494927  branches
        1_521428     1_520834     1_243916     1_329412  branch-misses
             0.4          3.3         18.2          3.3  % tma_backend_bound
             3.9          3.9          6.1          3.8  % tma_bad_speculation
            24.6         19.5          4.3         10.9  % tma_frontend_bound
            71.1         73.3         71.4         82.0  % tma_retiring

    I also looked at where Rust's buffered 2 variant spends its time, with
    perf record and perf report:

    18.62% fillseq1::main
    15.51% cfree@GLIBC_2.2.5
    11.40% core::fmt::write
    9.38% core::fmt::num::imp::<impl usize>::_fmt
    8.95% malloc
    7.99% _ZN81_$LT$std..io..default_write_fmt..Adapter$LT$T$GT$$u20$as$u20$core..f
    7.58% std::io::default_write_fmt
    7.12% __memmove_evex_unaligned_erms
    5.27% core::fmt::Formatter::pad

    [Everything else is <2.5% individually, and <10% total.]

    So malloc, free, and memmove consume a significant part of the remaining time.

    Looking into fillseq1::main, there is nothing that catches my eye in
    the hot part of the code.
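
    One way to get rid of the malloc/free (and some of the copying) would
    be to format the numbers directly into the BufWriter instead of going
    through to_string, and to compute the printed width by counting digits.
    A sketch (not measured; n hard-coded for brevity):
    ---------------------------------------
    use std::io::{BufWriter, Write};

    // Number of decimal digits of i, i.e. the printed width of the number.
    fn width(mut i: usize) -> usize {
        let mut w = 1;
        while i >= 10 {
            i /= 10;
            w += 1;
        }
        w
    }

    fn main() {
        let linelength = 70;
        let n: usize = 10_000_000;
        let mut out = BufWriter::new(std::io::stdout());
        write!(out, "0").unwrap();
        let mut len = 1;
        for i in 1..n {
            let w = width(i);
            if len + 1 + w > linelength {
                writeln!(out).unwrap();
                // The number is formatted straight into the output buffer;
                // no intermediate String is allocated.
                write!(out, "{i}").unwrap();
                len = w;
            } else {
                write!(out, " {i}").unwrap();
                len += w + 1;
            }
        }
        writeln!(out).unwrap();
        out.flush().unwrap();
    }
    ---------------------------------------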

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2