• Rust, Forth and performance

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 08:46:35 2025
    From Newsgroup: comp.lang.forth

    I explored the performance of threads and futures (aka async/await, a
    variant of cooperative multitasking) in Rust, and as an example
    program, I have one thread/future that generates numbers from 0 to <n (n=10_000_000 in my measurements), one that takes numbers and converts
    them into strings, one that takes strings and outputs lines (up to
    linelength chars), and finally one that takes lines and prints them.
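
    The code of the thread and future variants is not included in this
    posting.  For illustration only, here is a minimal sketch of such a
    four-stage pipeline using std::thread and mpsc channels; this sketch
    is not the code I measured, just an outline of the structure described
    above:
    ---------------------------------------
    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let n: usize = 10_000_000;
        let linelength = 70;

        // One channel between each pair of neighbouring stages.
        let (num_tx, num_rx) = mpsc::channel::<usize>();
        let (str_tx, str_rx) = mpsc::channel::<String>();
        let (line_tx, line_rx) = mpsc::channel::<String>();

        // Stage 1: generate the numbers 0..n.
        let producer = thread::spawn(move || {
            for i in 0..n {
                num_tx.send(i).unwrap();
            }
        });

        // Stage 2: convert each number into a string.
        let converter = thread::spawn(move || {
            for i in num_rx {
                str_tx.send(i.to_string()).unwrap();
            }
        });

        // Stage 3: pack the strings into lines of up to linelength chars.
        let packer = thread::spawn(move || {
            let mut line = String::new();
            for s in str_rx {
                if !line.is_empty() && line.len() + 1 + s.len() > linelength {
                    line_tx.send(std::mem::take(&mut line)).unwrap();
                }
                if !line.is_empty() {
                    line.push(' ');
                }
                line.push_str(&s);
            }
            if !line.is_empty() {
                line_tx.send(line).unwrap();
            }
        });

        // Stage 4: print the lines.
        let printer = thread::spawn(move || {
            for line in line_rx {
                println!("{line}");
            }
        });

        producer.join().unwrap();
        converter.join().unwrap();
        packer.join().unwrap();
        printer.join().unwrap();
    }
    ---------------------------------------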

    The results are (with release builds, on a 4.7GHz Rocketlake, output
    piped to wc):

    3.59user 2.09system 0:02.64elapsed 215%CPU fillthread
    2.55user 0.51system 0:03.06elapsed 100%CPU fillasync (futures)

    The output consists of just 79MB, and I expected that to take less
    time, so I also made a version that has the whole logic in one piece
    instead of separate parts. That's easy in this case, because all but one
    part (constructing lines) is trivial. The result is:

    0.66user 0.42system 0:01.08elapsed 99%CPU fillseq (all-in-one)

    That's still more than I expected from such a program, and I find
    the high proportion of system time especially surprising, so I wanted
    to check out how Forth systems compare. So I wrote a Forth program
    for doing the same thing, and here are the results:

    0.90user 0.00system 0:00.91elapsed 99%CPU gforth-fast
    2.07user 8.59system 0:10.67elapsed 99%CPU lxf
    1.80user 5.05system 0:06.87elapsed 99%CPU sf64
    1.49user 5.06system 0:06.56elapsed 99%CPU vfx64

    The large amount of system time in systems other than Gforth is due to
    these systems performing a system call every time TYPE, SPACE, or CR
    is called, whereas Gforth uses buffered I/O and performs a system
    call only when the buffer is full. The large (but not as large as for
    some Forth systems) system time of the Rust binary indicates that
    there is lots of system calling going on there, too, probably one
    system call per line (in the Rust program, lines are built, and only
    then output).
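
    For illustration, the two I/O patterns look like this in Rust terms (a
    sketch, assuming the usual line-buffered Stdout; this is not code from
    any of the measured systems):
    ---------------------------------------
    use std::io::{self, BufWriter, Write};

    // One write() system call per line: Rust's Stdout is line-buffered, so
    // every newline flushes what has accumulated so far.
    fn per_line(lines: &[String]) {
        let mut out = io::stdout().lock();
        for l in lines {
            writeln!(out, "{l}").unwrap();
        }
    }

    // A write() system call only when the 8 KiB buffer fills up (or on
    // flush), roughly the behaviour described above for Gforth.
    fn block_buffered(lines: &[String]) {
        let mut out = BufWriter::new(io::stdout());
        for l in lines {
            writeln!(out, "{l}").unwrap();
        }
        out.flush().unwrap();
    }

    fn main() {
        let lines: Vec<String> = (0..5).map(|i| i.to_string()).collect();
        per_line(&lines);
        block_buffered(&lines);
    }
    ---------------------------------------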

    In order to check out how the Forth systems do without this system-call
    overhead, I defined synonyms for TYPE, SPACE and CR that work as
    noops, apart from the stack effect. So when the programs do not
    output their results, the run times are:

    0.68user 0.01system 0:00.70elapsed 98%CPU gforth
    0.20user 0.00system 0:00.20elapsed 98%CPU lxf
    0.43user 0.00system 0:00.43elapsed 98%CPU sf64
    0.31user 0.00system 0:00.32elapsed 99%CPU vfx64

    I would have expected the programs to spend a lot of time in division
    (inside #S) and to see smaller performance differences between systems,
    so seeing a 3.5 times speedup of lxf over gforth-fast is surprising to
    me; apparently the rest costs quite a bit more than the divisions.
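
    For those not fluent in pictured numeric output: #S produces one digit
    per division by BASE, roughly like the following Rust sketch (an
    illustration only; #S actually works on a double-cell number, and BASE
    is a run-time value, so the compiler cannot strength-reduce the
    division into a multiplication):
    ---------------------------------------
    // One division per digit; HOLD prepends each digit to the string under
    // construction, so the digits come out least-significant first.
    fn digits(mut u: u64, base: u64, hold: &mut Vec<u8>) {
        const CHARS: &[u8] = b"0123456789abcdefghijklmnopqrstuvwxyz";
        loop {
            let digit = (u % base) as usize; // remainder: the next digit
            u /= base;                       // quotient: the rest of the number
            hold.push(CHARS[digit]);
            if u == 0 {
                break;
            }
        }
    }

    fn main() {
        let mut hold = Vec::new();
        digits(12345, 10, &mut hold);
        // In Forth the digits are held right to left; here we reverse instead.
        hold.reverse();
        println!("{}", String::from_utf8(hold).unwrap());
    }
    ---------------------------------------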

    Some performance counter results:

    gforth-fast         lxf
      no output   no output
     3245981222   945394481  cycles
    11679661274  2648084410  instructions
     1391034028   447127429  branches
        1521428     1243916  branch-misses
            0.4        18.2  % tma_backend_bound
            3.9         6.1  % tma_bad_speculation
           24.6         4.3  % tma_frontend_bound
           71.1        71.4  % tma_retiring

    So, a lot of difference in the number of executed instructions. When
    drilling down into the tma_retiring result (perf stat -M tma_retiring_group
    ...), only 1% of the slots in the gforth-fast result are
    tma_heavy_operations, whereas 10.6% are for lxf. I assume that the
    divisions are heavy operations; but even 10.6% means that lxf spends
    much of its time elsewhere; drilling further down gives a result of
    ~17% tma_few_uops_instructions, which is probably mostly division.

    What does this say about the performance of the program with output
    enabled? If lxf used buffered I/O from glibc, it would probably need
    another 0.2s (the difference between gforth-fast without and with
    output), but that would still be quite fast.

    I have not tried to disable the output in Rust, because I expect that
    the compiler would then eliminate parts of the rest (Forth compilers
    are not that sophisticated yet, at least not for stuff like "<# #s #>"
    if the result is eventually 2DROPped), so it's hard to tell how Rust
    would perform with buffered I/O (buffered I/O is doable in Rust, but
    takes a little more work than I want to do at the moment). One
    problem is that it has to convert the number to a string, and it has
    to heap-allocate (and later free) that string. In my program, the
    concatenation of the strings is also a costly operation.

    Here are the programs:

    Forth:
    ---------------------------------------
    70 constant linelength

    \ synonym type 2drop
    \ synonym cr noop
    \ synonym space noop

    : main ( u1 -- )
        dup 0= if exit then
        ." 0"
        1 swap 1 ?do ( len )
            i 0 <# #s #> rot 2dup + 1+ linelength > if ( c-addr u len )
                if ( c-addr u )
                    cr then
                tuck
            else
                space 1+ over + -rot then
            type loop
        if cr then ;
    ---------------------------------------------

    Rust:
    --------------------------------------------
    use std::env;

    fn main() {
        let linelength = 2;
        let args: Vec<String> = env::args().collect();
        let arg1 = args.get(1).expect("Usage: <executable> <n>").parse::<usize>().unwrap();
        if arg1>0 {
            let mut line = "0".to_string();
            for i in 1..arg1 {
                let received = i.to_string();
                if line.len()+1+received.len() > linelength {
                    if line.len()>0 {
                        println!("{line}");
                    }
                    line = received;
                } else {
                    line = format!("{line} {received}");
                }
            }
            if line.len() > 0 {
                println!("{line}");
            }
        }
    }
    -----------------------------------------
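
    As an aside, the costly concatenation line = format!("{line} {received}")
    could be replaced by appending in place; a sketch of that (not one of
    the measured variants):
    ---------------------------------------
    // Build a line by appending in place.  The program above instead does
    // line = format!("{line} {received}"), which allocates a new String and
    // copies the old contents on every append.
    fn append_word(line: &mut String, word: &str) {
        if !line.is_empty() {
            line.push(' ');
        }
        line.push_str(word); // reuses (and occasionally grows) the existing buffer
    }

    fn main() {
        let mut line = String::new();
        for i in 0..10 {
            append_word(&mut line, &i.to_string());
        }
        println!("{line}");
    }
    ---------------------------------------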

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sat Nov 22 12:30:04 2025
    From Newsgroup: comp.lang.forth

    In article <2025Nov22.094635@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>
    I have not tried to disable the output in Rust, because I expect that
    the compiler would then eliminate parts of the rest (Forth compilers
    are not that sophisticated yet, at least not for stuff like "<# #s #>"
    if the result is eventually 2DROPped),

    It is hard to see that <# #S #> could be optimised away because of side
    effects in the PAD area in normal implementations.
    1.0 <# #S #> TYPE SPACE HLD @ PAD OVER - TYPE

    10 10 OK
    1.0 <# #S #> 2DROP HLD @ PAD OVER - TYPE
    10 OK

    My optimiser detects side-effect-free sequences and eliminates
    them if the result is dropped, but that is not the case here.
    Only a lispified[1] Forth that eliminates HLD can pull this off.
    Even detecting on the spot that an allocated buffer can be freed,
    such that it is possible to eliminate the allocation of that buffer
    (and the subsequent freeing), would be quite a feat.

    [1] lispified: all temporary buffers are allocated.

    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 11:38:15 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    0.66user 0.42system 0:01.08elapsed 99%CPU fillseq (all-in-one)

    That's still more than I expected from such a program, and I find
    the high proportion of system time especially surprising, so I wanted
    to check out how Forth systems compare. So I wrote a Forth program
    for doing the same thing, and here are the results:

    0.90user 0.00system 0:00.91elapsed 99%CPU gforth-fast
    ...
    In order to check out how the Forth systems do without this system-call
    overhead, I defined synonyms for TYPE, SPACE and CR that work as
    noops, apart from the stack effect. So when the programs do not
    output their results, the run times are:

    0.68user 0.01system 0:00.70elapsed 98%CPU gforth
    0.20user 0.00system 0:00.20elapsed 98%CPU lxf
    0.43user 0.00system 0:00.43elapsed 98%CPU sf64
    0.31user 0.00system 0:00.32elapsed 99%CPU vfx64

    I would have expected the programs to spend a lot of time in division
    (inside #S) and to see smaller performance differences between systems,
    so seeing a 3.5 times speedup of lxf over gforth-fast is surprising to
    me; apparently the rest costs quite a bit more than the divisions.

    Some performance counter results:

    gforth-fast         lxf
      no output   no output
     3245981222   945394481  cycles
    11679661274  2648084410  instructions
     1391034028   447127429  branches
        1521428     1243916  branch-misses
            0.4        18.2  % tma_backend_bound
            3.9         6.1  % tma_bad_speculation
           24.6         4.3  % tma_frontend_bound
           71.1        71.4  % tma_retiring
    ...
    What does this say about the performance of the program with output
    enabled? If lxf used buffered I/O from glibc, it would probably need
    another 0.2s (the difference between gforth-fast without and with
    output), but that would still be quite fast.
    ...
    (buffered I/O is doable in Rust, but takes a little more work than I
    want to do at the moment).

    I did that, in two ways:

    1) Construct the lines, and then output them unbuffered.
    2) Output each string buffered, and compute the line length explicitly.

    The timing results are:

    0.61user 0.00system 0:00.62elapsed 99%CPU 1 (output lines unbuffered)
    0.20user 0.02system 0:00.23elapsed 98%CPU 2 (output each string buffered)

    So the unbuffered line output costs ~0.5s in the Rust program, and
    constructing the lines costs ~0.4s. Unbuffered output of the shorter
    strings would probably be more expensive than constructing the lines
    and outputting that; the Rust buffered I/O obviously uses a more
    efficient way to construct the buffer than I use for the lines.

    Performance counter results:

     gforth-fast          lxf         Rust
       no output    no output   buffered 2
     3245_981222   945_394481  1062_756213  cycles
    11679_661274  2648_084410  4471_844679  instructions
     1391_034028   447_127429   888_494927  branches
        1_521428     1_243916     1_329412  branch-misses
             0.4         18.2          3.3  % tma_backend_bound
             3.9          6.1          3.8  % tma_bad_speculation
            24.6          4.3         10.9  % tma_frontend_bound
            71.1         71.4         82.0  % tma_retiring

    One
    problem is that it has to convert the number to a string, and it has
    to heap-allocate (and later free) that string.

    Maybe the compiler manages to optimize that away.

    Forth:
    ---------------------------------------
    70 constant linelength

    \ synonym type 2drop
    \ synonym cr noop
    \ synonym space noop

    : main ( u1 -- )
        dup 0= if exit then
        ." 0"
        1 swap 1 ?do ( len )
            i 0 <# #s #> rot 2dup + 1+ linelength > if ( c-addr u len )
                if ( c-addr u )
                    cr then
                tuck
            else
                space 1+ over + -rot then
            type loop
        if cr then ;
    ---------------------------------------------

    Rust, constructing lines and outputting them unbuffered, this time
    with the correct linelength:
    --------------------------------------------
    use std::env;

    fn main() {
        let linelength = 70;
        let args: Vec<String> = env::args().collect();
        let arg1 = args.get(1).expect("Usage: <executable> <n>").parse::<usize>().unwrap();
        if arg1>0 {
            let mut line = "0".to_string();
            for i in 1..arg1 {
                let received = i.to_string();
                if line.len()+1+received.len() > linelength {
                    if line.len()>0 {
                        println!("{line}");
                    }
                    line = received;
                } else {
                    line = format!("{line} {received}");
                }
            }
            if line.len() > 0 {
                println!("{line}");
            }
        }
    }
    -----------------------------------------

    Rust with buffered output of each string (2 above):
    ------------------------------------------------
    use std::env;
    use std::io::{BufWriter, Write};

    fn main() {
        let linelength = 70;
        let args: Vec<String> = env::args().collect();
        let arg1 = args.get(1).expect("Usage: <executable> <n>").parse::<usize>().unwrap();
        let mut out = BufWriter::new(std::io::stdout());
        if arg1>0 {
            write!(out,"0").unwrap();
            let mut len=1;
            for i in 1..arg1 {
                let received = i.to_string();
                if len+1+received.len() > linelength {
                    if len>0 {
                        writeln!(out,"").unwrap();
                    }
                    write!(out,"{received}").unwrap();
                    len = received.len();
                } else {
                    write!(out," {received}").unwrap();
                    len += received.len()+1;
                }
            }
            if len > 0 {
                writeln!(out,"").unwrap();
            }
        }
        out.flush().unwrap();
    }
    --------------------------------------------------

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 16:43:32 2025
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    In article <2025Nov22.094635@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>
    I have not tried to disable the output in Rust, because I expect that
    the compiler would then eliminate parts of the rest (Forth compilers
    are not that sophisticated yet, at least not for stuff like "<# #s #>"
    if the result is eventually 2DROPped),

    It is hard to see that <# #S #> could be optimised away because of side
    effects in the PAD area in normal implementations.

    Yes, if you don't know that the program terminates without accessing
    the hold area, you would have to perform at least the last <# #S #>. The
    earlier ones can be eliminated in this program, where the last one is
    guaranteed to be at least as long as the preceding ones, but that
    kind of reasoning would be pretty extreme for a compiler optimizer.

    If MAIN ends with BYE, however, the compiler can clearly see that the
    hold area is never used, so <# #S #> could be optimized into 2DROP. I
    don't think it makes much sense to teach that to Forth compilers, but
    for the Rust compiler, I would not be surprised.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Nov 22 17:54:30 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    In order to check out how the Forth systems do without this system-call
    overhead, I defined synonyms for TYPE, SPACE and CR that work as
    noops, apart from the stack effect. So when the programs do not
    output their results, the run times are:

    I achieved a small speedup by optimizing #S. The new one looks as
    follows:

    : #s ( ud -- 0 0 ) \ core number-sign-s
        dup if   \ while the high cell is non-zero, use double-cell division
            begin
                #
            dup 0= until
        then
        drop begin   \ the high cell is now 0: single-cell division suffices
            base @ u/mod swap digit hold
        dup 0= until
        0 ;

    0.68user 0.01system 0:00.70elapsed 98%CPU gforth-fast old #S
    0.57user 0.00system 0:00.59elapsed 97%CPU gforth-fast new #S

    Performance counter results:

          old #S       new #S
     gforth-fast  gforth-fast          lxf         Rust
       no output    no output    no output   buffered 2
     3245_981222  2690_088360   945_394481  1062_756213  cycles
    11679_661274  9813_132978  2648_084410  4471_844679  instructions
     1391_034028  1204_585688   447_127429   888_494927  branches
        1_521428     1_520834     1_243916     1_329412  branch-misses
             0.4          3.3         18.2          3.3  % tma_backend_bound
             3.9          3.9          6.1          3.8  % tma_bad_speculation
            24.6         19.5          4.3         10.9  % tma_frontend_bound
            71.1         73.3         71.4         82.0  % tma_retiring

    I also looked at where Rust's buffered 2 variant spends its time, with
    perf record and perf report:

    18.62% fillseq1::main
    15.51% cfree@GLIBC_2.2.5
    11.40% core::fmt::write
    9.38% core::fmt::num::imp::<impl usize>::_fmt
    8.95% malloc
    7.99% _ZN81_$LT$std..io..default_write_fmt..Adapter$LT$T$GT$$u20$as$u20$core..f
    7.58% std::io::default_write_fmt
    7.12% __memmove_evex_unaligned_erms
    5.27% core::fmt::Formatter::pad

    [Everything else is <2.5% individually, and <10% total.]

    So malloc, free, and memmove consume a significant part of the remaining time.

    Looking into fillseq1::main, there is nothing that catches my eye in
    the hot part of the code.
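
    One way to get rid of the malloc/free (and some of the copying) would
    be to format the numbers directly into the BufWriter instead of going
    through to_string, and to compute the printed width by counting digits.
    A sketch (not measured; n hard-coded for brevity):
    ---------------------------------------
    use std::io::{BufWriter, Write};

    // Number of decimal digits of i, i.e. the printed width of the number.
    fn width(mut i: usize) -> usize {
        let mut w = 1;
        while i >= 10 {
            i /= 10;
            w += 1;
        }
        w
    }

    fn main() {
        let linelength = 70;
        let n: usize = 10_000_000;
        let mut out = BufWriter::new(std::io::stdout());
        write!(out, "0").unwrap();
        let mut len = 1;
        for i in 1..n {
            let w = width(i);
            if len + 1 + w > linelength {
                writeln!(out).unwrap();
                // The number is formatted straight into the output buffer;
                // no intermediate String is allocated.
                write!(out, "{i}").unwrap();
                len = w;
            } else {
                write!(out, " {i}").unwrap();
                len += w + 1;
            }
        }
        writeln!(out).unwrap();
        out.flush().unwrap();
    }
    ---------------------------------------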

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2