• Optimizing #S

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Nov 23 09:36:31 2025
    From Newsgroup: comp.lang.forth

    I recently improved #S with a separate loop when the high cell of the
    input number is 0:

    : #s ( ud -- 0 0 ) \ core number-sign-s
        dup if
            begin
                #
            dup 0= until
        then
        drop begin
            base @ u/mod swap digit hold
        dup 0= until
        0 ;

    This gives a nice speedup for fillseq.4th
    <2025Nov22.185430@mips.complang.tuwien.ac.at>. I have now
    special-cased the second loop for base #10:

    : #s ( ud -- 0 0 ) \ core number-sign-s
        \G Used between @code{<<#} and @code{#>}. Prepend all digits of
        \G @var{ud} to the pictured numeric output string. @code{#s} will
        \G convert at least one digit. Therefore, if @var{ud} is 0,
        \G @code{#s} will prepend a ``0'' to the pictured numeric output
        \G string.
        dup if
            begin
                #
            dup 0= until
        then
        drop
        base @ #10 = if
            begin
                #10 u/mod swap '0' + hold
            dup 0= until
        else
            begin
                base @ u/mod swap digit hold
            dup 0= until
        then
        0 ;

    This provides another nice speedup (see below).
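
    As a rough sanity check (a sketch only, not part of the change
    itself), the three loops can each be exercised with standard words.
    The helper name ud>str is made up for this illustration.

```forth
\ Hypothetical helper: format an unsigned double through #S.
: ud>str ( ud -- c-addr u )  <# #s #> ;

decimal
1234 0 ud>str type cr   \ high cell 0, base 10: the #10 loop
0    0 ud>str type cr   \ #S still produces one digit for zero
-1   1 ud>str type cr   \ high cell non-zero: the # loop runs first
hex
FF   0 ud>str type cr   \ high cell 0, other base: the generic loop
decimal
```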

    I have also tried using a special primitive #10u/mod, but on
    Rocket Lake it caused a slowdown. Gcc selected code that used
    multiplication instead of division, and it computed the mod part
    not with one multiplication and one subtraction, but with several
    instructions, so the end result consumes more instructions. And on
    CPUs like Rocket Lake with fast division, it also consumes more
    cycles. Given that recent AMD CPUs also have fast division, I
    removed #10u/mod again. My guess is that gcc generated this code
    for Skylake and earlier Intel CPUs where division was slow.
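
    For readers unfamiliar with the technique, dividing by a constant
    through multiplication can be sketched in Forth as follows. This
    illustrates the general idea only; it is neither the removed
    primitive nor the code gcc generated. It assumes 64-bit cells, and
    the names 1/10-magic and u10/mod are made up for this example.

```forth
\ Sketch, assuming 64-bit cells.  The constant is ceil(2^67/10).
$CCCCCCCCCCCCCCCD constant 1/10-magic

: u10/mod ( u -- urem uquot )
    dup 1/10-magic um* nip   \ high cell of the 128-bit product
    3 rshift                 \ together: (u * magic) >> 67 = u / 10
    tuck #10 * - swap ;      \ rem = u - quot*10, then put quot on top

1234 u10/mod . . cr          \ prints quotient, then remainder
```

    The remainder costs an extra multiply and subtract on top of the
    quotient, which is why this only pays off where hardware division
    is slow.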

          old #S      #S opt1      #S opt2        worse
        one loop    two loops   + #10 loop   + #10u/mod
     3245_981222  2690_088360  2422_977895  2492_586635 cycles
    11679_661274  9813_132978  8564_869788  8909_131947 instructions
     1391_034028  1204_585688  1086_707686  1086_667791 branches
        1_521428     1_520834     1_516859     1_515857 branch-misses
             0.4          3.3          0.4          0.4 % tma_backend_bound
             3.9          3.9          3.5          3.5 % tma_bad_speculation
            24.6         19.5         25.4         25.8 % tma_frontend_bound
            71.1         73.3         70.7         70.4 % tma_retiring
    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/