• Idiomatic way to read a word of text from a file?

    From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Nov 17 15:25:20 2025
    From Newsgroup: comp.lang.forth

    I'm playing with the idea of writing a Roff-like text formatter in
    Forth. The input is lines of text "blah blech, and this that the
    other...". The text lines can be arbitrarily long so I don't want to
    read the entire line into a memory buffer using something like REFILL.

    Let's say I don't have to worry about individual words overflowing
    memory though (segfault is not allowed, but it's ok to panic and quit).
    So the main loop will be to copy an input word to the output buffer and
    maybe flush the output buffer. The output buffer can be of fixed size.

    Also, some input lines will be formatting commands like ".i\n" (change
    font to italic). Those lines should be given to the Forth text
    interpreter.

    I guess I could use the FILE word set to write something like getc()
    with its own buffering, but that seems messy. I'm wondering if this is
    a common situation and there's an idiomatic solution.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Tue Nov 18 13:21:48 2025
    From Newsgroup: comp.lang.forth

    On 18/11/2025 10:25 am, Paul Rubin wrote:
    > ...
    > I guess I could use the FILE word set to write something like getc()
    > with its own buffering, but that seems messy.

    Less messy when someone has already done it. See BFILE (buffered files)
    and its documentation SFILE.TXT included in the zip below.

    FILESYS2.ZIP

    https://drive.google.com/drive/folders/1kh2WcPUc3hQpLcz7TQ-YQiowrozvxfGw

    Some assembly required.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Nov 18 07:54:14 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    > I'm playing with the idea of writing a Roff-like text formatter in
    > Forth. The input is lines of text "blah blech, and this that the
    > other...". The text lines can be arbitrarily long so I don't want to
    > read the entire line into a memory buffer using something like REFILL.

    Why not? For something like Roff (or TeX or Markdown etc.) the whole
    input file easily fits into RAM, so a line would fit, too. The
    question is if the Forth system supports long lines in REFILL.

    To test this, I wrote the following program

    .( : x dup . cr source nip <> if ." wrong length: " source . drop bye then ; ) cr

    : gen ( n -- )
      60 swap 0 ?do
        \ 8-column number, then "  x" minus the ." delimiter (2 chars), then n spaces: n+10 chars
        dup 10 + 8 .r ."  x" dup spaces cr
        2* loop
      drop ;

    20 gen bye

    And then generated a file /tmp/long-lines.4th as follows:

    gforth ./gen-long-lines.4th >/tmp/long-lines.4th

    The longest line is more than 31M characters long; a system that can
    deal with that probably can deal with any line length for which it can
    allocate memory. Testing various systems by loading this file, I see:

    Gforth, lxf, and SwiftForth load the whole file.

    iforth-5.1-mini returns to the command line after printing 970,
    without error message. My guess is that iForth recognized that the
    next line is too long for its input buffer and silently called the
    command-line interpreter.

    vfx64: the last line interpreted is the one with 970 characters, and
    an error message "wrong length: 512" is shown. So vfx64 loaded the
    first 512 bytes of the 970-byte line, and gave it to the text
    interpreter. And then the interpreted program noticed that the line
    length is wrong, printed the error message and left the Forth system.

    Back to your question, we have at least three Forth systems capable of REFILLing long lines.

    > Also, some input lines will be formatting commands like ".i\n" (change
    > font to italic). Those lines should be given to the Forth text
    > interpreter.

    And here's the significance of REFILLing. You could pass everything to
    the text interpreter, and install the following recognizer sequence:
    First one that recognizes things like ".i\n", and second one that
    recognizes everything and then processes the line as ordinary words.

    An alternative solution I would use: slurp the whole file into memory,
    then process line by line (because of your line-oriented commands): In
    each line, check whether something like ".i\n" occurs (with the same
    logic that the recognizer would use); if so extract that line and
    EVALUATE it; if not, process the line as you do by default. I expect
    this alternative to be slightly more work, mainly because the line
    processing is not done automatically by the text interpreter.
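
    A rough, untested sketch of this alternative, using only standard
    FILE, MEMORY and STRING words. The names SLURP, SPLIT-LINE, DOT-LINE?
    and PROCESS-TEXT are made up for illustration. It assumes LF-only line
    ends, that each evaluated command line leaves the data stack as it
    found it, and it never FREEs the buffer:

    ```forth
    : slurp ( c-addr u -- c-addr2 u2 )   \ whole file into ALLOCATEd memory
      r/o open-file throw >r
      r@ file-size throw abort" file too large"  \ high cell of the size must be 0
      dup allocate throw swap            ( a u )
      2dup r@ read-file throw nip        ( a u2 )
      r> close-file throw ;

    : split-line ( c-addr u -- rest-addr rest-u line-addr line-u )
      2dup
      begin dup while over c@ 10 <> while 1 /string repeat then  ( a u a' u' )
      2swap drop                         ( a' u' a )
      2 pick over -                      ( a' u' a line-u )
      2swap dup if 1 /string then        \ step over the LF itself
      2swap ;

    : dot-line? ( c-addr u -- flag )     \ does the line start with "."?
      if c@ '.' = else drop false then ;

    : process-text ( c-addr u -- )
      begin dup while
        split-line
        \ TYPE CR is only a stand-in for the default word processing
        2dup dot-line? if evaluate else type cr then
      repeat 2drop ;
    ```

    Usage would be along the lines of: s" input.roff" slurp process-text
    (the file name is hypothetical).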

    > I guess I could use the FILE word set to write something like getc()
    > with its own buffering, but that seems messy. I'm wondering if this is
    > a common situation and there's an idiomatic solution.

    No, it's not common. If you have too little memory, the idiomatic
    solution is to limit the line length (see VFX64, even though it has
    enough memory).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Nov 18 13:05:49 2025
    From Newsgroup: comp.lang.forth

    In article <871plws0m7.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    > I'm playing with the idea of writing a Roff-like text formatter in
    > Forth. The input is lines of text "blah blech, and this that the
    > other...". The text lines can be arbitrarily long so I don't want to
    > read the entire line into a memory buffer using something like REFILL.
    >
    > Let's say I don't have to worry about individual words overflowing
    > memory though (segfault is not allowed, but it's ok to panic and quit).
    > So the main loop will be to copy an input word to the output buffer and
    > maybe flush the output buffer. The output buffer can be of fixed size.
    >
    > Also, some input lines will be formatting commands like ".i\n" (change
    > font to italic). Those lines should be given to the Forth text
    > interpreter.
    >
    > I guess I could use the FILE word set to write something like getc()
    > with its own buffering, but that seems messy. I'm wondering if this is
    > a common situation and there's an idiomatic solution.

    Going character by character gets you little speed. Mostly
    GET-FILE ("slurp-file") is the way to go. Then you can make the file
    contents the current input stream.
    SAVE, SET-SRC and RESTORE, and parsing words like PARSE-NAME and PARSE,
    are your friends.

    EXECUTE-PARSING is similar (but came 10 years after SAVE SET-SRC RESTORE):
    : EXECUTE-PARSING ROT ROT SAVE SET-SRC CATCH RESTORE THROW ;

    \ ----------------------
    #!/usr/bin/lina -s
    \ Copyright 2015 (c): Albert van der Horst, Dutch Forth Workshop by GPL

    \ wc , using all tricks ciforth has to offer, in script style.
    \ Usage: wc.script <filenames>

    ARGC 1 ?DO
        1 ARG[] 2DUP TYPE SPACE
        GET-FILE
        2DUP 0 >R BEGIN ^J $/ 2DROP OVER WHILE R> 1+ >R REPEAT R> . 2DROP
        2DUP SAVE SET-SRC 0 BEGIN NAME NIP WHILE 1+ REPEAT RESTORE .
        2DUP . DROP
        2DROP CR
        SHIFT-ARGS
    LOOP
    \ ----------------------
    (The -s option automatically loads argument handling and control
    structures interpretation.)

    Example :

    albert@sinas2:~/PROJECT/ciriscv$ wc.script [a-h]*.frt
    aap.frt 3 10 55
    blocks.frt 3712 20819 109460
    doit.frt 6 21 106
    hello.frt 0 3 19
    hellow.frt 1 8 40

    See also advance/lispl.frt

    https://github.com/albertvanderhorst/forthlisp

    See the definition of TOKEN there, which replaces NAME (my equivalent of WORD).

    The transformation generates a lisp turnkey.

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Nov 18 13:50:29 2025
    From Newsgroup: comp.lang.forth

    In article <2025Nov18.085414@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    > No, it's not common. If you have too little memory, the idiomatic
    > solution is to limit the line length (see VFX64, even though it has
    > enough memory).

    In ciforth the only restriction is that a word fits in the input
    buffer. In the context of nroff it makes no sense to use REFILL.
    There is no REFILL in the ciforth kernel.
    QUIT itself cuts the input into lines, without copying:
    `` REMAINDER 2@ ^J $/ '' giving a string variable ( addr len ).
    If you parse yourself for e.g nroff you don't need to do that.
    Even in the MSDOS version you can parse files; you are restricted
    only in the length of a word (64 chars).
    I consider the ^J / ^M as blank space.

    > - anton

    Groetjes Albert
  • From Gerry Jackson@do-not-use@swldwa.uk to comp.lang.forth on Tue Nov 18 12:56:15 2025
    From Newsgroup: comp.lang.forth

    On 17/11/2025 23:25, Paul Rubin wrote:
    > I'm playing with the idea of writing a Roff-like text formatter in
    > Forth. The input is lines of text "blah blech, and this that the
    > other...". The text lines can be arbitrarily long so I don't want to
    > read the entire line into a memory buffer using something like REFILL.
    >
    > Let's say I don't have to worry about individual words overflowing
    > memory though (segfault is not allowed, but it's ok to panic and quit).
    > So the main loop will be to copy an input word to the output buffer and
    > maybe flush the output buffer. The output buffer can be of fixed size.
    >
    > Also, some input lines will be formatting commands like ".i\n" (change
    > font to italic). Those lines should be given to the Forth text
    > interpreter.
    >
    > I guess I could use the FILE word set to write something like getc()
    > with its own buffering, but that seems messy. I'm wondering if this is
    > a common situation and there's an idiomatic solution.

    Have you seen the Sam Falvo video at: https://www.youtube.com/watch?v=mvrE2ZGe-rs

    where, from memory so it might be inaccurate in detail, he demonstrates
    the development of a text preprocessor to convert items like ~bw to html <bold>

    He slurps a text file for conversion into dataspace 4K bytes at a time,
    using HERE as the start address, and using READ-FILE until READ-FILE
    returns 0 bytes read. Then using HERE again calculates the size of the
    data slurped in. So no buffer allocation is needed but, of course, a
    really big file might run out of dataspace.
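
    From that description, the loop might look roughly like this. It is an
    untested sketch and not Falvo's actual code: SLURP-HERE and CHUNK are
    made-up names, and the UNUSED check is an addition so that running out
    of dataspace aborts instead of overwriting something:

    ```forth
    4096 constant chunk

    : slurp-here ( fileid -- c-addr u )
      >r here                               \ start of the slurped data
      begin
        unused chunk u< abort" out of data space"
        here chunk r@ read-file throw       ( start u2 )
        dup allot                           \ keep what was just read
      0= until                              \ READ-FILE delivered 0 chars: done
      r> drop
      here over - ;                         \ size = new HERE minus start
    ```

    The caller opens and closes the file, for example:
    s" in.txt" r/o open-file throw dup slurp-here rot close-file throw
    (the file name is hypothetical).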

    Also, what might be of interest is the way he develops an operator ===>
    (I think) that maps ~bw into <bold> by writing:
    ~bw ===> <bold>
    similarly for all other conversions to HTML. ISTR that ===> is
    non-standard as it uses return stack manipulation.
    --
    Gerry
  • From peter@peter.noreply@tin.it to comp.lang.forth on Tue Nov 18 22:54:03 2025
    From Newsgroup: comp.lang.forth

    On Mon, 17 Nov 2025 15:25:20 -0800
    Paul Rubin <no.email@nospam.invalid> wrote:

    > I'm playing with the idea of writing a Roff-like text formatter in
    > Forth. The input is lines of text "blah blech, and this that the
    > other...". The text lines can be arbitrarily long so I don't want to
    > read the entire line into a memory buffer using something like REFILL.
    >
    > Let's say I don't have to worry about individual words overflowing
    > memory though (segfault is not allowed, but it's ok to panic and quit).
    > So the main loop will be to copy an input word to the output buffer and
    > maybe flush the output buffer. The output buffer can be of fixed size.
    >
    > Also, some input lines will be formatting commands like ".i\n" (change
    > font to italic). Those lines should be given to the Forth text
    > interpreter.
    >
    > I guess I could use the FILE word set to write something like getc()
    > with its own buffering, but that seems messy. I'm wondering if this is
    > a common situation and there's an idiomatic solution.

    In lxf and lxf64 I have the following words to support processing files

    MAP-FILE ( addr len fam -- a2 l2 ior )
    UNMAP-FILE ( a2 l2 -- ior )
    maps a file into memory, fam is r/o or r/w

    GET-LINE ( a1 l1 -- a1 l3 a2 l2 )
    GET-WORD ( a1 l1 -- a1 l3 a2 l2 )
    takes a memory region, returns the first line/word on top of stack
    and remaining region below it.

    Here is a simple example to count lines and words in a file

    \ Process a file

    variable #words
    variable #lines

    : process-line ( a l -- )
      1 #lines +!
      begin
        dup while
          get-word 2drop 1 #words +!
      repeat
      2drop ;

    : process-file ( a l -- )
      r/o map-file throw
      0 #words ! 0 #lines !
      2dup
      begin
        dup while
          get-line process-line
      repeat
      2drop
      unmap-file throw
      ." the file has " #lines @ .
      ." lines and " #words @ .
      ." words!" ;

    As Anton has already noted, lxf uses this internally for parsing source files.
    Mapping the file uses memory regions outside the current process.
    It is like first allocating memory and then reading in the entire file.

    BR
    Peter

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Tue Nov 18 18:02:24 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    > Why not? For something like Roff (or TeX or Markdown etc.) the whole
    > input file easily fits into RAM, so a line would fit, too. The
    > question is if the Forth system supports long lines in REFILL.

    The target processor might not have that much ram. Back in school I
    used a version of Roff written for CP/M, which I ported to Turbo C for a
    PC-XT clone. On the other hand, none of my input files had very long
    lines. I was thinking of accommodating modern wysiwyg editors which
    don't have line breaks except at the end of paragraphs. Maybe that's
    not worthwhile.

    One obvious approach is to use READ-LINE, but this unfortunately seems
    to throw away the newline at the end of the line read, so it's hard to
    tell if a complete line has been read, or if the buffer has simply
    gotten full. Testing with gforth, if the buffer size is exactly the
    line length, then FILE-POSITION points to just after the line, and the
    next call to READ-LINE returns 0 chars. No idea about other Forths.

    > And here's the significance of REFILLing. You could pass everything to
    > the text interpreter, and install the following recognizer sequence:
    > First one that recognizes things like ".i\n", and second one that
    > recognizes everything and then processes the line as ordinary words.

    I'll see if I can figure out how to do that, though the target Forth
    might not have recognizers. What I wanted is a loop like

        LOOP
          READ a line;
          IF line begins with ".", then pass the line to the text interpreter;
          ELSE loop through the words on the line, copying them to the output
               buffer or maybe to the output device
        END LOOP

    Getting words from the line should preferably use Forth's built-in
    parser.
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Tue Nov 18 18:24:49 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    > As Anton has already noted lxf uses this internally for parsing source files
    > Mapping the file uses memory regions outside the current process.
    > It is like first allocating memory and then reading in the entire file

    Yes mmap is a virtual memory thing though. My current thought is to use
    READ-LINE with some fixed buffer size, but have two contiguous buffers
    and call READ-LINE twice, to handle the case where a word is split
    across two buffers. Then process all words (whitespace terminated)
    until the last whitespace is found. Anything left gets copied back
    to the beginning of the double buffer, before calling READ-LINE again.
    It's not worth bothering with true double buffering.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Nov 19 07:27:22 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    > anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    >> Why not? For something like Roff (or TeX or Markdown etc.) the whole
    >> input file easily fits into RAM, so a line would fit, too. The
    >> question is if the Forth system supports long lines in REFILL.
    >
    > The target processor might not have that much ram.

    "Might"? If you have a concrete target with small RAM (say, one of
    the Mecrisp targets), it will come with additional restrictions, but
    maybe also additional capabilities (like accessing the input file
    directly in its flash storage), and all of this might change how you
    approach the problem.

    > I was thinking of accommodating modern wysiwyg editors which
    > don't have line breaks except at the end of paragraphs. Maybe that's
    > not worthwhile.

    If you have a machine where you run a WYSIWYG editor, you also have
    enough RAM for keeping one line (and probably the whole text). The
    systems with small RAM memory tend to have only a line editor (with
    80-char lines), or maybe a screen editor (with 1KB screens).

    > One obvious approach is to use READ-LINE, but this unfortunately seems
    > to throw away the newline at the end of the line read, so it's hard to
    > tell if a complete line has been read, or if the buffer has simply
    > gotten full.

    <https://forth-standard.org/standard/file/READ-LINE> says:
    |When u1 = u2 the line terminator has yet to be reached.

    > Testing with gforth, if the buffer size is exactly the
    > line length, then FILE-POSITION points to just after the line, and the
    > next call to READ-LINE returns 0 chars. No idea about other Forths.

    With u1=line length, the only way to satisfy the requirement above is
    to deliver it as two parts, one with u2=u1, the other with u2=0.
    READ-LINE and the case where u1=line length have been discussed
    several times, so apparently it's not so clear to some how a system
    should behave, so you may want to check the system you use, and report
    a bug to the system implementor if it does not behave correctly.
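
    Taking the u1 = u2 rule at face value, a line longer than the buffer
    can be consumed in pieces. The following is an untested sketch that
    only counts the length of the next line; WHOLE-LINE-LENGTH, PIECE and
    /PIECE are made-up names, and it relies on the system behaving as the
    standard text quoted above says:

    ```forth
    80 constant /piece
    create piece /piece 2 + allot         \ READ-LINE wants u1+2 chars of room

    : whole-line-length {: fid | total -- u flag :}  \ flag false: end of file
      0 to total
      begin
        piece /piece fid read-line throw  ( u2 flag )
        over total + to total
        swap /piece = over and            ( flag more-pieces? )
      while
        drop                              \ u2 = u1: terminator not reached yet
      repeat
      total dup 0<> rot or ;              \ a final unterminated line still counts
    ```

    A formatter would process each piece where this sketch merely adds up
    the lengths.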

    As for FILE-POSITION, what I see in Gforth is (output after "\"):

    s" /tmp/long-lines.4th" r/o open-file throw constant f \ ok
    pad 74 f read-line throw . . \ -1 74 ok
    f file-position throw ud. \ 74 ok
    pad 70 f read-line throw . . \ -1 0 ok
    f file-position throw ud. \ 75 ok

    That's with a file with one-byte newlines.

    >> And here's the significance of REFILLing. You could pass everything to
    >> the text interpreter, and install the following recognizer sequence:
    >> First one that recognizes things like ".i\n", and second one that
    >> recognizes everything and then processes the line as ordinary words.
    >
    > I'll see if I can figure out how to do that, though the target Forth
    > might not have recognizers. What I wanted is a loop like
    >
    >   LOOP
    >     READ a line;
    >     IF line begins with ".", then pass the line to the text interpreter;
    >     ELSE loop through the words on the line, copying them to the output
    >          buffer or maybe to the output device
    >   END LOOP

    If you rely on REFILL (but then you have to INCLUDE the file, and have
    an executable word at its start), you could implement that as:

    : process-line ( -- )
      begin
        source nip >in @ u> while
          parse-name type \ or whatever you want to do with words
      repeat ;

    : roff ( -- ) \ untested
      begin
        refill while
          source if
            c@ '.' = if
              \ command line: evaluate it and jump back to BEGIN, whose dest
              \ is third from the top of the control-flow stack here
              source evaluate [ 2 cs-pick ] again then
          else
            drop then
          process-line
      repeat ;

    The alternative with the recognizers avoids the need to say ROFF at
    the start of the file. Or you have a Forth system that implements EXECUTE-PARSING-FILE.

    Alternatively, you could go for using READ-LINE. In that case I would
    treat too-long lines as errors, and the result would look as follows:

    80 constant line-length \ however long lines you have space to process
    create line line-length 2 + allot

    : process-line ( c-addr u -- )
      ... \ without PARSE-NAME support unless you use EXECUTE-PARSING
    ;

    : roff {: file-id -- :}
      begin
        line line-length file-id read-line throw while
          dup line-length >= abort" line too long"
          dup if
            line c@ '.' = if
              \ BEGIN's dest is third from the top of the control-flow stack
              line swap evaluate [ 2 cs-pick ] again then
          then
          line swap process-line
      repeat
      drop ; \ discard the u2 left over from the final READ-LINE

    > Getting words from the line should preferably use Forth's built-in
    > parser.

    That means going through INCLUDE, EVALUATE, EXECUTE-PARSING, or EXECUTE-PARSING-FILE at some point.

    - anton
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Nov 19 08:45:18 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    > My current thought is to use
    > READ-LINE with some fixed buffer size, but have two contiguous buffers
    > and call READ-LINE twice, to handle the case where a word is split
    > across two buffers. Then process all words (whitespace terminated)
    > until the last whitespace is found. Anything left gets copied back
    > to the beginning of the double buffer, before calling READ-LINE again.

    Given your original requirement of having a buffer big enough for a
    word, but not for a line, I would not bother with READ-LINE.

    If you don't mind slow speed, one way to go (as used by some Forth
    systems for implementing READ-LINE) is to fill the buffer with
    READ-FILE, find the start of a word, REPOSITION-FILE to that start,
    READ-FILE the buffer, and you have your word in the buffer.
    REPOSITION-FILE at the end of the word, and then repeat for the next
    word.

    A more efficient approach is to READ-FILE into a buffer of size >=1
    (the larger, the more efficient), and copy from that buffer into a
    word buffer when a word is found. After delivering a word, you have
    to remember where in the buffer you were, and continue from there.
    Whenever you reach the end of the buffer, refill it with READ-FILE.

    In either case, when you find a newline (and you have to know how
    newlines can be represented), check the next character after the
    newline for ".", and deal with that. But "pass the line to the text
    interpreter" does not work if you have no space for a buffer that
    contains a whole line.
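
    The more efficient approach could be sketched roughly as follows
    (untested). All names here (BUF, NEXT-CHAR, NEXT-WORD, WORDS-OF and so
    on) are made up for illustration, the buffer sizes are arbitrary, and
    any character code up to and including the space counts as a word
    separator. The sketch does not track line starts, so spotting "."
    after a newline would need an extra flag set in NEXT-CHAR:

    ```forth
    256 constant /buf
    create buf /buf allot
    variable buf-len  0 buf-len !       \ number of valid chars in buf
    variable buf-pos  0 buf-pos !       \ index of the next char to deliver
    variable in-fid                     \ file id of the input file

    : fill-buf ( -- )
      buf /buf in-fid @ read-file throw buf-len !  0 buf-pos ! ;

    : next-char ( -- c true | false )   \ false at end of file
      buf-pos @ buf-len @ = if fill-buf then
      buf-len @ 0= if false exit then
      buf buf-pos @ + c@  1 buf-pos +!  true ;

    64 constant /word
    create wordbuf /word allot
    variable word-len

    : blank? ( c -- flag )  bl 1+ u< ;  \ space, newline, other control chars

    : add-char ( c -- )
      word-len @ /word = abort" word too long"   \ panic rather than overflow
      wordbuf word-len @ + c!  1 word-len +! ;

    : next-word ( -- c-addr u )         \ u = 0 at end of file
      0 word-len !
      begin next-char while
        dup blank? if
          drop word-len @ if wordbuf word-len @ exit then
        else
          add-char
        then
      repeat
      wordbuf word-len @ ;

    : words-of ( c-addr u -- )          \ print each word of the named file
      r/o open-file throw in-fid !
      0 buf-len !  0 buf-pos !
      begin next-word dup while type cr repeat 2drop
      in-fid @ close-file throw ;
    ```

    The position in the buffer is remembered in BUF-POS between calls, so
    NEXT-WORD simply continues where the previous call stopped.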

    - anton
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Wed Nov 19 11:22:08 2025
    From Newsgroup: comp.lang.forth

    In article <87o6oyrc7i.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    > peter <peter.noreply@tin.it> writes:
    >> As Anton has already noted lxf uses this internally for parsing source files
    >> Mapping the file uses memory regions outside the current process.
    >> It is like first allocating memory and then reading in the entire file
    >
    > Yes mmap is a virtual memory thing though. My current thought is to use
    > READ-LINE with some fixed buffer size, but have two contiguous buffers
    > and call READ-LINE twice, to handle the case where a word is split
    > across two buffers. Then process all words (whitespace terminated)
    > until the last whitespace is found. Anything left gets copied back
    > to the beginning of the double buffer, before calling READ-LINE again.
    > It's not worth bothering with true double buffering.

    The ciforth approach is more sensible. The terminal input buffer is
    filled from the input stream. It is large, say 16K.
    Now you carve lines out of the buffer, and use a parse pointer,
    maybe >IN. As soon as you find that there are no more line endings
    in the remaining buffer, you copy the remainder to the start of
    the buffer and fill the buffer to the brim.
    In this case you will not copy more than is necessary.
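
    In standard Forth words (not ciforth's own code) the scheme might be
    sketched like this, untested. INBUF, TOP-UP, NEXT-LINE and the other
    names are made up, a separate variable stands in for >IN, and LF-only
    line ends are assumed:

    ```forth
    16384 constant /inbuf
    create inbuf /inbuf allot
    variable in-len  0 in-len !         \ number of valid chars in inbuf
    variable in-pos  0 in-pos !         \ parse pointer

    : remainder ( -- c-addr u )  inbuf in-len @  in-pos @ /string ;

    : top-up ( fileid -- )              \ slide the tail to the front, then fill
      >r
      remainder dup >r  inbuf swap move
      r> inbuf over +  /inbuf 2 pick -  ( u dst room )
      r> read-file throw                ( u u2 )
      + in-len !  0 in-pos ! ;

    : nl-offset ( c-addr u -- n )       \ index of first LF, or u if none
      tuck 0 ?do
        dup i + c@ 10 = if 2drop i unloop exit then
      loop drop ;

    : next-line ( -- c-addr u true | false )  \ false: no complete line buffered
      remainder 2dup nl-offset          ( c-addr u n )
      2dup = if drop 2drop false exit then
      nip dup 1+ in-pos +! true ;

    : each-line ( fileid -- )           \ TYPE CR stands in for real processing
      >r
      begin
        begin next-line while type cr repeat
        remainder nip dup /inbuf = abort" line longer than buffer"
        r@ top-up
        in-len @ =                      \ nothing new arrived: end of file
      until
      remainder type                    \ last line without a terminator
      r> drop ;
    ```

    Only the unprocessed tail is ever moved, which is the point of the
    scheme.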

    Groetjes Albert
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Nov 19 17:43:19 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    > "Might"? If you have a concrete target with small RAM (say, one of
    > the Mecrisp targets),

    I was thinking Fuzix (fuzix.org) doesn't have a Roff or a Forth that I
    know of, but it might be an interesting target. It runs on the
    Raspberry Pi Pico (264K of RAM and 2MB of flash) which is enough for
    most Forth purposes.

    > If you have a machine where you run a WYSIWYG editor, you also have
    > enough RAM for keeping one line (and probably the whole text).

    The idea is to format files that came from other people and other
    machines. Though that's maybe kind of dumb because nobody cares about
    Roff any more. It interests me because I used to use it for my school
    papers. IDK if I have any of those files around any more though.

    > READ-LINE and the case where u1=line length have been discussed
    > several times, so apparently it's not so clear to some how a system
    > should behave,

    That sounds like a deficiency in the standard, but anyway yes, there are
    ways to get around it.

    > Alternatively, you could go for using READ-LINE. In that case I would
    > treat too-long lines as errors

    Yeah I think that's the sanest approach. Another idea is to abandon
    Roff syntax altogether, and go for an HTML subset or similar. I hate
    Markdown but that's yet another possibility.

    > That means going through INCLUDE, EVALUATE, EXECUTE-PARSING, or
    > EXECUTE-PARSING-FILE at some point.

    I see. That can still be ok of course. It's in the Forth spirit, I
    think, to just do something hacky to detect whether the token read by
    PARSE-NAME is at the beginning of a line.
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Nov 20 19:03:11 2025
    From Newsgroup: comp.lang.forth

    On 20/11/2025 12:43 pm, Paul Rubin wrote:
    > ...
    >> READ-LINE and the case where u1=line length have been discussed
    >> several times, so apparently it's not so clear to some how a system
    >> should behave,
    >
    > That sounds like a deficiency in the standard, but anyway yes, there are
    > ways to get around it.

    When you say 'get around it', do you mean a broken line? If so what can
    be done with a broken line because AFAIK not much. I replaced 'flag' in
    READ-LINE with a trinary and I'm now looking for uses :-)

  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Nov 20 12:49:32 2025
    From Newsgroup: comp.lang.forth

    In article <691ecb3e$1@news.ausics.net>, dxf <dxforth@gmail.com> wrote:
    > On 20/11/2025 12:43 pm, Paul Rubin wrote:
    >> ...
    >>> READ-LINE and the case where u1=line length have been discussed
    >>> several times, so apparently it's not so clear to some how a system
    >>> should behave,
    >>
    >> That sounds like a deficiency in the standard, but anyway yes, there are
    >> ways to get around it.
    >
    > When you say 'get around it', do you mean a broken line? If so what can
    > be done with a broken line because AFAIK not much. I replaced 'flag' in
    > READ-LINE with a trinary and I'm now looking for uses :-)

    The TIB in ciforth is huge 1), and it is read full and used as a buffer
    in the context of redirecting.
    Then split into lines (as long as you fancy doing this), and
    if there is no more newline, copy the remainder to the start of
    the buffer, and fill again. REFILL-TIB in this context is hardly
    more complicated than a classic REFILL.
    QUIT uses (ACCEPT) to cut up the input buffer, but since this
    is an indirect threaded Forth (ACCEPT) can be revectored.
    It is easy to add a kludge to detect `` ^J.i ''.

    1) It doesn't matter if it is small, actually.

    Groetjes Albert
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Thu Nov 20 14:53:12 2025
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    > When you say 'get around it', do you mean a broken line?

    The trouble distinguishing between a broken and an unbroken line when
    u1=line length. I think for roff though, it's ok to abandon the wish to
    handle arbitrarily long lines. The deficiency in the standard is not
    explaining READ-LINE's exact behaviour in this situation. I was able to
    experimentally resolve the issue in gforth, but other Forths might vary.

    I had for a while liked the idea of running the entire input document
    through the text interpreter (wordlists or some other scheme would stop
    non-formatting-commands from being looked up as Forth words). But I
    later mostly lost interest in that.

    I guess in the text interpreter, the interpreter is at the beginning of
    a line iff >IN @ gives 0, and maybe Roff could use that.
  • From dxf@dxforth@gmail.com to comp.lang.forth on Fri Nov 21 13:33:51 2025
    From Newsgroup: comp.lang.forth

    On 21/11/2025 9:53 am, Paul Rubin wrote:
    > dxf <dxforth@gmail.com> writes:
    >> When you say 'get around it', do you mean a broken line?
    >
    > The trouble distinguishing between a broken and an unbroken line when
    > u1=line length. I think for roff though, it's ok to abandon the wish to
    > handle arbitrarily long lines. The deficiency in the standard is not
    > explaining READ-LINE's exact behaviour in this situation. I was able to
    > experimentally resolve the issue in gforth, but other Forths might vary.

    My sense is anyone who uses READ-LINE (or its equivalent in other langs) is
    doing so on the basis that the line read is complete, i.e. arbitrarily long
    lines are not considered - and if one must - then use some other strategy.

    I had for a while liked the idea of running the entire input document
    through the text interpreter (wordlists or some other scheme would stop non-formatting-commands from being looked up as Forth words). But I
    later mostly lost interest in that.

    I guess in the text interpreter, the interpreter is at the beginning of
    a line iff >IN @ gives 0, and maybe Roff could use that.

    I seem to have enough tools and incentives to not use the text interpreter. OTOH if it's there and doesn't get in the way, I can understand why people do.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Nov 21 08:15:52 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    dxf <dxforth@gmail.com> writes:
    When you say 'get around it', do you mean a broken line?

    The trouble distinguishing between a broken and an unbroken line when
    u1=line length.

    There is no such trouble in standard systems. Such a line will be
    broken on such a system.

    The deficiency in the standard is not
    explaining READ-LINE's exact behaviour in this situation.

    The behaviour is specified exactly:

    |If a line terminator was received before u1 characters were read, then
    |u2 is the number of characters, not including the line terminator,
    |actually read [...]. When u1 = u2 the line terminator has
    |yet to be reached.

    So the first sentence tells you what happens if line length < u1. And
    the second sentence tells you what happens if line length >= u1.

    The deficiency in the standard is in the part that I elided: It says:
    (0<=u2<=u1). It does not really contradict that text, but it has
    misled a number of people (including you) into thinking that the first
    sentence also includes line length = u1. And the many questions about
    this issue show this deficiency.
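
    To make the two cases concrete, here is a minimal sketch (not taken from any post in this thread) that counts the lines of a file through a small fixed buffer. It assumes READ-LINE behaves as the quoted text specifies, i.e. u2 = u1 means the terminator has not been reached yet. /CHUNK, CHUNK, LINES, PENDING and COUNT-LINES are names made up for illustration.

    ```forth
    \ Sketch only: count lines of any length through a small fixed buffer.
    16 constant /chunk                    \ u1 passed to READ-LINE (assumed size)
    create chunk /chunk 2 + chars allot   \ READ-LINE needs room for u1+2 chars

    variable lines      \ complete lines seen so far
    variable pending    \ true while a line has only been read in part

    : count-lines ( fileid -- n )
      >r  0 lines !  false pending !
      begin
        chunk /chunk r@ read-line throw   ( u2 flag )
      while                               \ flag is false at end of file
        /chunk < if                       \ u2 < u1: terminator was consumed
          1 lines +!  false pending !
        else                              \ u2 = u1: the line continues
          true pending !
        then
      repeat
      drop  r> drop                       \ discard u2 and the fileid
      pending @ if  1 lines +!  then      \ unterminated last line of full chunks
      lines @ ;
    ```

    Under that reading, a line of exactly 16 characters yields u2 = 16 on the first call and u2 = 0 with a true flag on the second, so it is counted once.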

    - anton
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Nov 21 08:59:29 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The behaviour is specified exactly:

    |If a line terminator was received before u1 characters were read, then
    |u2 is the number of characters, not including the line terminator,
    |actually read [...]. When u1 = u2 the line terminator has
    |yet to be reached.

    So the first sentence tells you what happens if line length < u1. And
    the second sentence tells you what happens if line length >= u1.

    The deficiency in the standard is in the part that I elided: It says:
    (0<=u2<=u1). It does not really contradict that text, but it has
    misled a number of people (including you) into thinking that the first
    sentence also includes line length = u1. And the many questions about
    this issue show this deficiency.

    BTW, this has all been spelled out in

    <https://forth-standard.org/standard/file/READ-LINE#contribution-216>

    - anton
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Fri Nov 21 11:35:02 2025
    From Newsgroup: comp.lang.forth

    In article <2025Nov21.091552@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Paul Rubin <no.email@nospam.invalid> writes:
    dxf <dxforth@gmail.com> writes:
    When you say 'get around it', do you mean a broken line?

    The trouble distinguishing between a broken and an unbroken line when
    u1=line length.

    There is no such trouble in standard systems. Such a line will be
    broken on such a system.

    The kernel system is of low quality if it insists on the language
    design standards of the 1970s, such as those frozen in CORE.

    QUIT uses (ACCEPT) in ciforth:
    ( -- sc )
    Accept characters from the terminal, until a RET is received and
    return the result as a constant string sc. It doesn't contain any
    line ending, but the buffer still does and after 1+ the string ends
    in a LF. The editing functions are the same as with ACCEPT .
    The buffer (and the resulting string) is limited to 16 K characters.

    [Note that this documentation is fitted exactly to the system at hand: an
    MS-DOS ciforth shows a 64 char limit and 2 + to contain the string's end.]

    With this and EXECUTE-PARSING or equivalent functionality the
    problems as sketched vanish.
    (ACCEPT) is, as it were, a redesigned READ-LINE leaning on the concept
    of a string constant ( addr len -- ) where the area that is passed
    is non-writable.


    The deficiency in the standard is not
    explaining READ-LINE's exact behaviour in this situation.

    The behaviour is specified exactly:

    |If a line terminator was received before u1 characters were read, then
    |u2 is the number of characters, not including the line terminator,
    |actually read [...]. When u1 = u2 the line terminator has
    |yet to be reached.

    The problem is that it builds on the 1970s CORE and its ideas. It is awkward.

    <SNIP>

    - anton

    Groetjes Albert
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Nov 21 11:20:33 2025
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    In article <2025Nov21.091552@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    The behaviour is specified exactly:

    |If a line terminator was received before u1 characters were read, then
    |u2 is the number of characters, not including the line terminator,
    |actually read [...]. When u1 = u2 the line terminator has
    |yet to be reached.

    The problem is that it builds on the 1970s CORE and its ideas. It is awkward.

    The complexity of READ-LINE's specification comes from the following requirements:

    * Do not ALLOCATE (ALLOCATE is not guaranteed to be present, and
    probably is not present on small systems), so the caller of
    READ-LINE has to pass in a buffer description.

    * Support arbitrarily long lines (that's a post-1970s attitude, BTW).

    * Report the end of the file.

    Did I forget a requirement?

    If the first requirement was dropped, the specification could become
    much simpler, e.g.,

    READ-ALLOC-LINE ( file-id -- c-addr u ior )

    If an error happens during the operation, return 0 0 n!=0; in that
    case the file position after the operation may be anywhere between
    the start of the line and end of the line (both included).

    If READ-ALLOC-LINE is called when the file position is at the end of
    the file, return 0 0 0.

    Otherwise, c-addr u describes the contents of the line (without
    terminator, even if there is one) and ior is 0. The line lives in
    ALLOCATEd data space and the caller of READ-ALLOC-LINE is
    responsible for FREEing it. After READ-ALLOC-LINE, the file
    position points at the start of the next line.

    READ-ALLOC-LINE can be implemented using READ-LINE, ALLOCATE and
    RESIZE. I fail to come up with a use of READ-LINE with fixed-size
    buffers and arbitrarily long lines that cannot be implemented just as
    well with READ-FILE and the knowledge what bytes may represent
    newlines.
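
    As an illustration of the first claim, READ-ALLOC-LINE might be sketched on top of READ-LINE, ALLOCATE and RESIZE roughly as follows. This is an editorial sketch under stated assumptions, not code from any Forth system: /GROW and the RA- variables are hypothetical helpers, the word is not reentrant, and READ-LINE is assumed to follow the u1 = u2 rule quoted earlier.

    ```forth
    \ Sketch only: READ-ALLOC-LINE ( fileid -- c-addr u ior )
    64 constant /grow            \ initial size and growth step (assumed)
    variable ra-fid   variable ra-buf
    variable ra-size  variable ra-len

    : ra-fail ( ior -- 0 0 ior )  ra-buf @ free drop  0 0 rot ;

    : read-alloc-line ( fileid -- c-addr u ior )
      ra-fid !  0 ra-len !  /grow ra-size !
      ra-size @ 2 + chars allocate            ( a ior )
      ?dup if  nip 0 0 rot exit  then  ra-buf !
      begin
        ra-buf @ ra-len @ chars +             \ free tail of the buffer
        ra-size @ ra-len @ -                  ( c-addr u1 )
        dup >r  ra-fid @ read-line            ( u2 flag ior ) ( R: u1 )
        ?dup if  r> drop  nip nip ra-fail exit  then
        0= if                                 \ end of file
          drop  r> drop
          ra-len @ 0= if  ra-buf @ free drop  0 0 0 exit  then
          ra-buf @ ra-len @ 0 exit            \ unterminated last line
        then
        dup ra-len +!                         ( u2 ) ( R: u1 )
        r> < if  ra-buf @ ra-len @ 0 exit  then  \ terminator consumed
        /grow ra-size +!                      \ u2 = u1: grow and read on
        ra-buf @ ra-size @ 2 + chars resize   ( a ior )
        ?dup if  nip ra-fail exit  then  ra-buf !
      again ;

    \ Usage: type every line of an open file, freeing each buffer.
    : list-file ( fileid -- )
      >r begin
        r@ read-alloc-line throw  over        ( c-addr u c-addr )
      while
        over >r  type cr  r> free throw
      repeat  2drop  r> drop ;
    ```

    An empty line comes back as a non-zero c-addr with u = 0, which is how the caller tells it apart from the 0 0 0 returned at end of file.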

    - anton
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Fri Nov 21 18:57:53 2025
    From Newsgroup: comp.lang.forth

    On 18-11-2025 00:25, Paul Rubin wrote:
    I'm playing with the idea of writing a Roff-like text formatter in
    Forth. The input is lines of text "blah blech, and this that the
    other...". The text lines can be arbitrarily long so I don't want to
    read the entire line into a memory buffer using something like REFILL.

    Let's say I don't have to worry about individual words overflowing
    memory though (segfault is not allowed, but it's ok to panic and quit).
    So the main loop will be to copy an input word to the output buffer and
    maybe flush the output buffer. The output buffer can be of fixed size.

    Also, some input lines will be formatting commands like ".i\n" (change
    font to italic). Those lines should be given to the Forth text
    interpreter.

    I guess I could use the FILE word set to write something like getc()
    with its own buffering, but that seems messy. I'm wondering if this is
    a common situation and there's an idiomatic solution.

    I don't know if it's "idiomatic", but it works. In essence, it reads the
    file binary. If there is something left at the end of the buffer, it
    copies that to the start, adjusts the buffer address and size and
    continues. It doesn't return a word per call, you open the file and it
    applies a quotation to each word parsed (a n --).

    No, it's not beautiful, but it works. BTW, if you happen to be German
    and your prose contains words that exceed 256 characters, you're on your
    own.

    Hans Bezemer

    ---8<---
    256 constant /line

    /line buffer: linebuf

    : eow?
    case
    bl of true endof
    9 of true endof
    10 of true endof
    13 of true endof
    false swap
    endcase
    ;
    \ correct for last word
    : ?lastword over 0= if linebuf swap chars + + else drop then over - ;
    : -leading begin dup while over c@ bl = while 1 /string repeat then ;
    \ get a word
    : get-word ( addr1 n2 -- addr2 n2 f)
    >r over 0 2swap bounds ?do i c@ eow? if i + leave then loop
    r> over >r ?lastword r> \ word is delimited by white space
    ;

    : parse-line ( xt a n --)
    dup >r begin \ save length
    2dup r@ -rot 2>r get-word ( xt a1 n1 a2 n2 n3)
    while \ if we read a complete word
    >r over r@ swap execute
    r> 2r> rot 1+ /string \ execute the action
    repeat 2rdrop rdrop \ adjust the buffer
    ;

    : open-txt s" netstrng.4th" r/o bin open-file abort" Cannot open 'myfile.txt'" ;
    : adjust >r linebuf r@ cmove linebuf /line r> /string ;
    : close-txt close-file abort" Cannot close 'myfile.txt'" ;

    : parse-file ( h xt -- h)
    swap >r linebuf /line \ put xt on execution stack
    begin
    r@ read-file 0= over 0<> and \ read the file buffer
    while \ if not an empty line
    linebuf swap parse-line adjust \ parse line and adjust buffer
    repeat drop drop r> \ return handle
    ;

    open-txt [: -leading -trailing type cr ;] parse-file close-txt
    ---8<---


  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sat Nov 22 09:25:35 2025
    From Newsgroup: comp.lang.forth

    In article <nnd$6ed8c839$50daf997@634f347ff0045267>,
    Hans Bezemer <the.beez.speaks@gmail.com> wrote:
    On 18-11-2025 00:25, Paul Rubin wrote:
    I'm playing with the idea of writing a Roff-like text formatter in
    Forth. The input is lines of text "blah blech, and this that the
    other...". The text lines can be arbitrarily long so I don't want to
    read the entire line into a memory buffer using something like REFILL.

    Let's say I don't have to worry about individual words overflowing
    memory though (segfault is not allowed, but it's ok to panic and quit).
    So the main loop will be to copy an input word to the output buffer and
    maybe flush the output buffer. The output buffer can be of fixed size.

    Also, some input lines will be formatting commands like ".i\n" (change
    font to italic). Those lines should be given to the Forth text
    interpreter.

    I guess I could use the FILE word set to write something like getc()
    with its own buffering, but that seems messy. I'm wondering if this is
    a common situation and there's an idiomatic solution.

    I don't know if it's "idiomatic", but it works. In essence, it reads the
    file binary. If there is something left at the end of the buffer, it
    copies that to the start, adjusts the buffer address and size and
    continues. It doesn't return a word per call, you open the file and it
    applies a quotation to each word parsed (a n --).

    That is exactly how it works in ciforth too for file redirection.
    (Reading from the console just fills the TIB.)


    No, it's not beautiful, but it works. BTW, if you happen to be German
    and your prose contains words that exceed 256 characters, you're on your
    own.

    Not beautiful?

    Hans Bezemer

    Groetjes Albert
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Nov 23 00:28:38 2025
    From Newsgroup: comp.lang.forth

    On 22/11/2025 4:57 am, Hans Bezemer wrote:
    On 18-11-2025 00:25, Paul Rubin wrote:
    ...
    I guess I could use the FILE word set to write something like getc()
    with its own buffering, but that seems messy.  I'm wondering if this is
    a common situation and there's an idiomatic solution.

    I don't know if it's "idiomatic", but it works. In essence, it reads the file binary. If there is something left at the end of the buffer, it copies that to the start, adjusts the buffer address and size and continues. It doesn't return a word per call, you open the file and it applies a quotation to each word parsed (a n --).

    No, it's not beautiful, but it works. BTW, if you happen to be German and your prose contains words that exceed 256 characters, you're on your own.
    ...

    I know it's cheating but then that's what libraries are for ;-)

    1 fload bfile
    \ readch ( -- c true | false )

    256 constant /word \ max word length

    /word reserve constant wordbuf

    : eow? ( c -- f )
    bl of true end
    9 of true end
    10 of true end
    13 of true end
    drop false
    ;

    : parse-file ( xt -- )
    >r 0 begin readch while
    dup eow? if
    drop wordbuf swap r@ execute 0
    else
    over wordbuf + c! 1+
    then
    repeat drop rdrop ;

    : .word ( a u -- ) ( -leading -trailing) type cr ;

    : run ( a u -- )
    r/o openin ['] .word parse-file closein ;



  • From Kerr-Mudd, John@admin@127.0.0.1 to comp.lang.forth on Mon Nov 24 19:01:15 2025
    From Newsgroup: comp.lang.forth

    On Wed, 19 Nov 2025 11:22:08 +0100
    albert@spenarnc.xs4all.nl wrote:

    In article <87o6oyrc7i.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    peter <peter.noreply@tin.it> writes:
    As Anton has already noted lxf uses this internally for parsing source files
    Mapping the file uses memory regions outside the current process.
    It is like first allocating memory and then reading in the entire file

    Yes mmap is a virtual memory thing though. My current thought is to use
    READ-LINE with some fixed buffer size, but have two contiguous buffers
    and call READ-LINE twice, to handle the case where a word is split
    across two buffers. Then process all words (whitespace terminated)
    until the last whitespace is found. Anything left gets copied back
    to the beginning of the double buffer, before calling READ-LINE again.
    It's not worth bothering with true double buffering.

    The ciforth approach is more sensible. The terminal input buffer is
    filled from the input stream. It is large say 16K.
    Now you carve lines out of the buffer, and use a parse pointer,
    maybe >IN. As soon as you find that there are no more line endings
    in the remaining buffer, you copy the remainder to the start of
    the buffer and fill the buffer to the brim.
    In this case you will not copy more than is necessary.

    A good approach, IMHO.
    But how do you deal with the case where there's no line ending at EoF?
    ISTM an additional action is required for the last line.
    --
    Bah, and indeed Humbug.
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Nov 25 12:59:47 2025
    From Newsgroup: comp.lang.forth

    In article <20251124190115.f6256c9bdaacb9ac4603a4ea@127.0.0.1>,
    Kerr-Mudd, John <admin@127.0.0.1> wrote:
    On Wed, 19 Nov 2025 11:22:08 +0100
    albert@spenarnc.xs4all.nl wrote:

    In article <87o6oyrc7i.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:

    The ciforth approach is more sensible. The terminal input buffer is
    filled from the input stream. It is large say 16K.
    Now you carve lines out of the buffer, and use a parse pointer,
    maybe >IN. As soon as you find that there are no more line endings
    in the remaining buffer, you copy the remainder to the start of
    the buffer and fill the buffer to the brim.
    In this case you will not copy more than is necessary.

    A good approach, IMHO.
    But how do you deal with the case where there's no line ending at EoF?
    ISTM an additional action is required for the last line.

    There is an action at EoF regardless.
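
    The scheme described above (fill a large buffer, carve lines out of it, move the remainder to the front, refill, and act on whatever is left at end of file) might be sketched as follows. This is not ciforth's code. It is a minimal sketch assuming LF line endings and READ-FILE from the FILE word set, and all names (/BUF, BUF, FILL#, FID, LINE-XT and the words built on them) are made up for illustration.

    ```forth
    \ Sketch only: carve LF-terminated lines out of a refillable buffer.
    4096 constant /buf                 \ buffer size (assumed)
    create buf /buf chars allot
    variable fill#                     \ characters currently held in buf
    variable fid                       \ file being read
    variable line-xt                   \ xt ( c-addr u -- ) run once per line

    : lf-index ( c-addr u -- i true | false )  \ offset of the first LF
      0 ?do  dup i chars + c@ 10 = if  drop i true unloop exit  then  loop
      drop false ;

    : refill-buf ( -- n )              \ append file data; n = chars read
      buf fill# @ chars +  /buf fill# @ -  fid @ read-file throw
      dup fill# +! ;

    : carve-lines ( -- )
      buf fill# @                      ( a u )
      begin  2dup lf-index  while      ( a u i )
        >r  over r@ line-xt @ execute  \ hand the line over, without its LF
        r> 1+ /string                  \ step past the line and its LF
      repeat                           ( a u )  \ remainder holds no LF
      dup fill# !  buf swap chars move ;  \ copy the remainder to the start

    : process-file ( xt fileid -- )
      fid !  line-xt !  0 fill# !
      begin  refill-buf  while         \ n = 0 means end of file
        carve-lines
        fill# @ /buf = abort" line longer than buffer"
      repeat
      \ the action at end of file: a last line without a terminator
      fill# @ if  buf fill# @ line-xt @ execute  then ;
    ```

    Only the unprocessed tail is ever copied, and the final test after the loop is the extra step needed when the file does not end in a line terminator.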

