Forum: War Ensemble BBS

Idiomatic way to read a word of text from a file?

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Nov 17 15:25:20 2025

From Newsgroup: comp.lang.forth

I'm playing with the idea of writing a Roff-like text formatter in
Forth. The input is lines of text "blah blech, and this that the
other...". The text lines can be arbitrarily long so I don't want to
read the entire line into a memory buffer using something like REFILL.

Let's say I don't have to worry about individual words overflowing
memory though (segfault is not allowed, but it's ok to panic and quit).
So the main loop will be to copy an input word to the output buffer and
maybe flush the output buffer. The output buffer can be of fixed size.

Also, some input lines will be formatting commands like ".i\n" (change
font to italic). Those lines should be given to the Forth text
interpreter.

I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.
--- Synchronet 3.21a-Linux NewsLink 1.2

From dxf@dxforth@gmail.com to comp.lang.forth on Tue Nov 18 13:21:48 2025

On 18/11/2025 10:25 am, Paul Rubin wrote:

...
I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy.

Less messy when someone has already done it. See BFILE (buffered files)
and its documentation SFILE.TXT included in the zip below.

FILESYS2.ZIP

https://drive.google.com/drive/folders/1kh2WcPUc3hQpLcz7TQ-YQiowrozvxfGw

Some assembly required.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Nov 18 07:54:14 2025

Paul Rubin <no.email@nospam.invalid> writes:

I'm playing with the idea of writing a Roff-like text formatter in
Forth. The input is lines of text "blah blech, and this that the
other...". The text lines can be arbitrarily long so I don't want to
read the entire line into a memory buffer using something like REFILL.

Why not? For something like Roff (or TeX or Markdown etc.) the whole
input file easily fits into RAM, so a line would fit, too. The
question is if the Forth system supports long lines in REFILL.

To test this, I wrote the following program

.( : x dup . cr source nip <> if ." wrong length: " source . drop bye then ; ) cr

: gen ( n -- )
    60 swap 0 ?do
        \ 8 chars for the number + 2 for " x" + n spaces = n+10 chars per line
        dup 10 + 8 .r ."  x" dup spaces cr
    2* loop
    drop ;

20 gen bye

And then generated a file /tmp/long-lines.4th as follows:

gforth ./gen-long-lines.4th >/tmp/long-lines.4th

The longest line is more than 31M characters long; a system that can
deal with that probably can deal with any line length for which it can
allocate memory. Testing various systems by loading this file, I see:

Gforth, lxf, and SwiftForth load the whole file.

iforth-5.1-mini returns to the command line after printing 970,
without error message. My guess is that iForth recognized that the
next line is too long for its input buffer and silently called the
command-line interpreter.

vfx64: the last line interpreted is the one with 970 characters, and
an error message "wrong length: 512" is shown. So vfx64 loaded the
first 512 bytes of the 970-byte line, and gave it to the text
interpreter. And then the interpreted program noticed that the line
length is wrong, printed the error message and left the Forth system.

Back to your question, we have at least three Forth systems capable of REFILLing long lines.

Also, some input lines will be formatting commands like ".i\n" (change
font to italic). Those lines should be given to the Forth text
interpreter.

And here's the significance of REFILLing. You could pass everything to
the text interpreter, and install the following recognizer sequence:
First one that recognizes things like ".i\n", and second one that
recognizes everything and then processes the line as ordinary words.

An alternative solution I would use: slurp the whole file into memory,
then process line by line (because of your line-oriented commands): In
each line, check whether something like ".i\n" occurs (with the same
logic that the recognizer would use); if so extract that line and
EVALUATE it; if not, process the line as you do by default. I expect
this alternative to be slightly more work, mainly because the line
processing is not done automatically by the text interpreter.
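
A minimal sketch of that alternative, assuming the file is already in
memory as ( c-addr u ) and that lines end in a single LF. The names LF,
LINE-LEN, NEXT-LINE, HANDLE-LINE, PROCESS-TEXT, the two counters, and the
stand-in command .I are all invented for illustration; only standard words
are used, and the "default processing" is reduced to counting lines.

```forth
\ Sketch only: walk a text already held in memory, line by line.
\ Assumes LF line endings; every name defined here is hypothetical.

10 constant lf

variable #cmds   0 #cmds !       \ command lines seen
variable #text   0 #text !       \ ordinary text lines seen

: .i ( -- )  1 #cmds +! ;        \ stand-in for a real "italic" command

: line-len ( c-addr u -- n )     \ chars before the first LF, or u if none
  tuck 0 ?do                     ( u c-addr )
    dup i + c@ lf = if  2drop i unloop exit  then
  loop drop ;

: next-line ( c-addr u -- c-addr' u' line-addr line-u )
  2dup line-len >r               ( c-addr u ) ( R: n )
  over r@ 2swap                  ( line-addr n c-addr u )
  r> 1+ over min /string         \ skip the line and its LF, clamped at the end
  2swap ;

: handle-line ( c-addr u -- )
  dup if  over c@ [char] . = if  evaluate exit  then  then
  2drop 1 #text +! ;             \ default processing would go here

: process-text ( c-addr u -- )
  begin dup while  next-line handle-line  repeat 2drop ;
```

Run over the three-line text "hello world", ".i", "more text", this leaves
#TEXT at 2 and #CMDS at 1. NEXT-LINE returns the remaining region below the
line, the same stack shape as a GET-LINE style word, so the loop needs no
extra bookkeeping.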

I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.

No, it's not common. If you have too little memory, the idiomatic
solution is to limit the line length (see VFX64, even though it has
enough memory).

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Nov 18 13:05:49 2025

In article <871plws0m7.fsf@nightsong.com>,
Paul Rubin <no.email@nospam.invalid> wrote:

I'm playing with the idea of writing a Roff-like text formatter in
Forth. The input is lines of text "blah blech, and this that the
other...". The text lines can be arbitrarily long so I don't want to
read the entire line into a memory buffer using something like REFILL.

Let's say I don't have to worry about individual words overflowing
memory though (segfault is not allowed, but it's ok to panic and quit).
So the main loop will be to copy an input word to the output buffer and
maybe flush the output buffer. The output buffer can be of fixed size.

Also, some input lines will be formatting commands like ".i\n" (change
font to italic). Those lines should be given to the Forth text
interpreter.

I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.

Going character by character gets you little speed. Mostly
GET-FILE ("slurp-file") is the way to go. Then you can make the file
the current input stream.
SAVE RESTORE SET-SRC and parsing words like PARSE-NAME PARSE are your friends.

EXECUTE-PARSING is similar (but 10 years after SAVE SET-SRC RESTORE):
: EXECUTE-PARSING ROT ROT SAVE SET-SRC CATCH RESTORE THROW ;

\ ----------------------
#!/usr/bin/lina -s
\ Copyright 2015 (c): Albert van der Horst, Dutch Forth Workshop by GPL

\ wc , using all tricks ciforth has to offer, in script style.
\ Usage: wc.script <filenames>

ARGC 1 ?DO
1 ARG[] 2DUP TYPE SPACE
GET-FILE
2DUP 0 >R BEGIN ^J $/ 2DROP OVER WHILE R> 1+ >R REPEAT R> . 2DROP
2DUP SAVE SET-SRC 0 BEGIN NAME NIP WHILE 1+ REPEAT RESTORE .
2DUP . DROP
2DROP CR
SHIFT-ARGS
LOOP
\ ----------------------
(The -s option automatically loads argument handling and control
structures interpretation.)

Example :

albert@sinas2:~/PROJECT/ciriscv$ wc.script [a-h]*.frt
aap.frt 3 10 55
blocks.frt 3712 20819 109460
doit.frt 6 21 106
hello.frt 0 3 19
hellow.frt 1 8 40

See also advance/lispl.frt

https://github.com/albertvanderhorst/forthlisp

The definition of TOKEN that replaces NAME that is my WORD.

The transformation generates a lisp turnkey.

Groetjes Albert
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Nov 18 13:50:29 2025

In article <2025Nov18.085414@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
<SNIP>

No, it's not common. If you have too little memory, the idiomatic
solution is to limit the line length (see VFX64, even though it has
enough memory).

In ciforth the only restriction is that a word fits in the input
buffer. In the context of nroff it makes no sense to use REFILL.
There is no REFILL in the ciforth kernel.
QUIT itself cuts the input into lines, without copying:
`` REMAINDER 2@ ^J $/ '' giving a string variable ( addr len ).
If you parse yourself for e.g nroff you don't need to do that.
Even in the MSDOS version you can parse files, restricted
only in the length of a word (64 chars).
I consider the ^J / ^M as blank space.

Groetjes Albert

From Gerry Jackson@do-not-use@swldwa.uk to comp.lang.forth on Tue Nov 18 12:56:15 2025

On 17/11/2025 23:25, Paul Rubin wrote:

I'm playing with the idea of writing a Roff-like text formatter in
Forth. The input is lines of text "blah blech, and this that the
other...". The text lines can be arbitrarily long so I don't want to
read the entire line into a memory buffer using something like REFILL.

Let's say I don't have to worry about individual words overflowing
memory though (segfault is not allowed, but it's ok to panic and quit).
So the main loop will be to copy an input word to the output buffer and
maybe flush the output buffer. The output buffer can be of fixed size.

Also, some input lines will be formatting commands like ".i\n" (change
font to italic). Those lines should be given to the Forth text
interpreter.

I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.

Have you seen the Sam Falvo video at: https://www.youtube.com/watch?v=mvrE2ZGe-rs

where, from memory so it might be inaccurate in detail, he demonstrates
the development of a text preprocessor to convert items like ~bw to html <bold>

He slurps a text file for conversion into dataspace 4K bytes at a time,
using HERE as the start address, and using READ-FILE until READ-FILE
returns 0 bytes read. Then using HERE again calculates the size of the
data slurped in. So no buffer allocation is needed but, of course, a
really big file might run out of dataspace.
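
The slurping step described here might look like the sketch below. It is
reconstructed from the description, not taken from the video; CHUNK and
SLURP-AT-HERE are made-up names, and the UNUSED check is an addition so that
running out of dataspace aborts instead of overwriting memory.

```forth
\ Sketch only: read a whole file into data space, one chunk at a time.
\ CHUNK and SLURP-AT-HERE are hypothetical names.

4096 constant chunk

: slurp-at-here ( fileid -- c-addr u )
  >r here                          ( start ) ( R: fileid )
  begin
    unused chunk u< abort" out of data space"
    here chunk r@ read-file throw  ( start n )
    dup allot                      \ keep what was read; HERE moves past it
  0= until                         \ a read of 0 chars means end of file
  r> drop
  here over - ;                    ( start u )
```

Because the data is ALLOTted, HERE is usually left unaligned afterwards, so
an ALIGN is needed before compiling cells behind it.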

Also, what might be of interest is the way he develops an operator ===>
(I think) that maps ~bw into <bold> by writing:
~bw ===> <bold>
similarly for all other conversions to HTML. ISTR that ===> is
non-standard as it uses return stack manipulation.
--
Gerry

From peter@peter.noreply@tin.it to comp.lang.forth on Tue Nov 18 22:54:03 2025

On Mon, 17 Nov 2025 15:25:20 -0800
Paul Rubin <no.email@nospam.invalid> wrote:

I'm playing with the idea of writing a Roff-like text formatter in
Forth. The input is lines of text "blah blech, and this that the
other...". The text lines can be arbitrarily long so I don't want to
read the entire line into a memory buffer using something like REFILL.

Let's say I don't have to worry about individual words overflowing
memory though (segfault is not allowed, but it's ok to panic and quit).
So the main loop will be to copy an input word to the output buffer and
maybe flush the output buffer. The output buffer can be of fixed size.

Also, some input lines will be formatting commands like ".i\n" (change
font to italic). Those lines should be given to the Forth text
interpreter.

I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.

In lxf and lxf64 I have the following words to support processing files

MAP-FILE ( addr len fam -- a2 l2 ior )
UNMAP-FILE ( a2 l2 -- ior )
maps a file into memory, fam is r/o or r/w

GET-LINE ( a1 l1 -- a1 l3 a2 l2 )
GET-WORD ( a1 l1 -- a1 l3 a2 l2 )
takes a memory region, returns the first line/word on top of stack
and remaining region below it.

Here is a simple example to count lines and words in a file

\ Process a file

variable #words
variable #lines

: process-line ( a l -- )
    1 #lines +!
    begin
        dup while
        get-word 2drop 1 #words +!
    repeat
    2drop ;

: process-file ( a l -- )
    r/o map-file throw
    0 #words ! 0 #lines !
    2dup
    begin
        dup while
        get-line process-line
    repeat
    2drop
    unmap-file throw
    ." the file has " #lines @ .
    ." lines and " #words @ .
    ." words!" ;

As Anton has already noted, lxf uses this internally for parsing source files.
Mapping the file uses memory regions outside the current process.
It is like first allocating memory and then reading in the entire file.

BR
Peter

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Tue Nov 18 18:02:24 2025

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Why not? For something like Roff (or TeX or Markdown etc.) the whole
input file easily fits into RAM, so a line would fit, too. The
question is if the Forth system supports long lines in REFILL.

The target processor might not have that much ram. Back in school I
used a version of Roff written for CP/M, which I ported to Turbo C for a
PC-XT clone. On the other hand, none of my input files had very long
lines. I was thinking of accommodating modern wysiwyg editors which
don't have line breaks except at the end of paragraphs. Maybe that's
not worthwhile.

One obvious approach is to use READ-LINE, but this unfortunately seems
to throw away the newline at the end of the line read, so it's hard to
tell if a complete line has been read, or if the buffer has simply
gotten full. Testing with gforth, if the buffer size is exactly the
line length, then FILE-POSITION points to just after the line, and the
next call to READ-LINE returns 0 chars. No idea about other Forths.

And here's the significance of REFILLing. You could pass everything to
the text interpreter, and install the following recognizer sequence:
First one that recognizes things like ".i\n", and second one that
recognizes everything and then processes the line as ordinary words.

I'll see if I can figure out how to do that, though the target Forth
might not have recognizers. What I wanted is a loop like
  LOOP
    READ a line;
    IF line begins with ".", then pass the line to the text interpreter;
    ELSE loop through the words on the line, copying them to the output
         buffer or maybe to the output device
  END LOOP

Getting words from the line should preferably use Forth's built-in
parser.

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Tue Nov 18 18:24:49 2025

peter <peter.noreply@tin.it> writes:

As Anton has already noted, lxf uses this internally for parsing source files.
Mapping the file uses memory regions outside the current process.
It is like first allocating memory and then reading in the entire file.

Yes mmap is a virtual memory thing though. My current thought is to use
READ-LINE with some fixed buffer size, but have two contiguous buffers
and call READ-LINE twice, to handle the case where a word is split
across two buffers. Then process all words (whitespace terminated)
until the last whitespace is found. Anything left gets copied back
to the beginning of the double buffer, before calling READ-LINE again.
It's not worth bothering with true double buffering.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Nov 19 07:27:22 2025

Paul Rubin <no.email@nospam.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Why not? For something like Roff (or TeX or Markdown etc.) the whole
input file easily fits into RAM, so a line would fit, too. The
question is if the Forth system supports long lines in REFILL.

The target processor might not have that much ram.

"Might"? If you have a concrete target with small RAM (say, one of
the Mecrisp targets), it will come with additional restrictions, but
maybe also additional capabilities (like accessing the input file
directly in its flash storage), and all of this might change how you
approach the problem.

I was thinking of accommodating modern wysiwyg editors which
don't have line breaks except at the end of paragraphs. Maybe that's
not worthwhile.

If you have a machine where you run a WYSIWYG editor, you also have
enough RAM for keeping one line (and probably the whole text). The
systems with small RAM memory tend to have only a line editor (with
80-char lines), or maybe a screen editor (with 1KB screens).

One obvious approach is to use READ-LINE, but this unfortunately seems
to throw away the newline at the end of the line read, so it's hard to
tell if a complete line has been read, or if the buffer has simply
gotten full.

<https://forth-standard.org/standard/file/READ-LINE> says:
|When u1 = u2 the line terminator has yet to be reached.

Testing with gforth, if the buffer size is exactly the
line length, then FILE-POSITION points to just after the line, and the
next call to READ-LINE returns 0 chars. No idea about other Forths.

With u1=line length, the only way to satisfy the requirement above is
to deliver it as two parts, one with u2=u1, the other with u2=0.
READ-LINE and the case where u1=line length have been discussed
several times, so apparently it's not so clear to some how a system
should behave, so you may want to check the system you use, and report
a bug to the system implementor if it does not behave correctly.
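
Under that reading of the standard, a caller can recover a line of any
length by reading pieces until one comes back shorter than the buffer. The
following is only a sketch and assumes a conforming READ-LINE; CHUNK, BUF,
and LONG-LINE are hypothetical names, and the pieces are merely counted
where a real program would process them.

```forth
\ Sketch only: measure each line with a small fixed buffer.
\ Relies on "u2 = u1 means the terminator has yet to be reached".

16 constant chunk                  \ deliberately small
create buf chunk 2 + allot         \ READ-LINE wants room for u1+2 chars

: long-line ( fileid -- u flag )   \ u = full line length; flag false at EOF
  >r 0                             ( total ) ( R: fileid )
  begin
    buf chunk r@ read-line throw   ( total u2 flag )
    0= if                          \ started at end of file
      drop r> drop  dup 0<> exit   \ an unterminated last line still counts
    then
    tuck + swap                    ( total' u2 )
  chunk <> until                   \ a short piece ends the line
  r> drop true ;
```

A line whose length is an exact multiple of CHUNK ends with an empty piece
(u2 = 0), which is the two-part delivery described above. On a system that
instead swallows the terminator in that case, this loop would merge the line
with the next one.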

As for FILE-POSITION, what I see in Gforth is (output after "\"):

s" /tmp/long-lines.4th" r/o open-file throw constant f \ ok
pad 74 f read-line throw . . \ -1 74 ok
f file-position throw ud. \ 74 ok
pad 70 f read-line throw . . \ -1 0 ok
f file-position throw ud. \ 75 ok

That's with a file with one-byte newlines.

And here's the significance of REFILLing. You could pass everything to
the text interpreter, and install the following recognizer sequence:
First one that recognizes things like ".i\n", and second one that
recognizes everything and then processes the line as ordinary words.

I'll see if I can figure out how to do that, though the target Forth
might not have recognizers. What I wanted is a loop like
  LOOP
    READ a line;
    IF line begins with ".", then pass the line to the text interpreter;
    ELSE loop through the words on the line, copying them to the output
         buffer or maybe to the output device
  END LOOP

If you rely on REFILL (but then you have to INCLUDE the file, and have
an executable word at its start), you could implement that as:

: process-line ( -- )
    begin
        source nip >in @ u> while
        parse-name type \ or whatever you want to do with words
    repeat ;

: roff ( -- ) \ untested
    begin
        refill while
        source if
            c@ '.' = if
                \ control-flow stack here: orig(WHILE) dest(BEGIN) orig(IF) orig(IF)
                \ so 2 CS-PICK copies the BEGIN dest for AGAIN
                source evaluate [ 2 cs-pick ] again then
        else
            drop then
        process-line
    repeat ;

The alternative with the recognizers avoids the need to say ROFF at
the start of the file. Or you have a Forth system that implements EXECUTE-PARSING-FILE.

Alternatively, you could go for using READ-LINE. In that case I would
treat too-long lines as errors, and the result would look as follows:

80 constant line-length \ however long lines you have space to process
create line line-length 2 + allot

: process-line ( c-addr u -- )
... \ without PARSE-NAME support unless you use EXECUTE-PARSING
;

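: roff {: file-id -- :}
    begin
        line line-length file-id read-line throw while
        dup line-length >= abort" line too long"
        dup if
            line c@ '.' = if
                \ 2 CS-PICK copies the BEGIN dest from under the two IF origs
                line swap evaluate [ 2 cs-pick ] again then
        then
        line swap process-line
    repeat
    drop ; \ discard the count left by the final READ-LINE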

Getting words from the line should preferably use Forth's built-in
parser.

That means going through INCLUDE, EVALUATE, EXECUTE-PARSING, or EXECUTE-PARSING-FILE at some point.

- anton

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Nov 19 08:45:18 2025

Paul Rubin <no.email@nospam.invalid> writes:

My current thought is to use
READ-LINE with some fixed buffer size, but have two contiguous buffers
and call READ-LINE twice, to handle the case where a word is split
across two buffers. Then process all words (whitespace terminated)
until the last whitespace is found. Anything left gets copied back
to the beginning of the double buffer, before calling READ-LINE again.

Given your original requirement of having a buffer big enough for a
word, but not for a line, I would not bother with READ-LINE.

If you don't mind slow speed, one way to go (as used by some Forth
systems for implementing READ-LINE) is to fill the buffer with
READ-FILE, find the start of a word, REPOSITION-FILE to that start,
READ-FILE the buffer, and you have your word in the buffer.
REPOSITION-FILE at the end of the word, and then repeat for the next
word.

A more efficient approach is to READ-FILE into a buffer of size >=1
(the larger, the more efficient), and copy from that buffer into a
word buffer when a word is found. After delivering a word, you have
to remember where in the buffer you were, and continue from there.
Whenever you reach the end of the buffer, refill it with READ-FILE.

In either case, when you find a newline (and you have to know how
newlines can be represented), check the next character after the
newline for ".", and deal with that. But "pass the line to the text
interpreter" does not work if you have no space for a buffer that
contains a whole line.
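
The more efficient approach could be sketched as follows. This is an
illustration under stated assumptions, not code from any particular system:
all names (/INBUF, /WORD, INBUF, WORDBUF, IN-LEN, IN-POS, IN-FILE,
NEXT-CHAR, BLANK?, NEXT-WORD) are made up, and newlines are treated as
ordinary whitespace, so the check for "." after a newline is left out.

```forth
\ Sketch only: deliver whitespace-delimited words from a file through a
\ fixed READ-FILE buffer. All names here are hypothetical.

256 constant /inbuf                \ input buffer size; any size >= 1 works
 64 constant /word                 \ longest word accepted
create inbuf   /inbuf allot
create wordbuf /word  allot
variable in-len   0 in-len !       \ valid chars in INBUF
variable in-pos   0 in-pos !       \ index of the next char to deliver
variable in-file                   \ fileid being read

: next-char ( -- char true | false )
  in-pos @ in-len @ = if           \ buffer used up: refill it
    inbuf /inbuf in-file @ read-file throw
    dup in-len !  0 in-pos !
    0= if  false exit  then        \ nothing read: end of file
  then
  inbuf in-pos @ + c@  1 in-pos +!  true ;

: blank? ( char -- flag )  bl 1+ u< ;   \ space and control characters

: next-word ( -- c-addr u )        \ u = 0 means end of file
  begin                            \ skip leading whitespace
    next-char 0= if  wordbuf 0 exit  then
    dup blank?
  while drop repeat                ( char )
  0                                ( char n )
  begin
    dup /word = abort" word too long"
    tuck wordbuf + c!  1+          ( n+1 )
    next-char
  while                            ( n char )
    dup blank? if  drop wordbuf swap exit  then
    swap
  repeat
  wordbuf swap ;
```

After storing a fileid in IN-FILE, repeated calls to NEXT-WORD return one
word at a time in WORDBUF. The position inside INBUF is remembered in IN-POS
between calls, so a word may straddle two READ-FILE calls without special
handling.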

- anton

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Wed Nov 19 11:22:08 2025

In article <87o6oyrc7i.fsf@nightsong.com>,
Paul Rubin <no.email@nospam.invalid> wrote:

peter <peter.noreply@tin.it> writes:

As Anton has already noted, lxf uses this internally for parsing source files.
Mapping the file uses memory regions outside the current process.
It is like first allocating memory and then reading in the entire file.

Yes mmap is a virtual memory thing though. My current thought is to use
READ-LINE with some fixed buffer size, but have two contiguous buffers
and call READ-LINE twice, to handle the case where a word is split
across two buffers. Then process all words (whitespace terminated)
until the last whitespace is found. Anything left gets copied back
to the beginning of the double buffer, before calling READ-LINE again.
It's not worth bothering with true double buffering.

The ciforth approach is more sensible. The terminal input buffer is
filled from the input stream. It is large, say 16K.
Now you carve lines out of the buffer, and use a parse pointer,
maybe >IN. As soon as you find that there are no more line endings
in the remaining buffer, you copy the remainder to the start of
the buffer and fill the buffer to the brim.
In this case you will not copy more than is necessary.
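
A sketch of this scheme in standard Forth, not ciforth's actual code: the
names /BUF, BUF, USED, POS, FID, COMPACT, FILL-UP, FIND-LF, and GET-LINE are
invented here, LF line endings are assumed, and the buffer is kept tiny only
to make the refill path easy to exercise.

```forth
\ Sketch only: carve lines out of one buffer, moving the unread tail to
\ the front before topping the buffer up. All names are hypothetical.

64 constant /buf                   \ the post suggests something like 16K
create buf /buf allot
variable used   0 used !           \ valid chars in BUF
variable pos    0 pos !            \ parse pointer: start of the unread part
variable fid                       \ fileid being read

: compact ( -- )                   \ move the unread tail to the start
  buf pos @ +  buf  used @ pos @ -  move
  pos @ negate used +!  0 pos ! ;

: fill-up ( -- n )                 \ fill to the brim; n = chars added
  buf used @ +  /buf used @ -  fid @ read-file throw
  dup used +! ;

: find-lf ( -- index true | false ) \ look for LF (10) in the unread part
  used @ pos @ ?do
    buf i + c@ 10 = if  i true unloop exit  then
  loop false ;

: get-line ( -- c-addr u true | false )
  begin
    find-lf if                     ( lf-index )
      buf pos @ +  swap pos @ -    ( c-addr u )
      dup 1+ pos +!  true exit     \ step over the line and its LF
    then
    compact fill-up 0=             \ no LF left: make room, read more
  until
  \ nothing more could be read: hand out whatever is left, if anything
  used @ pos @ - dup if
    buf pos @ + swap  used @ pos !  true
  else  drop false  then ;
```

The string returned by GET-LINE is only valid until the next call, since
COMPACT may move it. A line longer than the buffer comes back in pieces,
which is where a larger buffer, or a limit on line length, comes in.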

Groetjes Albert

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Nov 19 17:43:19 2025

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

"Might"? If you have a concrete target with small RAM (say, one of
the Mecrisp targets),

I was thinking Fuzix (fuzix.org) doesn't have a Roff or a Forth that I
know of, but it might be an interesting target. It runs on the
Raspberry Pi Pico (256K of ram and 2MB of flash) which is enough for
most Forth purposes.

If you have a machine where you run a WYSIWYG editor, you also have
enough RAM for keeping one line (and probably the whole text).

The idea is to format files that came from other people and other
machines. Though that's maybe kind of dumb because nobody cares about
Roff any more. It interests me because I used to use it for my school
papers. IDK if I have any of those files around any more though.

READ-LINE and the case where u1=line length have been discussed
several times, so apparently it's not so clear to some how a system
should behave,

That sounds like a deficiency in the standard, but anyway yes, there are
ways to get around it.

Alternatively, you could go for using READ-LINE. In that case I would
treat too-long lines as errors

Yeah I think that's the sanest approach. Another idea is to abandon
Roff syntax altogether, and go for an HTML subset or similar. I hate
Markdown but that's yet another possibility.

That means going through INCLUDE, EVALUATE, EXECUTE-PARSING, or EXECUTE-PARSING-FILE at some point.

I see. That can still be ok of course. It's in the Forth spirit, I
think, to just do something hacky to detect whether the token read by
PARSE-NAME is at the beginning of a line.

From dxf@dxforth@gmail.com to comp.lang.forth on Thu Nov 20 19:03:11 2025

On 20/11/2025 12:43 pm, Paul Rubin wrote:

...

READ-LINE and the case where u1=line length have been discussed
several times, so apparently it's not so clear to some how a system
should behave,

That sounds like a deficiency in the standard, but anyway yes, there are
ways to get around it.

When you say 'get around it', do you mean a broken line? If so, what can
be done with a broken line? AFAIK not much. I replaced 'flag' in
READ-LINE with a trinary and I'm now looking for uses :-)

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Nov 20 12:49:32 2025

In article <691ecb3e$1@news.ausics.net>, dxf <dxforth@gmail.com> wrote:

On 20/11/2025 12:43 pm, Paul Rubin wrote:

...

READ-LINE and the case where u1=line length have been discussed
several times, so apparently it's not so clear to some how a system
should behave,

That sounds like a deficiency in the standard, but anyway yes, there are
ways to get around it.

When you say 'get around it', do you mean a broken line? If so, what can
be done with a broken line? AFAIK not much. I replaced 'flag' in
READ-LINE with a trinary and I'm now looking for uses :-)

The TIB in ciforth is huge, 1) and is read full and used as a buffer
in the context of redirecting.
Then split into lines (as long as you fancy doing this), and
if there is no more newline, copy the remainder to the start of
the buffer, and fill again. REFILL-TIB in this context is hardly
more complicated than a classic REFILL.
QUIT uses (ACCEPT) to cut up the input buffer, but since this
is an indirect threaded Forth (ACCEPT) can be revectored.
It is easy to add a kludge to detect `` ^J.i ''.

1) It doesn't matter if it is small, actually.

Groetjes Albert

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Thu Nov 20 14:53:12 2025

dxf <dxforth@gmail.com> writes:

When you say 'get around it', do you mean a broken line?

The trouble distinguishing between a broken and an unbroken line when
u1=line length. I think for roff though, it's ok to abandon the wish to
handle arbitrarily long lines. The deficiency in the standard is not
explaining READ-LINE's exact behaviour in this situation. I was able to
experimentally resolve the issue in gforth, but other Forths might vary.

I had for a while liked the idea of running the entire input document
through the text interpreter (wordlists or some other scheme would stop
non-formatting-commands from being looked up as Forth words). But I
later mostly lost interest in that.

I guess in the text interpreter, the interpreter is at the beginning of
a line iff >IN @ gives 0, and maybe Roff could use that.

From dxf@dxforth@gmail.com to comp.lang.forth on Fri Nov 21 13:33:51 2025

On 21/11/2025 9:53 am, Paul Rubin wrote:

dxf <dxforth@gmail.com> writes:

When you say 'get around it', do you mean a broken line?

The trouble distinguishing between a broken and an unbroken line when
u1=line length. I think for roff though, it's ok to abandon the wish to
handle arbitrarily long lines. The deficiency in the standard is not
explaining READ-LINE's exact behaviour in this situation. I was able to
experimentally resolve the issue in gforth, but other Forths might vary.

My sense is anyone who uses READ-LINE (or its equivalent in other langs) is
doing so on the basis that the line read is complete, i.e. arbitrarily long
lines are not considered - and if one must - then use some other strategy.

I had for a while liked the idea of running the entire input document
through the text interpreter (wordlists or some other scheme would stop
non-formatting-commands from being looked up as Forth words). But I
later mostly lost interest in that.

I guess in the text interpreter, the interpreter is at the beginning of
a line iff >IN @ gives 0, and maybe Roff could use that.

I seem to have enough tools and incentives to not use the text interpreter.
OTOH if it's there and doesn't get in the way, I can understand why people do.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Nov 21 08:15:52 2025

Paul Rubin <no.email@nospam.invalid> writes:

dxf <dxforth@gmail.com> writes:

When you say 'get around it', do you mean a broken line?

The trouble distinguishing between a broken and an unbroken line when
u1=line length.

There is no such trouble in standard systems. Such a line will be
broken on such a system.

The deficiency in the standard is not
explaining READ-LINE's exact behaviour in this situation.

The behaviour is specified exactly:

|If a line terminator was received before u1 characters were read, then
|u2 is the number of characters, not including the line terminator,
|actually read [...]. When u1 = u2 the line terminator has
|yet to be reached.

So the first sentence tells you what happens if line length < u1. And
the second sentence tells you what happens if line length >= u1.

The deficiency in the standard is in the part that I elided: It says:
(0<=u2<=u1). It does not really contradict that text, but it has
misled a number of people (including you) into thinking that the first
sentence also includes line length = u1. And the many questions about
this issue show this deficiency.

- anton

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Nov 21 08:59:29 2025

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

The behaviour is specified exactly:

|If a line terminator was received before u1 characters were read, then
|u2 is the number of characters, not including the line terminator,
|actually read [...]. When u1 = u2 the line terminator has
|yet to be reached.

So the first sentence tells you what happens if line length < u1. And
the second sentence tells you what happens if line length >= u1.

The deficiency in the standard is in the part that I elided: It says:
(0<=u2<=u1). It does not really contradict that text, but it has
misled a number of people (including you) into thinking that the first
sentence also includes line length = u1. And the many questions about
this issue show this deficiency.

BTW, this has all been spelled out in

<https://forth-standard.org/standard/file/READ-LINE#contribution-216>

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
--- Synchronet 3.21a-Linux NewsLink 1.2

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Fri Nov 21 11:35:02 2025

From Newsgroup: comp.lang.forth

In article <2025Nov21.091552@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

Paul Rubin <no.email@nospam.invalid> writes:

dxf <dxforth@gmail.com> writes:

When you say 'get around it', do you mean a broken line?

The trouble distinguishing between a broken and an unbroken line when
u1=line length.

There is no such trouble in standard systems. Such a line will be
broken on such a system.

The kernel system is of low quality if it insists on the language
design standards of the 1970s, as frozen in CORE.

QUIT uses (ACCEPT) in ciforth:
( -- sc )
Accept characters from the terminal, until a RET is received and
return the result as a constant string sc. It doesn't contain any
line ending, but the buffer still does and after 1+ the string ends
in a LF. The editing functions are the same as with ACCEPT .
The buffer (and the resulting string) is limited to 16 K characters.

[Note that this documentation fits the exact system described; an MS-DOS
ciforth shows a 64 char limit and 2 + to contain the string's end.]

With this and EXECUTE-PARSING or equivalent functionality the
problems as sketched vanished.
(ACCEPT) is as it were a redesigned READ-LINE leaning on the concept
of a string constant ( addr len -- ) where the area that is passed
is non-writable.

The deficiency in the standard is not
explaining READ-LINE's exact behaviour in this situation.

The behaviour is specified exactly:

|If a line terminator was received before u1 characters were read, then
|u2 is the number of characters, not including the line terminator,
|actually read [...]. When u1 = u2 the line terminator has
|yet to be reached.

The problem is that it builds on the 1970s CORE and ideas. It is awkward.

<SNIP>

- anton

Groetjes Albert
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.
--- Synchronet 3.21a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Nov 21 11:20:33 2025

From Newsgroup: comp.lang.forth

albert@spenarnc.xs4all.nl writes:

In article <2025Nov21.091552@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

The behaviour is specified exactly:

|If a line terminator was received before u1 characters were read, then
|u2 is the number of characters, not including the line terminator,
|actually read [...]. When u1 = u2 the line terminator has
|yet to be reached.

The problem is that it builds on the 1970s CORE and ideas. It is awkward.

The complexity of READ-LINE's specification comes from the following requirements:

* Do not ALLOCATE (ALLOCATE is not guaranteed to be present, and
probably is not present on small systems), so the caller of
READ-LINE has to pass in a buffer description.

* Support arbitrarily long lines (that's a post-1970s attitude, BTW).

* Report the end of the file.

Did I forget a requirement?

If the first requirement was dropped, the specification could become
much simpler, e.g.,

READ-ALLOC-LINE ( file-id -- c-addr u ior )

If an error happens during the operation, return 0 0 n!=0; in that
case the file position after the operation may be anywhere between
the start of the line and end of the line (both included).

If READ-ALLOC-LINE is called when the file position is at the end of
the file, return 0 0 0.

Otherwise, c-addr u describes the contents of the line (without
terminator, even if there is one) and ior is 0. The line lives in
ALLOCATEd data space and the caller of READ-ALLOC-LINE is
responsible for FREEing it. After READ-ALLOC-LINE, the file
position points at the start of the next line.

READ-ALLOC-LINE can be implemented using READ-LINE, ALLOCATE and
RESIZE. I fail to come up with a use of READ-LINE with fixed-size
buffers and arbitrarily long lines that cannot be implemented just as
well with READ-FILE and knowledge of which bytes may represent
newlines.
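
As a rough illustration of the claim that READ-ALLOC-LINE can be built from READ-LINE, ALLOCATE and RESIZE, here is one possible sketch in standard Forth. It is only a sketch under assumptions: STEP, RAL-FAIL and the RAL- variables are made-up helpers, the word is not reentrant, and the code has not been tested against any particular system.

```forth
\ Sketch only. STEP, RAL-FAIL and the RAL- variables are made-up helpers.
64 constant step                     \ growth increment, also u1 for READ-LINE
variable ral-fid   variable ral-buf   variable ral-len

: ral-fail ( ior -- 0 0 ior )  ral-buf @ free drop  0 0 rot ;

: read-alloc-line ( file-id -- c-addr u ior )
  ral-fid !  0 ral-len !
  step 2 + chars allocate  ?dup if nip 0 0 rot exit then
  ral-buf !
  begin
    ral-buf @ ral-len @ chars +  step  ral-fid @ read-line  ( u2 flag ior )
    ?dup if nip nip ral-fail exit then                      ( u2 flag )
    0= if                            \ end of file, u2 is 0
      drop
      \ keep an unterminated last line; otherwise free and return 0 0 0
      ral-len @ if  ral-buf @ ral-len @ 0  else  0 ral-fail  then
      exit
    then
    dup ral-len +!                   ( u2 )
    step < if  ral-buf @ ral-len @ 0 exit  then   \ terminator reached
    \ u2 = u1: the line continues, so make room for another chunk
    ral-buf @  ral-len @ step + 2 + chars  resize
    ?dup if nip ral-fail exit then
    ral-buf !
  again ;
```

An empty line comes back as a non-zero c-addr with u = 0, which keeps it distinct from the 0 0 0 returned at the end of the file. The caller still has to FREE every non-zero c-addr.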

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
--- Synchronet 3.21a-Linux NewsLink 1.2

From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Fri Nov 21 18:57:53 2025

From Newsgroup: comp.lang.forth

On 18-11-2025 00:25, Paul Rubin wrote:

I'm playing with the idea of writing a Roff-like text formatter in
Forth. The input is lines of text "blah blech, and this that the
other...". The text lines can be arbitrarily long so I don't want to
read the entire line into a memory buffer using something like REFILL.

Let's say I don't have to worry about individual words overflowing
memory though (segfault is not allowed, but it's ok to panic and quit).
So the main loop will be to copy an input word to the output buffer and
maybe flush the output buffer. The output buffer can be of fixed size.

Also, some input lines will be formatting commands like ".i\n" (change
font to italic). Those lines should be given to the Forth text
interpreter.

I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.

I don't know if it's "idiomatic", but it works. In essence, it reads the
file binary. If there is something left at the end of the buffer, it
copies that to the start, adjusts the buffer address and size and
continues. It doesn't return a word per call, you open the file and it
applies a quotation to each word parsed (a n --).

No, it's not beautiful, but it works. BTW, if you happen to be German
and your prose contains words that exceed 256 characters, you're on your
own.

Hans Bezemer

---8<---
256 constant /line

/line buffer: linebuf

: eow?
case
bl of true endof
9 of true endof
10 of true endof
13 of true endof
false swap
endcase
;
\ correct for last word
: ?lastword over 0= if linebuf swap chars + + else drop then over - ;
: -leading begin dup while over c@ bl = while 1 /string repeat then ;
\ get a word
: get-word ( addr1 n2 -- addr2 n2 f)
>r over 0 2swap bounds ?do i c@ eow? if i + leave then loop
r> over >r ?lastword r> \ word is delimited by white space
;

: parse-line ( xt a n --)
dup >r begin \ save length
2dup r@ -rot 2>r get-word ( xt a1 n1 a2 n2 n3)
while \ if we read a complete word
>r over r@ swap execute
r> 2r> rot 1+ /string \ execute the action
repeat 2rdrop rdrop \ adjust the buffer
;

: open-txt s" netstrng.4th" r/o bin open-file abort" Cannot open 'myfile.txt'" ;
: adjust >r linebuf r@ cmove linebuf /line r> /string ;
: close-txt close-file abort" Cannot close 'myfile.txt'" ;

: parse-file ( h xt -- h)
swap >r linebuf /line \ put xt on execution stack
begin
r@ read-file 0= over 0<> and \ read the file buffer
while \ if not an empty line
linebuf swap parse-line adjust \ parse line and adjust buffer
repeat drop drop r> \ return handle
;

open-txt [: -leading -trailing type cr ;] parse-file close-txt
---8<---

--- Synchronet 3.21a-Linux NewsLink 1.2

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sat Nov 22 09:25:35 2025

From Newsgroup: comp.lang.forth

In article <nnd$6ed8c839$50daf997@634f347ff0045267>,
Hans Bezemer <the.beez.speaks@gmail.com> wrote:

On 18-11-2025 00:25, Paul Rubin wrote:

I'm playing with the idea of writing a Roff-like text formatter in
Forth. The input is lines of text "blah blech, and this that the
other...". The text lines can be arbitrarily long so I don't want to
read the entire line into a memory buffer using something like REFILL.

Let's say I don't have to worry about individual words overflowing
memory though (segfault is not allowed, but it's ok to panic and quit).
So the main loop will be to copy an input word to the output buffer and
maybe flush the output buffer. The output buffer can be of fixed size.

Also, some input lines will be formatting commands like ".i\n" (change
font to italic). Those lines should be given to the Forth text
interpreter.

I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.

I don't know if it's "idiomatic", but it works. In essence, it reads the
file binary. If there is something left at the end of the buffer, it
copies that to the start, adjusts the buffer address and size and
continues. It doesn't return a word per call, you open the file and it
applies a quotation to each word parsed (a n --).

That is exactly how it works in ciforth too for file redirection.
(Reading from the console just fills the TIB.)

No, it's not beautiful, but it works. BTW, if you happen to be German
and your prose contains words that exceed 256 characters, you're on your
own.

Not beautiful?

Hans Bezemer

Groetjes Albert
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.
--- Synchronet 3.21a-Linux NewsLink 1.2

From dxf@dxforth@gmail.com to comp.lang.forth on Sun Nov 23 00:28:38 2025

From Newsgroup: comp.lang.forth

On 22/11/2025 4:57 am, Hans Bezemer wrote:

On 18-11-2025 00:25, Paul Rubin wrote:

...
I guess I could use the FILE word set to write something like getc()
with its own buffering, but that seems messy. I'm wondering if this is
a common situation and there's an idiomatic solution.

I don't know if it's "idiomatic", but it works. In essence, it reads the file binary. If there is something left at the end of the buffer, it copies that to the start, adjusts the buffer address and size and continues. It doesn't return a word per call, you open the file and it applies a quotation to each word parsed (a n --).

No, it's not beautiful, but it works. BTW, if you happen to be German and your prose contains words that exceed 256 characters, you're on your own.
...

I know it's cheating but then that's what libraries are for ;-)

1 fload bfile
\ readch ( -- c true | false )

256 constant /word \ max word length

/word reserve constant wordbuf

: eow? ( c -- f )
bl of true end
9 of true end
10 of true end
13 of true end
drop false
;

: parse-file ( xt -- )
>r 0 begin readch while
dup eow? if
drop wordbuf swap r@ execute 0
else
over wordbuf + c! 1+
then
repeat drop rdrop ;

: .word ( a u -- ) ( -leading -trailing) type cr ;

: run ( a u -- )
r/o openin ['] .word parse-file closein ;

--- Synchronet 3.21a-Linux NewsLink 1.2

From Kerr-Mudd, John@admin@127.0.0.1 to comp.lang.forth on Mon Nov 24 19:01:15 2025

From Newsgroup: comp.lang.forth

On Wed, 19 Nov 2025 11:22:08 +0100
albert@spenarnc.xs4all.nl wrote:

In article <87o6oyrc7i.fsf@nightsong.com>,
Paul Rubin <no.email@nospam.invalid> wrote:

peter <peter.noreply@tin.it> writes:

As Anton has already noted lxf uses this internally for parsing source files
Mapping the file uses memory regions outside the current process.
It is like first allocating memory and then reading in the entire file

Yes mmap is a virtual memory thing though. My current thought is to use
READ-LINE with some fixed buffer size, but have two contiguous buffers
and call READ-LINE twice, to handle the case where a word is split
across two buffers. Then process all words (whitespace terminated)
until the last whitespace is found. Anything left gets copied back
to the beginning of the double buffer, before calling READ-LINE again.
It's not worth bothering with true double buffering.

The ciforth approach is more sensible. The terminal input buffer is
filled from the input stream. It is large, say 16K.
Now you carve lines out of the buffer, and use a parse pointer,
maybe >IN. As soon as you find that there are no more line endings
in the remaining buffer, you copy the remainder to the start of
the buffer and fill the buffer to the brim.
In this case you will not copy more than is necessary.

A good approach, IMHO.
But how do you deal with the case where there's no line ending at EoF?
ISTM an additional action is required for the last line.
--
Bah, and indeed Humbug.
--- Synchronet 3.21a-Linux NewsLink 1.2

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Nov 25 12:59:47 2025

From Newsgroup: comp.lang.forth

In article <20251124190115.f6256c9bdaacb9ac4603a4ea@127.0.0.1>,
Kerr-Mudd, John <admin@127.0.0.1> wrote:

On Wed, 19 Nov 2025 11:22:08 +0100
albert@spenarnc.xs4all.nl wrote:

In article <87o6oyrc7i.fsf@nightsong.com>,
Paul Rubin <no.email@nospam.invalid> wrote:

The ciforth approach is more sensible. The terminal input buffer is
filled from the input stream. It is large, say 16K.
Now you carve lines out of the buffer, and use a parse pointer,
maybe >IN. As soon as you find that there are no more line endings
in the remaining buffer, you copy the remainder to the start of
the buffer and fill the buffer to the brim.
In this case you will not copy more than is necessary.

A good approach, IMHO.
But how do you deal with the case where there's no line ending at EoF?
ISTM an additional action is required for the last line.

There is an action at EoF regardless.
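
The buffer handling quoted above, including the extra action at EoF, might look roughly like this in standard Forth. This is a sketch under assumptions and not ciforth's code: every name is hypothetical, the buffer is 1024 characters rather than 16K, only LF counts as a line ending, and a line that does not fit in the buffer aborts.

```forth
\ Sketch only; every name below is hypothetical, none is ciforth's.
1024 constant /inbuf
create inbuf /inbuf chars allot
variable used                        \ characters currently held in INBUF
variable pos                         \ parse offset into INBUF

: find-lf ( c-addr u -- off )        \ offset of the first LF, or u if none
  dup 0 ?do
    over i chars + c@ 10 = if 2drop i unloop exit then
  loop nip ;

: rest ( -- c-addr u )  inbuf pos @ chars +  used @ pos @ - ;

: carve ( xt -- )                    \ hand each complete line to xt ( c-addr u -- )
  >r
  begin
    rest 2dup find-lf                ( c-addr u off )
    tuck >                           ( c-addr off found? )
  while
    dup 1+ pos +!                    \ step past the line and its LF
    r@ execute
  repeat
  2drop r> drop ;

: compact ( -- )                     \ move the unparsed tail to the buffer start
  rest >r  inbuf r@ chars move  r> used !  0 pos ! ;

: top-up ( fileid -- n )             \ fill the free space; n = characters added
  >r  inbuf used @ chars +  /inbuf used @ -  r> read-file throw
  dup used +! ;

: each-line ( xt fileid -- )
  0 used !  0 pos !
  begin
    used @ /inbuf = abort" line longer than buffer"
    dup top-up                       ( xt fileid n )
  while
    over carve  compact
  repeat
  drop                               ( xt )
  \ the action at EoF: a last line that has no terminator
  used @ if  inbuf used @ rot execute  else  drop  then ;
```

With a word such as `: show ( c-addr u -- ) type cr ;` and an open fileid on the stack, `' show swap each-line` would print the file line by line. Only the unparsed tail is ever copied, which is the point made in the quoted description.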

--
Bah, and indeed Humbug.

--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.
--- Synchronet 3.21a-Linux NewsLink 1.2
