• Tcl 8.6 vs 9.0 encoding plus some general confusion

    From ted@loft.tnolan.com (Ted Nolan <tednolan>) to comp.lang.tcl on Sun Jun 22 20:05:43 2025
    From Newsgroup: comp.lang.tcl

    I am always finding things that, in retrospect, I don't understand
    as well as I thought I did. Perhaps someone can help me get my
    mind around the following...

    I was surprised recently that one of the servers in our
    cluster is running a Linux distribution so forward-looking
    that tclsh is symlinked to tclsh9.0 instead of tclsh8.6.

    This caused one of my scripts to fail (or to pitch a warning,
    which had the same effect in context) about encoding while
    reading data.

    As background, I am reading line oriented data that comes in as a
    number of fields separated by a field separator character (backslash
    to be specific). Everything up to the last field is pure text data.
    However data after the last separator can be binary data (with the
    caveat that some light encoding is done such that it will not have
    a newline character until the actual end of the record).

    For tcl8.6, I have been setting up to read this data with something
    like the following. (I can't give the actual code here,
    so bear with any typos):

    set f [open $file r]
    fconfigure $f -encoding binary -translation binary

    while {[gets $f line] >= 0} {
        do_stuff $line
    }
    close $f

    The direction on the fconfigure man page for 8.6 is:

    If a file contains pure binary data (for instance, a JPEG
    image), the encoding for the channel should be configured to be
    binary. Tcl will then assign no interpretation to the data in
    the file and simply read or write raw bytes. The Tcl binary
    command can be used to manipulate this byte-oriented data. It
    is usually better to set the -translation option to binary when
    you want to transfer binary data, as this turns off the other
    automatic interpretations of the bytes in the stream as well.

    My understanding was that all this indicates to Tcl that we are
    creating a byte array and it should not attempt to convert the
    data to the internal Unicode format.

    However the warning thrown by 9.0 points me to the "chan"
    man page which says for the "-encoding" option:

    If a file contains pure binary data (for instance, a JPEG
    image), the encoding for the channel should be configured
    to be iso8859-1. Tcl will then assign no interpretation to
    the data in the file and simply read or write raw bytes.
    The Tcl binary command can be used to manipulate this
    byte-oriented data. It is usually better to set the
    -translation option to binary when you want to transfer
    binary data, as this turns off the other automatic
    interpretations of the bytes in the stream as well.

    and I don't understand this at all. If I say "-encoding iso8859-1",
    am I not saying that the data is textual, and that Tcl should parse
    it from "iso8859-1" into the internal Unicode as it reads it?
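
The question above can be checked directly. ISO 8859-1 assigns byte value N to code point N for all 256 values, so decoding with it changes no information. The following is only a minimal sketch of that property; the variable names are illustrative and nothing here comes from the original script.

```tcl
# Build a byte array holding every byte value 0..255.
set codes {}
for {set i 0} {$i < 256} {incr i} {lappend codes $i}
set bytes [binary format c* $codes]

# Decode it as ISO 8859-1 text: one character per byte.
set text [encoding convertfrom iso8859-1 $bytes]
set nChars [string length $text]

# The last character should carry the code point of the last byte value.
set lastCode [scan [string index $text end] %c]

# Encode back to bytes and compare with the original values.
binary scan [encoding convertto iso8859-1 $text] cu* roundTrip
set lossless [expr {$roundTrip eq $codes}]

puts "chars=$nChars lastCode=$lastCode lossless=$lossless"
```

So "-encoding iso8859-1" does formally declare the data to be text, but the conversion it requests is the identity mapping on byte values.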

    Also for both 8.6 & 9.0, when I search for the separator with "string
    index", and pull off the last field with "string range", am I forcing
    Tcl to consider the whole string as text, such that it tries to
    convert the (potentially) binary portion at the end of the line
    to internal Unicode? I have never observed a problem, but thinking
    harder about it, maybe one could, or should, occur? Should I be
    touching the string only with "binary" commands?
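
As a rough illustration of the string-command question: when every character of the line has a code point in 0..255 (which is what a binary-configured channel delivers), "string first" and "string range" only move characters around, and "binary scan" can still recover the byte values afterwards. The sketch below assumes a known number of leading text fields so that a backslash byte inside the binary tail is not mistaken for a separator. The helper name splitRecord and the sample data are hypothetical, not taken from the original script.

```tcl
# Hypothetical helper: split a record into nText text fields plus a binary tail.
proc splitRecord {line nText} {
    set fields {}
    set start 0
    for {set i 0} {$i < $nText} {incr i} {
        # Locate the next backslash separator, counting from the front.
        set sep [string first "\\" $line $start]
        if {$sep < 0} {
            error "record has fewer than $nText separators"
        }
        lappend fields [string range $line $start [expr {$sep - 1}]]
        set start [expr {$sep + 1}]
    }
    # Everything after the last counted separator is the (possibly binary) tail.
    lappend fields [string range $line $start end]
    return $fields
}

# Sample record: two text fields, then a tail containing 0x00, 0xFF, a
# backslash byte (0x5C) and 0x80.
set tail [binary format c* {0 255 92 128}]
set line "alpha\\beta\\$tail"

lassign [splitRecord $line 2] f1 f2 blob
binary scan $blob cu* tailBytes
puts "$f1 $f2 {$tailBytes}"
```

Counting separators from the front is a design choice for this sketch only; whether it fits depends on the real record layout, which the post does not show.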

    Big thanks for clearing up my thinking about this!
    --
    columbiaclosings.com
    What's not in Columbia anymore..
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Ralf Fassel@ralfixx@gmx.de to comp.lang.tcl on Mon Jun 23 12:06:16 2025

    * ted@loft.tnolan.com (Ted Nolan <tednolan>)
    | However the warning thrown by 9.0 points me to the "chan"
    | man page which says for the "-encoding" option:

    | If a file contains pure binary data (for instance, a JPEG
    | image), the encoding for the channel should be configured
    | to be iso8859-1. Tcl will then assign no interpretation to
    | the data in the file and simply read or write raw bytes.
    | The Tcl binary command can be used to manipulate this
    | byte-oriented data. It is usually better to set the
    | -translation option to binary when you want to transfer
    | binary data, as this turns off the other automatic
    | interpretations of the bytes in the stream as well.

    | and I don't understand this at all. If I say "-encoding iso8859-1",
    | am I not saying that the data is textual, and that Tcl should parse
    | it from "iso8859-1" into the internal Unicode as it reads it?

    Looking at the TCL sources for 9.0 and 8.6, it seems that the 'binary'
    encoding always has been an alias for 'iso8859-1', which has finally
    been removed in TCL 9, cf. changes.md:

    ## Notable incompatibilities
    - Removed the encoding alias `binary` to `iso8859-1`.

    The code in tcl8.6 has several places where, in the encoding context,
    'binary' finally ends up as 'iso8859-1', e.g. in tclIO.c:

        static Tcl_Encoding GetBinaryEncoding(void)
            => tsdPtr->binaryEncoding = Tcl_GetEncoding(NULL, "iso8859-1");

    The code in tcl9.0.1 has

        if ((newValue[0] == '\0') || !strcmp(newValue, "binary")) {
            if (interp) {
                Tcl_SetObjResult(interp, Tcl_ObjPrintf(
                        "unknown encoding \"%s\": No longer supported.\n"
                        "\tplease use either \"-translation binary\" "
                        "or \"-encoding iso8859-1\"", newValue));
            }
            return TCL_ERROR;
        }
    i.e. raise an error if the encoding is set to "binary".

    Effectively nothing should have changed, except that you can no longer
    say
        chan configure $chan -encoding binary
    in Tcl 9 (you should have used "-translation binary" anyway).
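
A small sketch of how a script might observe this difference at run time, assuming only that "-encoding binary" is rejected on 9.0 and accepted on 8.6 as described above. The scratch file and the variable names are illustrative.

```tcl
# Scratch channel just for the demonstration.
set chan [file tempfile path]

if {[catch {fconfigure $chan -encoding binary} msg]} {
    # Tcl 9.0: the alias is gone and the option value is rejected.
    set aliasAccepted 0
} else {
    # Tcl 8.6: "binary" is still accepted as an encoding name.
    set aliasAccepted 1
    set msg ""
}

# This spelling is accepted by both versions.
fconfigure $chan -translation binary

close $chan
file delete $path
puts "alias accepted: $aliasAccepted $msg"
```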

    HTH
    R'
  • From Rich@rich@example.invalid to comp.lang.tcl on Mon Jun 23 15:59:22 2025

    Ralf Fassel <ralfixx@gmx.de> wrote:
    > > and I don't understand this at all. If I say "-encoding
    > > iso8859-1", am I not saying that the data is textual, and that Tcl
    > > should parse it from "iso8859-1" into the internal Unicode as it
    > > reads it?
    >
    > Looking at the TCL sources for 9.0 and 8.6, it seems that the
    > 'binary' encoding always has been an alias for 'iso8859-1', which
    > has finally been removed in TCL 9, cf. changes.md:
    >
    > ## Notable incompatibilities
    > - Removed the encoding alias `binary` to `iso8859-1`.

    This feels like unnecessary exposure of internal details that an end
    user is not concerned about.

    If a user wants to read "binary" data, it would seem that they would
    expect to use "binary" as the name for that "encoding" (well, really, a
    lack of any encoding). If it indeed was mapped to iso8859-1
    internally, that is an internal implementation detail that is of no
    concern to them. Instead, I expect we will start to see a lot of
    confusion from users wondering why they are setting a "character
    encoding" when they really wanted to read "binary" data.

    > Effectively nothing should have changed, except that you can no longer
    > say
    >     chan configure $chan -encoding binary
    > in Tcl 9 (you should have used "-translation binary" anyway).

    Keeping the external "binary" alias visible would have been the better
    option in my opinion, even if it was nothing more than an alias for
    iso8859-1.

  • From Ralf Fassel@ralfixx@gmx.de to comp.lang.tcl on Mon Jun 23 18:35:26 2025

    * Rich <rich@example.invalid>
    | Ralf Fassel <ralfixx@gmx.de> wrote:
    | > ## Notable incompatibilities
    | > - Removed the encoding alias `binary` to `iso8859-1`.

    | This feels like unnecessary exposure of internal details that an end
    | user is not concerned about.

    | If a user wants to read "binary" data, it would seem that they would
    | expect to use "binary" as the name for that "encoding" (well, really, a
    | lack of any encoding). If it indeed was mapped to iso8859-1
    | internally, that is an internal implementation detail that is of no
    | concern to them.

    The tcl-9 manpage is not clear on this topic IMHO:

    https://www.tcl-lang.org/man/tcl/TclCmd/chan.html

    On the one hand, it states for -encoding:
    https://www.tcl-lang.org/man/tcl/TclCmd/chan.html#M11

    If a file contains pure binary data (for instance, a JPEG image), the
    encoding for the channel should be configured to be iso8859-1. Tcl will
    then assign no interpretation to the data in the file and simply read or
    write raw bytes.

    Two sentences later:
    It is usually better to set the -translation option to binary when
    you want to transfer binary data, as this turns off the other
    automatic interpretations of the bytes in the stream as well.

    And for -translation:
    https://www.tcl-lang.org/man/tcl/TclCmd/chan.html#M17
    binary
    Like lf, no end-of-line translation is performed, but in addition,
    sets -eofchar to the empty string to disable it, and sets
    -encoding to iso8859-1.

    This sounds to me as if configuring only "-encoding iso8859-1" is *not*
    enough to read binary data (since crlf translation and eofchar handling
    might still apply), and that the "usually better" should really read
    "necessary to".

    R'
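
The point about the two options can be made visible by querying a channel after each configuration step. This is only a sketch on a scratch file; the exact values printed depend on the platform and on the Tcl version (for example, 8.6 is expected to report the encoding as "binary" where 9.0 reports "iso8859-1"), and the variable names are illustrative.

```tcl
# Scratch file, opened read/write by [file tempfile].
set chan [file tempfile path]

# Case 1: only the encoding is changed; end-of-line translation is untouched.
fconfigure $chan -encoding iso8859-1
set onlyEncoding [fconfigure $chan -translation]

# Case 2: -translation binary changes translation, encoding and eofchar together.
fconfigure $chan -translation binary
set afterBinary [list [fconfigure $chan -translation] \
        [fconfigure $chan -encoding] [fconfigure $chan -eofchar]]

close $chan
file delete $path

puts "after -encoding iso8859-1 only: -translation is {$onlyEncoding}"
puts "after -translation binary: $afterBinary"
```

In the first case the input translation is still expected to be "auto", which is the setting that would rewrite CR LF pairs while reading.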
  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Tue Jun 24 08:55:21 2025

    On 23.06.2025 at 18:35, Ralf Fassel wrote:
    > This sounds to me as if configuring only "-encoding iso8859-1" is *not*
    > enough to read binary data (since crlf translation and eofchar handling
    > might still apply), and that the "usually better" should really read
    > "necessary to".

    Yes, "-translation binary" also has the advantage of working on both 8.6 and 9.0.
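
A version-neutral form of the read loop from the start of the thread might look like the sketch below. It writes a small sample file first so that it is self-contained; the sample records, the scratch file and the variable names are made up for illustration and do not reflect the real data.

```tcl
# Write a small sample file: two text fields, then a binary tail per line.
set out [file tempfile path]
fconfigure $out -translation binary
puts -nonewline $out "a\\b\\[binary format c* {13 26 200}]\n"
puts -nonewline $out "c\\d\\[binary format c* {0 255}]\n"
close $out

# Read it back. Only -translation binary is set, with no -encoding option.
set in [open $path r]
fconfigure $in -translation binary
set lengths {}
while {[gets $in line] >= 0} {
    # Each line keeps its CR (13), Ctrl-Z (26) and high bytes untouched.
    lappend lengths [string length $line]
}
close $in
file delete $path

puts $lengths
```

Dropping the "-encoding binary" part of the original fconfigure call is the only change to the loop itself.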


  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Tue Jun 24 08:56:29 2025

    On 23.06.2025 at 17:59, Rich wrote:
    > Keeping the external "binary" alias visible would have been the better
    > option in my opinion, even if it was nothing more than an alias for
    > iso8859-1.

    I share your opinion.
    Unfortunately, that ship has already sailed.
    It is always sad when internal design wins over usability...
    Some people didn't like the additional ifs...

    Sorry,
    Harald