• is_binary_file()

    From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Sat Dec 6 01:05:44 2025
    From Newsgroup: comp.lang.c

    Am I close? Missing anything you'd consider to be (or not) needed?

    #include <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return 1; // cannot open file, treat as error/fail check

        unsigned char buf[65536];
        size_t n, i;

        while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
            for (i = 0; i < n; i++) {
                unsigned char c = buf[i];

                // 1. check for the NULL byte (strong indicator of binary data)
                if (c == 0x00) {
                    fclose(f);
                    return 1; // IS binary
                }

                // 2. check for C0 control codes (0x01-0x1F), excluding known
                // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
                if (c < 0x20) {
                    if (c != 0x09 && c != 0x0A && c != 0x0D) {
                        fclose(f);
                        return 1; // IS binary (contains unexpected control code)
                    }
                }
            }
        }

        fclose(f);
        return 0; // NOT binary
    }
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 01:41:28 2025
    From Newsgroup: comp.lang.c

    On Sat, 06 Dec 2025 01:05:44 +0000, Michael Sanders wrote:

    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    First off, until we get computers that store file data in formats
    other than binary, /all/ files (text or not) are "binary" files
    (meaning that an is_binary_file() function should always return true).
    OTOH, "text files" are a distinguishable subset of binary files.
    I suggest that this makes an "is_text_file()" function more valuable
    and more fitting than an "is_binary_file()" function.

    Secondly, ISTM that the function should return a unique failure value
    rather than overload the "is binary" return value. After all, you
    actually have three return values: is_text, is_not_text, and
    is_indeterminate (because of file access failure).

    Thirdly, your determination of whether or not the file contains text
    seemingly depends only on the existence or absence of certain control
    characters. But text isn't just control characters; so you need a test
    for invalid non-control characters as well. And, IIRC, not all control
    characters occupy the ASCII/Unicode C0 band, so you might have to expand
    your "acceptable control character" test to include some of those other
    control codes.

    Finally, you've hardcoded the binary values for certain acceptable
    ASCII/Unicode control characters. However, not all platforms use ASCII
    or Unicode, and these tests would fail to test the corresponding character
    value correctly (I think here of EBCDIC, where "Line Feed" doesn't exist
    but its equivalent "NewLine" is 0x15 and Horizontal Tab is 0x05). Better
    here to use the C equivalent escape characters '\n' and '\t' instead.
    You may also consider expanding the control-character test to include
    other line-formatting characters (at least as far as C will allow):
    Vertical Tab ('\v'), Form Feed ('\f'), Carriage Return ('\r') and
    Backspace ('\b').
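
    A minimal sketch of the character-constant approach described above.
    The helper name is_formatting_control() is hypothetical and not from
    the thread; which control characters count as acceptable is an
    assumption that depends on your own definition of a text file.

```c
#include <stdbool.h>

/* Hypothetical helper (not from the thread): returns true when c is one
 * of the line-formatting control characters that C can name portably.
 * Character constants are used instead of hex values, so the test
 * follows the execution character set (ASCII, EBCDIC, ...). */
static bool is_formatting_control(int c)
{
    switch (c) {
    case '\t':  /* horizontal tab */
    case '\n':  /* new-line */
    case '\r':  /* carriage return */
    case '\v':  /* vertical tab */
    case '\f':  /* form feed */
    case '\b':  /* backspace */
        return true;
    default:
        return false;
    }
}
```

    A caller would combine this with a separate test for printable
    characters; the helper alone says nothing about non-control bytes.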


    int is_binary_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 1; // cannot open file, treat as error/fail check

    unsigned char buf[65536];
    size_t n, i;

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
    for (i = 0; i < n; i++) {
    unsigned char c = buf[i];

    // 1. check for the NULL byte (strong indicator of binary data)
    if (c == 0x00) {
    fclose(f);
    return 1; // IS binary
    }

    // 2. check for C0 control codes (0x01-0x1F), excluding known
    // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
    if (c < 0x20) {
    if (c != 0x09 && c != 0x0A && c != 0x0D) {
    fclose(f);
    return 1; // IS binary (contains unexpected control code)
    }
    }
    }
    }

    fclose(f);
    return 0; // NOT binary
    }
    --
    Lew Pitcher
    "In Skills We Trust"
    Not LLM output - I'm just like this.
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Fri Dec 5 17:42:30 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    Am I close? Missing anything you'd consider to be (or not) needed?

    There is no completely reliable way to do this, but you might be
    able to make a reasonable guess. A binary file might happen to
    contain only byte values that represent printable characters.

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    Please use the term "null bytes", not "NULL bytes". NULL is a standard
    macro that expands to a null pointer constant.

    int is_binary_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 1; // cannot open file, treat as error/fail check

    It seems odd to say that a file is assumed to be binary if you can't
    open it. I suggest having the function return more than two distinct
    values:

    - File seems to be binary
    - File seems to be text
    - Could be either
    - Something went wrong

    An enum is probably a good choice.
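
    One way the enum suggestion might look, as a sketch only. The type
    and function names (file_kind, classify_stream, classify_file) are
    made up for illustration, the empty-file case is mapped to "could be
    either" as an assumption, and the comparison with ' ' assumes an
    ASCII-like ordering of control characters.

```c
#include <stdio.h>

/* Hypothetical result type; the names are illustrative only. */
enum file_kind {
    FILE_KIND_TEXT,    /* no suspicious bytes were seen */
    FILE_KIND_BINARY,  /* at least one suspicious byte was seen */
    FILE_KIND_EMPTY,   /* nothing to inspect: could be either */
    FILE_KIND_ERROR    /* open, read, or close failed */
};

/* Classify an already-open stream, one character at a time.
 * stdio buffers the input, so getc() is not a system call per byte. */
static enum file_kind classify_stream(FILE *f)
{
    int ch;
    int seen_any = 0;

    while ((ch = getc(f)) != EOF) {
        seen_any = 1;
        /* Assumes control characters sort below ' ' (true for ASCII). */
        if (ch < ' ' && ch != '\t' && ch != '\n' && ch != '\r')
            return FILE_KIND_BINARY;
    }
    if (ferror(f))
        return FILE_KIND_ERROR;
    return seen_any ? FILE_KIND_TEXT : FILE_KIND_EMPTY;
}

/* Open, classify, and close; any I/O failure is reported separately. */
static enum file_kind classify_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    enum file_kind kind;

    if (f == NULL)
        return FILE_KIND_ERROR;
    kind = classify_stream(f);
    if (fclose(f) != 0)
        kind = FILE_KIND_ERROR;
    return kind;
}
```

    Splitting the stream logic from the open/close logic keeps the
    failure paths in one place and makes the classifier easy to test.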

    unsigned char buf[65536];
    size_t n, i;

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {

    Since you're only looking at individual characters, you might as well
    read one character at a time. The stdio functions will buffer the input
    for you, so there won't be much loss of performance.

    for (i = 0; i < n; i++) {
    unsigned char c = buf[i];

    // 1. check for the NULL byte (strong indicator of binary data)

    "null byte", not "NULL byte".

    if (c == 0x00) {
    fclose(f);
    return 1; // IS binary
    }

    // 2. check for C0 control codes (0x01-0x1F), excluding known
    // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
    if (c < 0x20) {
    if (c != 0x09 && c != 0x0A && c != 0x0D) {

    This test will detect '\0' bytes, making your first check redundant.

    fclose(f);
    return 1; // IS binary (contains unexpected control code)
    }

    You're assuming an ASCII-based character set, which is very
    probably a safe assumption. But I'd suggest replacing most of
    the hex constants with character constants. Aside from being more
    portable (realistically EBCDIC systems are the only case where it
    will matter), it makes the code more readable. And things like
    UTF-8 and UTF-16 make things a lot more complicated.

    0x00 -> '\0'
    0x20 -> ' '
    0x09 -> '\t'
    0x0A -> '\n'
    0x0D -> '\r'

    }
    }
    }

    fclose(f);

    fclose(f) can fail. That's not likely, but you should check.

    return 0; // NOT binary
    }

    You treat an empty file as text. That's not entirely unreasonable,
    but you should at least document it.

    You assume that a binary file is one that contains any byte values
    in the range 0..31 other than '\t', '\n', and '\r'. So a "text"
    file can't contain formfeed characters (debatable), but it can
    contain DEL characters and anything above 127.

    For Latin-1, values from 0xa0 to 0xff are printable (0xa0 is
    NO-BREAK SPACE, so that might be debatable). For UTF-8, bytes with
    values 0x80 and higher can be valid, but only in certain contexts.
    And so on.
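
    To illustrate "valid, but only in certain contexts", here is a rough
    sketch of a UTF-8 well-formedness check over an in-memory buffer. The
    function name is_valid_utf8() is hypothetical. It assumes the whole
    text is in the buffer; a caller reading in chunks would have to
    handle a sequence split across a chunk boundary.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if buf[0..n) is well-formed UTF-8. Rejects stray
 * continuation bytes, truncated sequences, overlong forms, surrogates,
 * and code points above U+10FFFF. */
static bool is_valid_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = buf[i];
        size_t len;
        unsigned long cp, min;

        if (b < 0x80) {                 /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) { /* 110xxxxx: 2-byte lead */
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) { /* 1110xxxx: 3-byte lead */
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) { /* 11110xxx: 4-byte lead */
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return false;               /* stray continuation or invalid */
        }

        if (n - i < len)
            return false;               /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;               /* overlong, too large, surrogate */
        i += len;
    }
    return true;
}
```

    Passing this check does not prove a file is text; it only shows the
    bytes are consistent with UTF-8.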

    Depending on how far you want to get into it, distinguishing between
    text and binary files is anywhere from difficult to literally
    impossible.

    Take a look at the "file" command.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 02:00:22 2025
    From Newsgroup: comp.lang.c

    On Sat, 06 Dec 2025 01:41:28 +0000, Lew Pitcher wrote:

    On Sat, 06 Dec 2025 01:05:44 +0000, Michael Sanders wrote:

    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    First off, until we get computers that store file data in formats
    other than binary, /all/ files (text or not) are "binary" files
    (meaning that an is_binary_file() function should always return true).
    OTOH, "text files" are a distinguishable subset of binary files.
    I suggest that this makes an "is_text_file()" function more valuable
    and more fitting than an "is_binary_file()" function.

    Secondly, ISTM that the function should return a unique failure value
    rather than overload the "is binary" return value. After all, you
    actually have three return values: is_text, is_not_text, and
    is_indeterminate (because of file access failure).
    [snip]

    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    The best advice I can give here is that you should pick a definition
    of what a text file consists of, document /that/ definition, and
    use /that/ documentation to build your code. If you say that, for
    instance, EBCDIC is out of scope, then your code does not have to
    handle EBCDIC (but if you /don't/ say that, then you leave your code
    open to the ambiguity of whether or not it will work with EBCDIC).
    Likewise for ASCII or "Extended ASCII" (sic) or Unicode (or 6Bit
    (multiple different choices here) or Baudot or even Morse).

    With suitable definitions beforehand, you can write an acceptable
    "is_text_file()" function and/or a passable "is_binary_file()"
    function.

    HTH
    --
    Lew Pitcher
    "In Skills We Trust"
    Not LLM output - I'm just like this.
  • From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Sat Dec 6 02:42:39 2025
    From Newsgroup: comp.lang.c

    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {

    [ ... ]

    fclose(f);
    return 0; // NOT binary
    }

    How about:

    #include <limits.h> /* CHAR_BIT */
    #include <stdio.h>

    int is_binary_file(const char *path)
    {
        FILE *f = fopen(path, "rb");
        int yes = 0;

        if (f) {
            int ch;

            while ((ch = getc(f)) != EOF) {
                for (int i = 0; i < CHAR_BIT; i++, ch >>= 1) {
                    switch ((ch & 1)) {
                    case 0:
                    case 1:
                        break;
                    default:
                        goto out;
                    }
                }
            }

            // TODO: distinguish feof/ferror
            yes = 1;
        out:

            fclose(f);
        }

        return yes;
    }
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Paul@nospam@needed.invalid to comp.lang.c on Sat Dec 6 03:14:55 2025
    From Newsgroup: comp.lang.c

    On Fri, 12/5/2025 8:05 PM, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 1; // cannot open file, treat as error/fail check

    unsigned char buf[65536];
    size_t n, i;

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
    for (i = 0; i < n; i++) {
    unsigned char c = buf[i];

    // 1. check for the NULL byte (strong indicator of binary data)
    if (c == 0x00) {
    fclose(f);
    return 1; // IS binary
    }

    // 2. check for C0 control codes (0x01-0x1F), excluding known
    // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
    if (c < 0x20) {
    if (c != 0x09 && c != 0x0A && c != 0x0D) {
    fclose(f);
    return 1; // IS binary (contains unexpected control code)
    }
    }
    }
    }

    fclose(f);
    return 0; // NOT binary
    }


    It is the year 2025.

    How many times do you suppose someone has considered this question ?

    I'm not trying to be a smart ass by saying this, just that the
    question is bound to be nuanced. You can do a fast and totally
    inaccurate determination. You can do a computationally expensive
    or I/O expensive determination.

    There has to be a reason for doing this, and a damn good reason.

    *******

    There is the "file" command.

    It was invented in 1973.

    https://en.wikipedia.org/wiki/File_%28command%29

    The beauty of this command, is it has some sort of ordered
    approach to file determination.

    Originally, as I understand it (I don't see it in the Wiki), it
    was not supposed to read more than 1024 bytes of the file. This
    was because the command was intended to settle file determinations
    for "ordered types". For example, an MSWord doc, might have four
    unique bytes near the beginning of the file. The designers felt
    they could quickly "sort" or "determine" what kind of highly
    stylized file they were dealing with.

    But the results I got one day a couple years ago, suggests
    they have strayed from that. I got around 100 different text
    file declarations. For example, a text file with a binary block
    in it as a "corruption", it is declared as a text file, but
    the word "ISO something or other" is part of the file type
    determination. Thus, when I see a certain file on my computer
    is no longer a plain text file, but contains the word ISO,
    then I must scroll through it with a hex editor and see what
    the hell has triggered this determination.

    The experience suggested the entire text file was being read.
    I did not craft any tests to see if that was true.

    Some file types receive very little differentiation. There is
    only the one detection for them, the detection offers no help
    for technical people.

    That's an exemplar of a still-supported effort to identify files.
    The "file" command. It does not rely upon, or use, the extension.

    And those people are wizards. You can't expect to just read their
    source and make some instant discovery. Sometimes, when someone
    asks for a new detection, the wizards know of some dependencies
    in the detection tree that prevent the craftsmanship necessary.
    Mere mortals need not apply while this is going on.

    To find 100 different text file types, I un-tarred the Firefox
    source tarball and scanned it, then used AWK to total the
    various detections and print them out. I only used the AWK
    code, after being shocked to find what a shithole the tarball was.
    I had originally intended to run UNIX2DOS over the thing, but
    that was entirely out of the question when the detections
    came in. In fact, there is just one source file in the Firefox
    tree, that you MUST NOT alter. It breaks the build, if you do
    ANYTHING to it. Good times. I could not figure out why gcc
    had such a problem with the file. Could not root cause it.

    *******

    As a little example, I will scan the Sent file of my News Client,
    which I happen to know is corrupted, but I haven't bothered to
    fix it yet. And how I detected the corruption in the first place,
    was by running this!

    $ file Sent
    Sent: Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators

    That is a corrupt one.

    $ file Trash
    Trash: ASCII text, with CRLF line terminators

    That is not corrupt.

    $ dd if=/dev/urandom of=big.bin bs=1048576 count=1024
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.44362 s, 144 MB/s

    $ file big.bin
    big.bin: data <=== Not definitive, as even trivially distorted files do this.
    This file just happens to be "perfectly undetectable".

    A file full of zeros, is also "data". There is no special detection for it.

    Paul
  • From bart@bc@freeuk.com to comp.lang.c on Sat Dec 6 12:42:53 2025
    From Newsgroup: comp.lang.c

    On 06/12/2025 02:42, Kaz Kylheku wrote:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {

    [ ... ]

    fclose(f);
    return 0; // NOT binary
    }

    How about:

    int is_binary_file(const char *path)
    {
    FILE *f = fopen(path, "rb");
    int yes = 0;

    if (f) {
    int ch;

    while ((ch = getc(f)) != EOF) {
    for (int i = 0; i < CHAR_BIT; i++, ch >>= 1) {
    switch ((ch & 1)) {
    case 0:
    case 1:
    break;
    default:

    If this is supposed to detect files which don't consist of binary
    characters (for example each ch has CHAR_BIT quaternary digits) then I
    don't believe this will detect that.

    Assuming that 'ch & 1' is equivalent to 'ch % 2' in that case.


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 17:33:08 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    Am I close? Missing anything you'd consider to be (or not) needed?

    Technically, there is no such thing as a "binary" file. All files
    are simply sequences of bytes with no format implied. Interpretation
    of the file content is purely application dependent.

    C-based applications have certain restrictions on text format
    due to the use of the ASCII NUL code as a string terminator, but
    that's C. The content of a text file processed by a different
    language, or by C using application-defined string containers
    can easily contain a NUL byte yet still be considered "text"
    if that distinction is necessary.

    Because of C/C++, a valid UTF-8 encoding will not include
    the NUL byte.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 17:37:11 2025
    From Newsgroup: comp.lang.c

    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    Michael Sanders <porkchop@invalid.foo> writes:
    Am I close? Missing anything you'd consider to be (or not) needed?

    There is no completely reliable way to do this, but you might be
    able to make a reasonable guess. A binary file might happen to
    contain only byte values that represent printable characters.

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    Please use the term "null bytes", not "NULL bytes". NULL is a standard
    macro that expands to a null pointer constant.

    The proper term IMO is 'NUL' byte as defined by ASCII.

    Some older operating systems actually stored the file type in
    metadata (like the unix inode). The Burroughs MCP filesystems
    included a file-type field in the metadata for a file; the CANDE editor
    would use this to determine the programming language (and the associated
    language formatting rules a la COBOL or FORTRAN vis-a-vis column
    assignments for the sequence number, program verbs, etc.).
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 17:40:18 2025
    From Newsgroup: comp.lang.c

    Kaz Kylheku <046-301-5902@kylheku.com> writes:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {

    [ ... ]

    fclose(f);
    return 0; // NOT binary
    }

    How about:

    int is_binary_file(const char *path)
    {
        FILE *f = fopen(path, "rb");

        if (f) {

            while (isprint(getc(f))) {}
            return (!feof(f));

        }
        return 0;
    }
  • From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 18:04:00 2025
    From Newsgroup: comp.lang.c

    On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:

    Kaz Kylheku <046-301-5902@kylheku.com> writes:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {

    [ ... ]

    fclose(f);
    return 0; // NOT binary
    }

    How about:

    int is_binary_file(const char *path)
    {
    FILE *f = fopen(path, "rb");

    if (f) {

    while (isprint(getc(f))) {}

    The isprint function tests for any member of a locale-specific
    set of characters (each of which occupies one printing position
    on a display device) including space (' ').

    It effectively evaluates whether or not a given value is a
    "printing character" in the execution characterset, not whether
    or not a given value (from an outside file) is a text character.

    I'd use this function cautiously, as it will produce false
    results when the characterset of the source data is not the
    execution characterset (think a Unicode UTF16 encoded text
    file, and an ASCII execution characterset).

    return (!feof(f));

    }
    return 0;
    }
    --
    Lew Pitcher
    "In Skills We Trust"
    Not LLM output - I'm just like this.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 19:06:14 2025
    From Newsgroup: comp.lang.c

    Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
    On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:

    Kaz Kylheku <046-301-5902@kylheku.com> writes:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {

    [ ... ]

    fclose(f);
    return 0; // NOT binary
    }

    How about:

    int is_binary_file(const char *path)
    {
    FILE *f = fopen(path, "rb");

    if (f) {

    while (isprint(getc(f))) {}

    The isprint function tests for any member of a locale-specific
    set of characters (each of which occupies one printing position
    on a display device) including space (' ').

    It effectively evaluates whether or not a given value is a
    "printing character" in the execution characterset, not whether
    or not a given value (from an outside file) is a text character.

    What is your definition of a "text" character?


    I'd use this function cautiously, as it will produce false
    results when the characterset of the source data is not the
    execution characterset (think a Unicode UTF16 encoded text
    file, and an ASCII execution characterset).

    return (!feof(f));

    }
    return 0;
    }




    --
    Lew Pitcher
    "In Skills We Trust"
    Not LLM output - I'm just like this.
  • From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 21:16:02 2025
    From Newsgroup: comp.lang.c

    On Sat, 06 Dec 2025 19:06:14 +0000, Scott Lurndal wrote:

    Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
    On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:

    Kaz Kylheku <046-301-5902@kylheku.com> writes:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.
    * Returns 0 if text, 1 if binary or file open failure.
    */

    int is_binary_file(const char *path) {

    [ ... ]

    fclose(f);
    return 0; // NOT binary
    }

    How about:

    int is_binary_file(const char *path)
    {
    FILE *f = fopen(path, "rb");

    if (f) {

    while (isprint(getc(f))) {}

    The isprint function tests for any member of a locale-specific
    set of characters (each of which occupies one printing position
    on a display device) including space (' ').

    It effectively evaluates whether or not a given value is a
    "printing character" in the execution characterset, not whether
    or not a given value (from an outside file) is a text character.

    What is your definition of a "text" character?

    I have none, for this case. However, the OP /might/ have one, given
    that his code was an attempt to discern "text" files from "binary"
    files.


    I'd use this function cautiously, as it will produce false
    results when the characterset of the source data is not the
    execution characterset (think a Unicode UTF16 encoded text
    file, and an ASCII execution characterset).

    return (!feof(f));

    }
    return 0;
    }
    --
    Lew Pitcher
    "In Skills We Trust"
    Not LLM output - I'm just like this.
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Sat Dec 6 16:05:45 2025
    From Newsgroup: comp.lang.c

    scott@slp53.sl.home (Scott Lurndal) writes:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    [...]
    Please use the term "null bytes", not "NULL bytes". NULL is a standard
    macro that expands to a null pointer constant.

    The proper term IMO is 'NUL' byte as defined by ASCII.

    That's *a* proper term. It's not the only one.

    Both ASCII and EBCDIC use the term "NUL" for the character value
    with all bits set to zero, but C doesn't assume either ASCII or
    EBCDIC and doesn't use the name "NUL". The standard uses the term
    "null character", which is technically correct but might not be
    ideal to refer to a byte in a file whose contents aren't intended
    to represent characters.

    I have no problem with the term "NUL", "NUL byte", or "NUL
    character", but personally I tend to prefer "null byte", "zero byte",
    or '\0'.

    [...]
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Sat Dec 6 20:37:22 2025
    From Newsgroup: comp.lang.c

    On 2025-12-05 20:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    <stdio.h>

    /*
    * Checks if a file is likely a binary by examining its content
    * for NULL bytes (0x00) or unusual control characters.

    NULL is a macro that expands to a null pointer constant. I think you
    mean "null character". This isn't just nit-picking: C is a
    case-sensitive language, so it's essential to pay attention to case.

    * Returns 0 if text, 1 if binary or file open failure.
    */

    You should return a distinct value for file open failure - a file that
    cannot be opened cannot be determined to be either a text or a binary file.

    You really cannot distinguish with certainty whether a file is a text
    file or a binary file based solely upon the contents. A file whose
    format is an array of two-byte 2's complement little-endian integers
    would normally be considered binary, yet it might happen to contain
    integers whose bytes all happen to be printable characters.

    The standard does not define what a "binary file" is. However, it does
    provide a promise that applies only to streams in text mode, which
    depends upon what was written to that file:

    "Data read in from a text stream will necessarily compare equal to the
    data that were earlier written out to that stream only if: the data
    consist only of printing characters and the control characters
    horizontal tab and new-line; no new-line character is immediately
    preceded by space characters; and the last character is a new-line
    character." (7.23.2p2).

    I believe it therefore makes sense to consider something to be a text
    file if it meets those requirements, and otherwise is a binary file.
    Note that the last requirement implies that an empty file cannot qualify
    as text - at a minimum, it must contain a new-line character.

    This implies the use of the isprint() function; the only other
    characters you need to handle specifically are '\t', '\n', and ' '.
    Since the result returned by isprint() is locale-dependent, the program
    should, at least optionally, use setlocale().
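
    The criteria quoted above could be sketched roughly as follows. The
    function name meets_text_stream_rules() is hypothetical, and working
    on an in-memory buffer rather than a stream is a simplification. The
    result depends on the current locale through isprint(), so a caller
    might first call setlocale(LC_CTYPE, "").

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if buf[0..n) satisfies the text-stream conditions:
 * only printing characters, horizontal tab, and new-line; no new-line
 * immediately preceded by a space; last character is a new-line. */
static bool meets_text_stream_rules(const unsigned char *buf, size_t n)
{
    /* An empty buffer has no final new-line, so it cannot qualify. */
    if (n == 0 || buf[n - 1] != '\n')
        return false;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c == '\n') {
            if (i > 0 && buf[i - 1] == ' ')
                return false;   /* space right before a new-line */
        } else if (c != '\t' && !isprint(c)) {
            return false;       /* neither printing, tab, nor new-line */
        }
    }
    return true;
}
```

    In the default "C" locale this accepts only printable ASCII plus tab
    and new-line, which is stricter than most everyday notions of text.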
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.lang.c on Sun Dec 7 03:43:58 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    You are missing a definition: you should first decide what you consider
    to be a binary file (this is the hard part). You may wish to consider
    my experience many years ago: I looked at problem reports about
    SUN OS. Those were considered text files, in total about 160 MB.
    For my purposes it would have been convenient to find a character code
    _not_ appearing in those files. But checking found that the only code
    which did not appear was 0. The reports were mostly in English,
    but there were non-English pieces contributing international
    characters. There were a handful of box-drawing characters.
    There were (I think stray) control codes.
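
    A survey like the one described can be done with a simple byte
    histogram. This is only a sketch; the function names are made up for
    illustration, and it works on a buffer the caller has already read.

```c
#include <limits.h>
#include <stddef.h>

/* Add the bytes of buf[0..n) to a running tally. The caller must
 * zero-initialize counts before the first call. */
static void count_bytes(const unsigned char *buf, size_t n,
                        size_t counts[UCHAR_MAX + 1])
{
    for (size_t i = 0; i < n; i++)
        counts[buf[i]]++;
}

/* Returns how many byte values never occurred in the tallied data. */
static size_t count_unused_values(const size_t counts[UCHAR_MAX + 1])
{
    size_t unused = 0;

    for (size_t v = 0; v <= UCHAR_MAX; v++)
        if (counts[v] == 0)
            unused++;
    return unused;
}
```

    Calling count_bytes() once per chunk read from each file accumulates
    a tally across a whole corpus.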

    You can take from this that a zero code is a strong indicator of a
    non-text file. But do you consider UTF-16 encoded text as binary?
    Note that such text is likely to contain a lot of zero bytes.
    Any byte other than zero will appear in a file considered by
    its author to be a text file, as long as you take a large enough
    sample.
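
    One cheap hint for the UTF-16 case is a byte order mark at the start
    of the file. The sketch below is an assumption-laden illustration:
    the names are hypothetical, many UTF-16 files carry no BOM at all,
    and it ignores UTF-32 (whose little-endian BOM begins with the same
    two bytes as UTF-16 LE).

```c
#include <stddef.h>

/* Hypothetical result type for the BOM check. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first bytes of a file for a Unicode byte order mark. */
static enum bom_kind detect_bom(const unsigned char *buf, size_t n)
{
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

    A classifier could use the result to decide whether zero bytes
    should count against the file being text.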

    If you have a few hundred characters from a file, you can apply
    a reasonably simple statistical test to decide if the text came from
    one of the popular human languages, and if so the test will tell you
    the language.

    For security puprose you may wish to check if a file oly contains
    safe codes. But definition of "safe" depends on application.
    In US context you could decide that anything outside printable
    ASCII + newline is unsafe. Or you may add to this some selected
    contol codes like tabs. In international context you probably
    need to allow relevant national character codes, which depends
    on specific environment.
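
The "printable ASCII + newline, optionally tab" rule can be sketched as an explicit allowlist. The function name `only_safe_ascii` and its `allow_tab` parameter are hypothetical. The byte ranges are hard-coded rather than taken from `isprint()`, so the result does not depend on the current locale; that is a design choice for this sketch, not something the post prescribes.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if every byte in buf[0..n) is printable
 * ASCII (0x20..0x7E) or LF (0x0A), plus TAB (0x09) when allow_tab is
 * nonzero.  Returns 0 as soon as any other byte is seen.
 */
int only_safe_ascii(const unsigned char *buf, size_t n, int allow_tab)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c >= 0x20 && c <= 0x7E)
            continue;               /* printable ASCII */
        if (c == 0x0A)
            continue;               /* newline */
        if (allow_tab && c == 0x09)
            continue;               /* tab, only if the caller allows it */
        return 0;
    }
    return 1;
}
```

An international variant would need a different allowlist, which is exactly the environment-dependent part mentioned above.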
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Louis Krupp@lkrupp@invalid.pssw.com.invalid to comp.lang.c on Sun Dec 7 03:43:40 2025
    From Newsgroup: comp.lang.c

    On 12/6/2025 10:37 AM, Scott Lurndal wrote:
    <snip>

    Some older operating systems actually stored the file type in
    metadata (like the unix inode). The Burroughs MCP filesystems
    included a file-type field in the metadata for a file; the CANDE editor
    would use this to determine the programming language (and the associated
    language formatting rules a la COBOL or FORTRAN vis-a-vis column
    assignments for the sequence number, program verbs, etc.).

    The Burroughs file attribute name was "FILEKIND," and it took values
    like ALGOLSYMBOL (for an ALGOL source file) and ALGOLCODE (for an
    executable compiled with ALGOL). Other file attributes included maximum
    record length, character encoding (e.g. ASCII or EBCDIC), and lots more.

    This brings back memories, most of them fond.

    As far as I can tell, UNISYS MCP systems still have all that:

    https://public.support.unisys.com/aseries/docs/ClearPath-MCP-19.0/86000064-520/86000064-520/chapter-000002094.html

    Louis


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sun Dec 7 16:47:19 2025
    From Newsgroup: comp.lang.c

    Louis Krupp <lkrupp@invalid.pssw.com.invalid> writes:
    On 12/6/2025 10:37 AM, Scott Lurndal wrote:
    <snip>

    Some older operating systems actually stored the file type in
    metadata (like the unix inode). The Burroughs MCP filesystems
    included a file-type field in the metadata for a file; the CANDE editor
    would use this to determine the programming language (and the associated
    language formatting rules a la COBOL or FORTRAN vis-a-vis column
    assignments for the sequence number, program verbs, etc.).

    The Burroughs file attribute name was "FILEKIND," and it took values
    like ALGOLSYMBOL (for an ALGOL source file) and ALGOLCODE (for an
    executable compiled with ALGOL). Other file attributes included maximum
    record length, character encoding (e.g. ASCII or EBCDIC), and lots more.

    This brings back memories, most of them fond.

    As far as I can tell, UNISYS MCP systems still have all that:

    https://public.support.unisys.com/aseries/docs/ClearPath-MCP-19.0/86000064-520/86000064-520/chapter-000002094.html


    Yes the A-series (Large Systems) emulated systems still have
    all that.

    The V-series (long defunct) also supported a file kind attribute
    for CANDE files.

    --------------------------------------------------------------------------------
    CAT
    C A T A L O G
    Usercode: 9895 Filetitle: ====qn on HOME As of 12/07/25 08:35:22 Pg 01

    gemcqn SYS Record-size = 600 RPB = 1 Areas 0 EOF 716 LOCALSPO
    w15eqn SYS Record-size = 160 RPB = 90 Areas 0 EOF 322 LURNDAL
    AAAAqn BPL    09/28/89 10:18:13    5Rec(s) Pub IO 9895
    ADDMqn BPL    04/14/87 19:46:19   10Rec(s) Pri IO 9895
    ADDUqn BPL    03/03/89 17:04:42   21Rec(s) Pub IO 9895
    ADSSqn BPL    06/28/89 14:15:24 7999Rec(s) Pub IO 9895
    AHWAqn SPRITE 10/09/89 17:53:23   23Rec(s) Pub IO 9895
    AIFAqn BPL    10/12/89 15:15:29    8Rec(s) Pub IO 9895
    AIVAqn BPL    10/20/89 16:21:33    4Rec(s) Pub IO 9895
    APBPqn BPL    01/11/89 18:05:38   92Rec(s) Pub IO 9895
    ARCVqn BPL    04/10/89 10:47:00  376Rec(s) Grd IO 9895
    BACKqn BPL    09/09/89 03:55:03  576Rec(s) Pub IO 9895
    BBBBqn SPRITE 01/25/89 12:47:45    1Rec(s) Pub IO 9895
    BFILqn BINDER 07/05/88 16:27:44    9Rec(s) Pub IO 9895
    BLESqn BPL    11/18/88 14:37:32   34Rec(s) Pub IO 9895
    BLOAqn BINDER 08/01/87 16:02:54   49Rec(s) Pub IO 9895
    BNAGqn DATA   02/08/88 15:07:06   41Rec(s) Pub IO 9999
    BNAUqn BPL    06/06/88 15:02:14  104Rec(s) Pri IO 9895
    BNAVqn DATA   03/03/89 16:15:00   58Rec(s) Pri IO 9895
    BSKLqn BPL    08/11/89 14:24:18  536Rec(s) Pub IO 9895
    Transmit space for next page..

    BPL - Burroughs Programming Language (low-level systems programming)
    SPRITE - Modula-like OS implementation language
    BINDER - linker instructions.


    (ADSSqn is the BPL source for the document formatting utility)

    Four-letter file names were a bit of a pain (the system had
    six-character names, but for CANDE the last two characters
    stored the usercode; 9895 in EBCDIC is 'qn').

    I wrote the MCP system initialization code; the source file
    was SINSqn, the printer banner name was SINEqn. A colleague pointed
    out that could be read as sine qua non, which seemed quite
    apropos for the system boot code :-).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Sun Dec 7 19:04:51 2025
    From Newsgroup: comp.lang.c

    Am 06.12.2025 um 18:33 schrieb Scott Lurndal:
    Michael Sanders <porkchop@invalid.foo> writes:
    Am I close? Missing anything you'd consider to be (or not) needed?
    Technically, there is no such thing as a "binary" file. All files
    are simply sequences of bytes with no format implied. Interpretation
    of the file content is purely application dependent.

    C-based applications have certain restrictions on text format
    due to the use of the ASCII NUL code as a string terminator, but
    that's C. The content of a text file processed by a different
    language, or by C using application-defined string containers
    can easily contain a NUL byte yet still be considered "text"
    if that distinction is necessary.

    Because of C/C++, a valid UTF-8 encoding will not include
    the NUL byte.

    You're a philosopher of language because you can't handle ambiguity. But
    C is ambiguous at this point.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Harnden@richard.nospam@gmail.invalid to comp.lang.c on Sun Dec 7 19:01:02 2025
    From Newsgroup: comp.lang.c

    On 06/12/2025 01:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    A text file is supposed to end with a '\n' (M$, of course, largely
    ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);

    if ( (c = fgetc(f)) == '\n' )
    printf("Text\n");
    else
    printf("Binary\n");

    fclose(f);

    Be aware of false positives/negatives, because I'm sure there will be
    plenty :)

    #include <stdio.h>

    /*
     * Checks if a file is likely a binary by examining its content
     * for NULL bytes (0x00) or unusual control characters.
     * Returns 0 if text, 1 if binary or file open failure.
     */

    int is_binary_file(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return 1; // cannot open file, treat as error/fail check

        unsigned char buf[65536];
        size_t n, i;

        while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
            for (i = 0; i < n; i++) {
                unsigned char c = buf[i];

                // 1. check for the NULL byte (strong indicator of binary data)
                if (c == 0x00) {
                    fclose(f);
                    return 1; // IS binary
                }

                // 2. check for C0 control codes (0x01-0x1F), excluding known
                // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
                if (c < 0x20) {
                    if (c != 0x09 && c != 0x0A && c != 0x0D) {
                        fclose(f);
                        return 1; // IS binary (contains unexpected control code)
                    }
                }
            }
        }

        fclose(f);
        return 0; // NOT binary
    }


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Sun Dec 7 21:51:36 2025
    From Newsgroup: comp.lang.c

    On 07/12/2025 19:01, Richard Harnden wrote:
    On 06/12/2025 01:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not)
    needed?

    A text file is supposed to end with a '\n' (M$, of course,
    largely ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);

    Not guaranteed to work with binary files...

    7.19.9.2(3)

    A binary stream need not meaningfully support fseek calls with a
    whence value of SEEK_END.

    ...or text files.

    7.19.9.2(4)

    For a text stream, either offset shall be zero, or offset shall
    be a value returned by an earlier successful call to the ftell
    function on a stream associated with the same file and whence
    shall be SEEK_SET.
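
A sketch of one way to look at the final byte without relying on `fseek(f, -1, SEEK_END)`: read the stream sequentially and remember the last byte seen. The function name `last_byte_of_stream` is hypothetical. This costs a full pass over the file, which is the price of staying within what the standard guarantees.

```c
#include <stdio.h>

/*
 * Hypothetical helper: reads f from its current position to end-of-file
 * and returns the last byte read as an unsigned char converted to int.
 * Returns EOF if nothing could be read or a read error occurred.
 */
int last_byte_of_stream(FILE *f)
{
    int c;
    int last = EOF;

    while ((c = getc(f)) != EOF)
        last = c;

    if (ferror(f))
        return EOF;     /* a read error, not a genuine end-of-file */
    return last;
}
```

A caller would open the file with `fopen(path, "rb")`, pass the stream in, and compare the result with `'\n'`.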
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Sun Dec 7 14:42:39 2025
    From Newsgroup: comp.lang.c

    On 12/5/2025 5:05 PM, Michael Sanders wrote:
    int is_binary_file(const char *path) {

    [...]

    You can return a float from is_binary_file() to show a probability? Not exactly sure how you can 100% guarantee it...
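
A rough sketch of that idea, assuming the data is already in a buffer: instead of a yes/no answer, report the fraction of bytes that the original function would have rejected. The name `suspicious_fraction` is hypothetical, and the value is only a crude score, not a calibrated probability.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns the fraction (0.0 to 1.0) of bytes in
 * buf[0..n) that are null bytes or C0 control codes other than tab,
 * LF and CR.  An empty buffer yields 0.0.
 */
double suspicious_fraction(const unsigned char *buf, size_t n)
{
    size_t i;
    size_t odd = 0;

    if (n == 0)
        return 0.0;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
            odd++;
    }
    return (double)odd / (double)n;
}
```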
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Harnden@richard.nospam@gmail.invalid to comp.lang.c on Sun Dec 7 22:49:52 2025
    From Newsgroup: comp.lang.c

    On 07/12/2025 21:51, Richard Heathfield wrote:
    On 07/12/2025 19:01, Richard Harnden wrote:
    On 06/12/2025 01:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not)
    needed?

    A text file is supposed to end with a '\n' (M$, of course, largely
    ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);

    Not guaranteed to work with binary files...

    7.19.9.2(3)

    A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

    ...or text files.

    7.19.9.2(4)

    For a text stream, either offset shall be zero, or offset shall
    be a value returned by an earlier successful call to the ftell function
    on a stream associated with the same file and whence shall be SEEK_SET.


    Ah, okay. Thanks.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 13:51:49 2025
    From Newsgroup: comp.lang.c

    Am 07.12.2025 um 22:51 schrieb Richard Heathfield:
    On 07/12/2025 19:01, Richard Harnden wrote:
    On 06/12/2025 01:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not)
    needed?

    A text file is supposed to end with a '\n' (M$, of course, largely
    ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);

    Not guaranteed to work with binary files...

    7.19.9.2(3)

    A binary stream need not meaningfully support fseek calls with a
    whence value of SEEK_END.

    From the glibc Reference Manual:

    “The distinction between text and binary streams is only meaningful on systems where text files
    have a different internal representation. On Unix systems, there is no difference between the
    two; the ‘b’ is accepted but ignored.”



    ...or text files.

    7.19.9.2(4)

    For a text stream, either offset shall be zero, or offset shall
    be a value returned by an earlier successful call to the ftell
    function on a stream associated with the same file and whence shall be SEEK_SET.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Mon Dec 8 16:02:51 2025
    From Newsgroup: comp.lang.c

    Richard Heathfield <rjh@cpax.org.uk> writes:
    On 07/12/2025 19:01, Richard Harnden wrote:
    On 06/12/2025 01:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not)
    needed?

    A text file is supposed to end with a '\n' (M$, of course,
    largely ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);

    Not guaranteed to work with binary files...

    7.19.9.2(3)

    A binary stream need not meaningfully support fseek calls with a
    whence value of SEEK_END.

    Not to mention that the ASCII LF character _is_ a valid binary
    character, so the presence or absence of an LF as the last byte of a file doesn't indicate anything useful.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Mon Dec 8 16:04:11 2025
    From Newsgroup: comp.lang.c

    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 07.12.2025 um 22:51 schrieb Richard Heathfield:
    On 07/12/2025 19:01, Richard Harnden wrote:
    On 06/12/2025 01:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not)
    needed?

    A text file is supposed to end with a '\n' (M$, of course, largely
    ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);

    Not guaranteed to work with binary files...

    7.19.9.2(3)

    A binary stream need not meaningfully support fseek calls with a
    whence value of SEEK_END.

    From the glibc Reference Manual:

    Has nothing to do with glibc. Dates back to the earliest
    days of unix, and is codified by POSIX/SUS.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:40:59 2025
    From Newsgroup: comp.lang.c

    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

    HTH

    Yes sir it really does. I'll study your post closely &
    don't think because my reply is brief that I'm not
    considering your words.

    Thank you Lew.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:46:22 2025
    From Newsgroup: comp.lang.c

    On Fri, 05 Dec 2025 17:42:30 -0800, Keith Thompson wrote:

    There is no completely reliable way to do this, but you might be
    able to make a reasonable guess. A binary file might happen to
    contain only byte values that represent printable characters.

    I suspected this was going to be the case actually.

    Please use the term "null bytes", not "NULL bytes". NULL is a standard
    macro that expands to a null pointer constant.

    Okay, will do.

    It seems odd to say that a file is assumed to be binary if you can't
    open it. I suggest having the function return more than two distinct
    values:

    - File seems to be binary
    - File seems to be text
    - Could be either
    - Something went wrong

    An enum is probably a good choice.

    Aye, that's an interesting way to look at it.
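
The four-way result suggested above might look roughly like this. The enum and function names are hypothetical, and the rule used here for "could be either" (an empty stream) is an assumption made for the sketch; the quoted post does not say when that value should be returned.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical result type covering the four outcomes listed above. */
enum file_kind {
    FILE_KIND_TEXT,     /* seems to be text */
    FILE_KIND_BINARY,   /* seems to be binary */
    FILE_KIND_UNKNOWN,  /* could be either (here: nothing to examine) */
    FILE_KIND_ERROR     /* something went wrong */
};

/* Classifies the remaining contents of an already opened stream. */
enum file_kind classify_stream(FILE *f)
{
    unsigned char buf[4096];
    size_t n, i;
    int seen_any = 0;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        seen_any = 1;
        for (i = 0; i < n; i++) {
            unsigned char c = buf[i];

            /* null byte, or a C0 control other than tab, LF, CR */
            if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
                return FILE_KIND_BINARY;
        }
    }
    if (ferror(f))
        return FILE_KIND_ERROR;
    return seen_any ? FILE_KIND_TEXT : FILE_KIND_UNKNOWN;
}

/* Opens path in binary mode and classifies it; open failure is an error. */
enum file_kind classify_file(const char *path)
{
    enum file_kind kind;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return FILE_KIND_ERROR;
    kind = classify_stream(f);
    fclose(f);
    return kind;
}
```

With this shape a caller can tell "could not open" apart from "looks binary", which the original 0/1 return value could not express.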

    0x00 -> '\0'
    0x20 -> ' '
    0x09 -> '\t'
    0x0A -> '\n'
    0x0D -> '\r'

    Well, I got too fancy there...

    Depending on how far you want to get into it, distinguishing between
    text and binary files is anywhere from difficult to literally
    impossible.

    Thanks for your expertise Keith, I appreciate your insight.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:48:13 2025
    From Newsgroup: comp.lang.c

    On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:

    How about:

    [...]

    You sir are an OCD coder =)
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:56:26 2025
    From Newsgroup: comp.lang.c

    On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

    It is the year 2025.

    How many times do you suppose someone has considered this question ?

    I'm not trying to be a smart ass by saying this, just that the
    question is bound to be nuanced. You can do a fast and totally
    inaccurate determination. You can do a computationally expensive
    or I/O expensive determination.

    I get it Paul, but as with all things, there's lots of opinions on this.

    There has to be a reason for doing this, and a damn good reason.

    *******

    There is the "file" command.

    It was invented in 1973.

    https://en.wikipedia.org/wiki/File_%28command%29

    The beauty of this command, is it has some sort of ordered
    approach to file determination.

    And... is not generally available on Windows & causes a 3rd party
    dependency. Not to say that you're not correct in your thinking
    but I want portability. And there are lots of things I want that
    don't always happen either...

    [...]

    Thanks Paul, actually I do appreciate your rant & the detailed examples
    you cite. I'm in the same place with my project, it can be very frustrating.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:02:26 2025
    From Newsgroup: comp.lang.c

    On Sat, 6 Dec 2025 20:37:22 -0500, James Kuyper wrote:

    NULL is a macro that expands to a null pointer constant. I think you
    mean "null character". This isn't just nit-picking - C is a
    case-sensitive language, so it's essential to pay attention to case.

    Oh yeah. I'm at the stage of simultaneously getting a lot wrong,
    a lot right, & that makes my code dangerous at times. I'm slowly
    getting there.

    You should return a distinct value for file open failure - a file that
    cannot be opened cannot be determined to be either a text or a binary file.

    Noted.

    You really cannot distinguish with certainty whether a file is a text
    file or a binary file based solely upon the contents. A file whose
    format is an array of two-byte 2's complement little-endian integers
    would normally be considered binary, yet it might happen to contain
    integers whose bytes all happen to be printable characters.

    Ah, I want it to be simple, but that's not the case.
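
A small illustration of the quoted point, assuming an ASCII execution character set. The helper name `store_le16` is hypothetical. Two 16-bit integers, written out as two-byte little-endian values, produce four bytes that a content-based check would accept as text.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: writes count 16-bit integers to out as two-byte
 * little-endian values (low byte first).  out must have room for
 * 2 * count bytes.
 */
void store_le16(const int16_t *vals, size_t count, unsigned char *out)
{
    size_t i;

    for (i = 0; i < count; i++) {
        uint16_t u = (uint16_t)vals[i];

        out[2 * i]     = (unsigned char)(u & 0xFF);  /* low byte  */
        out[2 * i + 1] = (unsigned char)(u >> 8);    /* high byte */
    }
}
```

For the values 26952 (0x6948) and 2593 (0x0A21), the output bytes are 0x48 0x69 0x21 0x0A, which in ASCII is the line "Hi!" followed by a newline, even though the data was written as "binary" integers.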

    This implies the use of the isprint() function; the only other
    characters you need to handle specifically are '\t', '\n', and ' '.
    Since the result returned by isprint() is locale-dependent, the program should, at least optionally, use setlocale().

    Hmm, now that's a curve-ball I did not see coming! I've got to think
    about this...

    Paul, thank you for sharing your knowledge, I appreciate your help sir.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:04:51 2025
    From Newsgroup: comp.lang.c

    On Sun, 7 Dec 2025 03:43:58 -0000 (UTC), Waldek Hebisch wrote:

    You are missing a definition: you should first decide what you consider
    to be a binary file (this is the hard part).

    Yes. This is it - everything right here Waldek, that is my entire
    problem.

    Thank you for your post, it is interesting reading.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:07:26 2025
    From Newsgroup: comp.lang.c

    On Sun, 7 Dec 2025 19:01:02 +0000, Richard Harnden wrote:

    A text file is supposed to end with a '\n' (M$, of course, largely
    ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);

    if ( (c = fgetc(f)) == '\n' )
    printf("Text\n");
    else
    printf("Binary\n");

    fclose(f);

    Be aware of false positives/negatives, because I'm sure there will be
    plenty :)

    Thank you Richard. Interesting thoughts.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:09:08 2025
    From Newsgroup: comp.lang.c

    On Sun, 7 Dec 2025 14:42:39 -0800, Chris M. Thomasson wrote:

    You can return a float from is_binary_file() to show a probability? Not exactly sure how you can 100% guarantee it...

    Ha!

    You know, that's a crazy idea but a darn cool idea at the same time!
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 19:27:25 2025
    From Newsgroup: comp.lang.c

    Am 08.12.2025 um 17:04 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 07.12.2025 um 22:51 schrieb Richard Heathfield:
    On 07/12/2025 19:01, Richard Harnden wrote:
    On 06/12/2025 01:05, Michael Sanders wrote:
    Am I close? Missing anything you'd consider to be (or not)
    needed?
    A text file is supposed to end with a '\n' (M$, of course, largely
    ignores this convention), but a quick test could be:

    f = fopen(path, "rb");

    fseek(f, -1, SEEK_END);
    Not guaranteed to work with binary files...

    7.19.9.2(3)

    A binary stream need not meaningfully support fseek calls with a
    whence value of SEEK_END.
    From the glibc Reference Manual:
    Has nothing to do with glibc. Dates back to the earliest
    days of unix, and is codified by POSIX/SUS.

    Where did I say that this is true for glibc only?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From bart@bc@freeuk.com to comp.lang.c on Mon Dec 8 18:44:33 2025
    From Newsgroup: comp.lang.c

    On 08/12/2025 18:04, Michael Sanders wrote:
    On Sun, 7 Dec 2025 03:43:58 -0000 (UTC), Waldek Hebisch wrote:

    You are missing a definition: you should first decide what you consider
    to be a binary file (this is the hard part).

    Yes. This is it - everything right here Waldek, that is my entire
    problem.

    It's not clear what the actual problem is. What is the use-case for a
    function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?

    Is the result /meant/ to be fuzzy?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Mon Dec 8 19:26:07 2025
    From Newsgroup: comp.lang.c

    On 2025-12-08, Michael Sanders <porkchop@invalid.foo> wrote:
    On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:

    How about:

    [...]

    You sir are an OCD coder =)

    At last, someone seems to have gotten the joke.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 20:36:17 2025
    From Newsgroup: comp.lang.c

    Am 06.12.2025 um 02:05 schrieb Michael Sanders:
    Am I close? Missing anything you'd consider to be (or not) needed?

    #include <stdio.h>

    /*
     * Checks if a file is likely a binary by examining its content
     * for NULL bytes (0x00) or unusual control characters.
     * Returns 0 if text, 1 if binary or file open failure.
     */

    int is_binary_file(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return 1; // cannot open file, treat as error/fail check

        unsigned char buf[65536];
        size_t n, i;

        while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
            for (i = 0; i < n; i++) {
                unsigned char c = buf[i];

                // 1. check for the NULL byte (strong indicator of binary data)
                if (c == 0x00) {
                    fclose(f);
                    return 1; // IS binary
                }

                // 2. check for C0 control codes (0x01-0x1F), excluding known
                // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
                if (c < 0x20) {
                    if (c != 0x09 && c != 0x0A && c != 0x0D) {
                        fclose(f);
                        return 1; // IS binary (contains unexpected control code)
                    }
                }
            }
        }

        fclose(f);
        return 0; // NOT binary
    }

    Much smaller and with error handling for free:

    bool binary( path pth )
    {
        ifstream ifs;
        ifs.exceptions( ios_base::badbit );
        ifs.open( pth, ios_base::binary | ios_base::ate );
        streampos pos = ifs.tellg();
        if( pos > (size_t)-1 ) // for 32 bit platforms with large files
            throw ios_base::failure( "file too large", error_code( (int)errc::file_too_large, generic_category() ) );
        string buf( (size_t)pos, 0 );
        ifs.seekg( 0 );
        ifs.read( buf.data(), buf.size() );
        auto check = []( unsigned char c ) { return c < 0x20 && c != '\r' && c != '\n' && c != '\t'; };
        return find_if( buf.begin(), buf.end(), check ) == buf.end();
    }

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Mon Dec 8 19:42:47 2025
    From Newsgroup: comp.lang.c

    On 08/12/2025 19:26, Kaz Kylheku wrote:
    On 2025-12-08, Michael Sanders <porkchop@invalid.foo> wrote:
    On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:

    How about:

    [...]

    You sir are an OCD coder =)

    At last, someone seems to have gotten the joke.

    An OCD coder would have remembered that fopen takes two
    parameters. :-o
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 20:50:06 2025
    From Newsgroup: comp.lang.c

    And if you like it fast:

    bool binary( path pth )
    {

        static vector<bool> valid = []()
            {
                vector<bool> ret( numeric_limits<unsigned char>::max() );
                for( size_t c = ret.size(); c--; )
                    ret[c] = c >= 0x20 || c == '\r' || c == '\n' || c == '\t';
                return ret;
            }();
        ifstream ifs;
        ifs.exceptions( ios_base::failbit | ios_base::badbit );
        ifs.open( pth, ios_base::binary | ios_base::ate );
        streampos pos = ifs.tellg();
        if( pos > (size_t)-1 )
            throw ios_base::failure( "file too large", error_code( (int)errc::file_too_large, generic_category() ) );
        string buf( (size_t)pos, 0 );
        ifs.seekg( 0 );
        ifs.read( buf.data(), buf.size() );
        return find_if( buf.begin(), buf.end(), []( unsigned char c ) { return !valid[c]; } ) == buf.end();
    }

    The cool thing about that is that the array valid is initialized only
    once and is thread-safe.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Mon Dec 8 20:16:52 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

    It is the year 2025.

    How many times do you suppose someone has considered this question ?

    I'm not trying to be a smart ass by saying this, just that the
    question is bound to be nuanced. You can do a fast and totally
    inaccurate determination. You can do a computationally expensive
    or I/O expensive determination.

    I get it Paul, but as with all things, there's lots of opinions on this.

    There has to be a reason for doing this, and a damn good reason.

    *******

    There is the "file" command.

    It was invented in 1973.

    https://en.wikipedia.org/wiki/File_%28command%29

    The beauty of this command, is it has some sort of ordered
    approach to file determination.

    And... is not generally available on Windows

    It is open source and could be built for windows.

    It's also included in any linux distribution running
    under WSL.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Mon Dec 8 14:43:58 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    [...]

    For yet another set of unreliable heuristics for guessing whether a file
    is text or binary, you can take a look at Perl's built-in "-T" and "-B"
    operators.

    The "-T" and "-B" tests work as follows. The first block
    or so of the file is examined to see if it is valid
    UTF-8 that includes non-ASCII characters. If so, it's a
    "-T" file. Otherwise, that same portion of the file is
    examined for odd characters such as strange control codes
    or characters with the high bit set. If more than a third
    of the characters are strange, it's a "-B" file; otherwise
    it's a "-T" file. Also, any file containing a zero byte
    in the examined portion is considered a binary file. (If
    executed within the scope of a use locale which includes
    "LC_CTYPE", odd characters are anything that isn't a
    printable nor space in the current locale.) If "-T" or
    "-B" is used on a filehandle, the current IO buffer is
    examined rather than the first block. Both "-T" and "-B"
    return true on an empty file, or a file at EOF when testing
    a filehandle. Because you have to read a file to do the "-T"
    test, on most occasions you want to use a "-f" against the
    file first, as in "next unless -f $file && -T $file".

    It's not clear how big a "block" is. For an empty file, both -T
    and -B are true. I don't know whether there are other cases where
    both are true, or where both are false.
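
A much simplified C sketch in the spirit of that description, not a reimplementation of Perl's operators. The function name `block_looks_binary` is hypothetical. It skips the UTF-8 validation step entirely, treats any byte with the high bit set as "odd", and uses its own guess at which control codes count as strange; the caller decides how large the examined block is.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if buf[0..n) looks binary, 0 if it looks
 * like text.  A zero byte decides immediately; otherwise the block is
 * called binary when more than a third of its bytes are "odd".
 * An empty block is reported as text here, whereas the description above
 * says Perl reports both -T and -B as true for an empty file.
 */
int block_looks_binary(const unsigned char *buf, size_t n)
{
    size_t i;
    size_t odd = 0;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c == 0x00)
            return 1;                       /* zero byte: binary */
        if (c >= 0x80 || c == 0x7F)
            odd++;                          /* high bit set, or DEL */
        else if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
            odd++;                          /* unexpected control code */
    }
    /* For integers, odd > n / 3 is equivalent to 3 * odd > n. */
    return odd > n / 3;
}
```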
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.lang.c on Tue Dec 9 09:03:36 2025
    From Newsgroup: comp.lang.c

    On 08/12/2025 21:16, Scott Lurndal wrote:
    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

    It is the year 2025.

    How many times do you suppose someone has considered this question ?

    I'm not trying to be a smart ass by saying this, just that the
    question is bound to be nuanced. You can do a fast and totally
    inaccurate determination. You can do a computationally expensive
    or I/O expensive determination.

    I get it Paul, but as with all things, there's lots of opinions on this.

    There has to be a reason for doing this, and a damn good reason.

    *******

    There is the "file" command.

    It was invented in 1973.

    https://en.wikipedia.org/wiki/File_%28command%29

    The beauty of this command, is it has some sort of ordered
    approach to file determination.

    And... is not generally available on Windows

    It is open source and could be built for windows.

    It's also included in any linux distribution running
    under WSL.


    It is available anywhere you find Windows ports of common *nix
    utilities, such as the msys2 project. (And while an msys2 installation
    can be quite large, it's possible to pull out individual utilities if
    you need to.) Still, it's fair to say that most Windows installations
    don't have it.

    But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Tue Dec 9 09:43:04 2025
    From Newsgroup: comp.lang.c

    On 09/12/2025 08:03, David Brown wrote:

    <snip>

    But surely on Windows you can just look at the file extension -
    if it is ".txt", it's a text file, otherwise it's a binary file.

    It is now almost a decade since I last made (approximately
    weekly) use of a Windows system. For the 25 years prior to that I
    used a variety of extensions for text filenames, including:

    txt - generic textfile
    doc - documentation*
    c - C source
    cpp - C++ source
    h - C or C++ header
    tex - LaTeX source
    ly - Lilypond source
    eml - email backup
    cfg - configuration files
    ini - initialisation files
    - Makefiles and READMEs
    sh - shell script
    asm - assembly language source
    i - C preprocessor output
    bin - binary (contains only '0', '1', and '\n') - I found fewer
    than a dozen of these, but there they were.

    These are, of course, all also binary files. Whether a file that
    contains only printable characters is text or binary is really a
    matter of perspective more than anything else.
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within
  • From Richard Harnden@richard.nospam@gmail.invalid to comp.lang.c on Tue Dec 9 10:17:25 2025
    From Newsgroup: comp.lang.c

    On 09/12/2025 09:43, Richard Heathfield wrote:

    ly  - Lilypond source

    Off topic, but ... Lilypond is a lovely thing :)



  • From tTh@tth@none.invalid to comp.lang.c on Tue Dec 9 12:22:21 2025
    From Newsgroup: comp.lang.c

    On 12/9/25 09:03, David Brown wrote:

    But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.

    And what about PNM files, which can be pure ASCII encoded
    but are image files?
    --
    ** **
    * tTh des Bourtoulots *
    * http://maison.tth.netlib.re/ *
    ** **
  • From Paul@nospam@needed.invalid to comp.lang.c on Tue Dec 9 06:38:47 2025
    From Newsgroup: comp.lang.c

    On Tue, 12/9/2025 3:03 AM, David Brown wrote:
    On 08/12/2025 21:16, Scott Lurndal wrote:
    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

    It is the year 2025.

    How many times do you suppose someone has considered this question ?

    I'm not trying to be a smart ass by saying this, just that the
    question is bound to be nuanced. You can do a fast and totally
    inaccurate determination. You can do a computationally expensive
    or I/O expensive determination.

    I get it Paul, but as with all things, there's lots of opinions on this.
    There has to be a reason for doing this, and a damn good reason.

    *******

    There is the "file" command.

    It was invented in 1973.

        https://en.wikipedia.org/wiki/File_%28command%29

    The beauty of this command, is it has some sort of ordered
    approach to file determination.

    And... is not generally available on Windows

    It is open source and could be built for windows.

    It's also included in any linux distribution running
    under WSL.


    It is available anywhere you find Windows ports of common *nix utilities, such as the msys2 project.  (And while an msys2 installation can be quite large, it's possible to pull out individual utilities if you need to.)  Still, it's fair to say that most Windows installations don't have it.

    But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.

    There are a couple ways to get it.

    The problem with this one, is /etc/magic is as old as the hills
    and does not have nearly as much capability. On the plus side,
    it's not going to burn your house down either.

    https://gnuwin32.sourceforge.net/packages/file.htm

    A second source, is Cygwin, but again, it might depend on
    when the port was done. Doing it this way has to be better
    than the previous link, just because the previous one is
    so old.

    https://cygwin.com/packages/summary/file.html

    And the Wiki on msys2 says this:

    "MSYS2 ("minimal system 2") is a software distribution and a
    development platform for Microsoft Windows, based on Mingw-w64 and Cygwin
    "

    It still means when the release was done, could matter.

    I started with Cygwin64. This is an example of an executable, but
    it relies on other dependencies.

    https://mirror.csclub.uwaterloo.ca/cygwin/x86_64/release/file/file-5.46-1-x86_64.tar.xz

    The installer is here.

    https://cygwin.com/setup-x86_64.exe

    # After installation, I checked the dependencies. This does not
    # help you find the /etc/magic file for its usage.

    $ cygcheck /usr/bin/file.exe
    C:\cygwin64\bin\file.exe
    C:\cygwin64\bin\cygmagic-1.dll
    C:\cygwin64\bin\cygbz2-1.dll
    C:\cygwin64\bin\cygwin1.dll
    C:\WINDOWS\system32\KERNEL32.dll
    C:\WINDOWS\system32\ntdll.dll
    C:\WINDOWS\system32\KERNELBASE.dll
    C:\cygwin64\bin\cyglzma-5.dll
    C:\cygwin64\bin\cygz.dll
    C:\cygwin64\bin\cygzstd-1.dll

    Testing did not go well. I tested the "find.exe" in Cygwin64
    and it did not finish. I used Process Monitor to see what it
    was doing, and there was a lot of registry activity. (There
    should not be registry activity by find.exe or file.exe )

    I tried the file.exe command and it didn't provide output
    and the machine hung. My machine never hangs. It's a model
    citizen. Windows Defender did not trip. An offline scan
    with Windows Defender did not find anything. This is possibly
    Process Monitor using all RAM, but that does not normally
    happen until 20 minutes or more have passed, and I was only
    running tracing for a minute or two.

    Cygwin materials are held on mirror sites, and I was using
    a mirror (University of Waterloo). For the time being, I would
    recommend some isolation while you test that.

    *******

    On to msys2.

    https://www.msys2.org/

    Name: msys2-x86_64-20250830.exe
    Size: 93,680,251 bytes (89 MiB)
    SHA256: B54705073678D32686A2CC356BB552363429E6CCBABBFECCB6D3CB7EC101E73B

    "Last analysis 22 hours ago", so it is likely someone in this thread triggered a retest.

    https://www.virustotal.com/gui/file/b54705073678d32686a2cc356bb552363429e6ccbabbfeccb6d3cb7ec101e73b [Clean]

    Install on disk is 350MB in C:\msys64

    https://www.msys2.org/docs/installer/

    C:/msys64/msys2_shell.cmd -defterm -here -no-start -ucrt64 # Do not run elevated (use the unelevated terminal)
    # Windows Terminal prompt changes color

    $ cd /c/msys64/usr/bin
    $ file.exe file.exe
    file.exe: PE32+ executable for MS Windows 5.02 (console), x86-64 (stripped to external PDB), 10 sections
    $ cd /s/disktype
    $ file disktype.exe
    disktype.exe: PE32 executable for MS Windows 4.00 (console), Intel i386, 16 sections # cygwin32 executable?
    # I change directory to the corrupted Sent file and check it with the msys2 version.
    $ file Sent
    Sent: Mailbox text, 1st line "From - Wed Nov 26 06:13:35 2008"
    # I compare to the WSL file command
    $ file Sent
    Sent: Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators # The corruption detection...

    This tells me the msys2 has an older version of magic determination on the file.exe command .

    And for the cygwin64, use the rubber gloves on it.
    It did not work as expected. Use your SafeHex handling
    techniques, until it proves in for you.

    Paul
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Tue Dec 9 15:09:11 2025
    From Newsgroup: comp.lang.c

    I made a little benchmark that compares the table code against the
    conventional &&-cascaded code. On my Zen4 PC the table code is about
    25% faster with clang. I'm doing an AVX2 and an AVX-512 version now.
    I guess they will be about 20 - 30 times faster.

    #include <iostream>
    #include <filesystem>
    #include <fstream>
    #include <algorithm>
    #include <chrono>
    #include <limits>
    #include <string>
    #include <system_error>
    #include <vector>

    using namespace std;
    using namespace filesystem;
    using namespace chrono;

    template<bool Table>
    bool binary( string const &buf );

    int main()
    {
        ifstream ifs;
        ifs.exceptions( ios_base::failbit | ios_base::badbit );
        ifs.open( "main.cpp", ios_base::binary | ios_base::ate );
        streampos pos = ifs.tellg();
        if( pos > (size_t)-1 )
            throw ios_base::failure( "file too large", error_code( (int)errc::file_too_large, generic_category() ) );
        string buf( (size_t)pos, 0 );
        ifs.seekg( 0 );
        ifs.read( buf.data(), buf.size() );
        binary<true>( buf );
        auto bench = [&]<bool Table>( bool_constant<Table> ) -> int
        {
            int ret = 0;
            auto start = high_resolution_clock::now();
            for( size_t r = 1'000'000; r; --r )
                ret += binary<Table>( buf );
            double secs = (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 1.0e9;
            cout << (Table ? "table" : "check") << ": " << secs << endl;
            return ret;
        };
        int ret = bench( false_type() );
        ret += bench( true_type() );
        return ret;
    }

    template<bool Table>
    bool binary( string const &buf )
    {
        // True for C0 control codes other than CR, LF and TAB.
        static auto invalid = []( unsigned char c ) static { return c < 0x20 && c != '\r' && c != '\n' && c != '\t'; };
        if constexpr( Table )
        {
            // One entry per possible byte value (0..255), hence max() + 1.
            static vector<char> invalidTbl = Table ? []()
                {
                    vector<char> ret( (size_t)numeric_limits<unsigned char>::max() + 1 );
                    for( size_t c = ret.size(); c--; )
                        ret[c] = invalid( (unsigned char)c );
                    return ret;
                }() : vector<char>();
            return find_if( buf.begin(), buf.end(), [&]( unsigned char c ) { return invalidTbl[c]; } ) == buf.end();
        }
        else
            return find_if( buf.begin(), buf.end(), invalid ) == buf.end();
    }
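
    For readers following along in C, the two variants being timed can be
    sketched roughly as below. This is only an illustration, not the
    benchmarked code: the names is_suspect_cascade, suspect_tbl,
    init_suspect_tbl and looks_like_text_tbl are made up here, and an
    ASCII-compatible execution character set is assumed.

```c
#include <stddef.h>

/* Cascaded test: a C0 control code other than TAB, LF or CR.
 * (Hypothetical helper name; assumes ASCII values for the escapes.) */
int is_suspect_cascade(unsigned char c)
{
    return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
}

/* 256-entry lookup table, one slot per possible byte value. */
unsigned char suspect_tbl[256];

/* Fills the table once by running the cascaded test on every byte value. */
void init_suspect_tbl(void)
{
    for (unsigned c = 0; c < 256; c++)
        suspect_tbl[c] = (unsigned char)is_suspect_cascade((unsigned char)c);
}

/* Returns 1 if buf contains no suspect byte, 0 otherwise.
 * init_suspect_tbl() must have been called first. */
int looks_like_text_tbl(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (suspect_tbl[buf[i]])
            return 0;
    return 1;
}
```

    The table trades three comparisons per byte for one indexed load; whether
    that is actually faster depends on the compiler and the CPU.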
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Tue Dec 9 17:31:09 2025
    From Newsgroup: comp.lang.c

    On Tue, 9 Dec 2025 06:38:47 -0500
    Paul <nospam@needed.invalid> wrote:
    On Tue, 12/9/2025 3:03 AM, David Brown wrote:
    On 08/12/2025 21:16, Scott Lurndal wrote:
    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

    It is the year 2025.

    How many times do you suppose someone has considered this
    question ?

    I'm not trying to be a smart ass by saying this, just that the
    question is bound to be nuanced. You can do a fast and totally
    inaccurate determination. You can do a computationally expensive
    or I/O expensive determination.

    I get it Paul, but as with all things, there's lots of opinions
    on this.
    There has to be a reason for doing this, and a damn good reason.

    *******

    There is the "file" command.

    It was invented in 1973.

    https://en.wikipedia.org/wiki/File_%28command%29

    The beauty of this command, is it has some sort of ordered
    approach to file determination.

    And... is not generally available on Windows

    It is open source and could be built for windows.

    It's also included in any linux distribution running
    under WSL.


    It is available anywhere you find Windows ports of common *nix
    utilities, such as the msys2 project. (And while an msys2
    installation can be quite large, it's possible to pull out
    individual utilities if you need to.) Still, it's fair to say that
    most Windows installations don't have it.

    But surely on Windows you can just look at the file extension - if
    it is ".txt", it's a text file, otherwise it's a binary file.

    There are a couple ways to get it.

    The problem with this one, is /etc/magic is as old as the hills
    and does not have nearly as much capability. On the plus side,
    it's not going to burn your house down either.

    https://gnuwin32.sourceforge.net/packages/file.htm

    A second source, is Cygwin, but again, it might depend on
    when the port was done. Doing it this way has to be better
    than the previous link, just because the previous one is
    so old.

    https://cygwin.com/packages/summary/file.html

    And the Wiki on msys2 says this:

    "MSYS2 ("minimal system 2") is a software distribution and a
    development platform for Microsoft Windows, based on Mingw-w64
    and Cygwin "

    It still means when the release was done, could matter.

    I started with Cygwin64. This is an example of an executable, but
    it relies on other dependencies.

    https://mirror.csclub.uwaterloo.ca/cygwin/x86_64/release/file/file-5.46-1-x86_64.tar.xz

    The installer is here.

    https://cygwin.com/setup-x86_64.exe

    # After installation, I checked the dependencies. This does not
    # help you find the /etc/magic file for its usage.

    $ cygcheck /usr/bin/file.exe
    C:\cygwin64\bin\file.exe
    C:\cygwin64\bin\cygmagic-1.dll
    C:\cygwin64\bin\cygbz2-1.dll
    C:\cygwin64\bin\cygwin1.dll
    C:\WINDOWS\system32\KERNEL32.dll
    C:\WINDOWS\system32\ntdll.dll
    C:\WINDOWS\system32\KERNELBASE.dll
    C:\cygwin64\bin\cyglzma-5.dll
    C:\cygwin64\bin\cygz.dll
    C:\cygwin64\bin\cygzstd-1.dll

    Testing did not go well. I tested the "find.exe" in Cygwin64
    and it did not finish. I used Process Monitor to see what it
    was doing, and there was a lot of registry activity. (There
    should not be registry activity by find.exe or file.exe )

    I tried the file.exe command and it didn't provide output
    and the machine hung. My machine never hangs. It's a model
    citizen. Windows Defender did not trip. An offline scan
    with Windows Defender did not find anything. This is possibly
    Process Monitor using all RAM, but that does not normally
    happen until 20 minutes or more have passed, and I was only
    running tracing for a minute or two.

    Cygwin materials are held on mirror sites, and I was using
    a mirror (University of Waterloo). For the time being, I would
    recommend some isolation while you test that.

    *******

    On to msys2.

    https://www.msys2.org/

    Name: msys2-x86_64-20250830.exe
    Size: 93,680,251 bytes (89 MiB)
    SHA256:
    B54705073678D32686A2CC356BB552363429E6CCBABBFECCB6D3CB7EC101E73B

    "Last analysis 22 hours ago", so it is likely someone in this thread triggered a retest.

    https://www.virustotal.com/gui/file/b54705073678d32686a2cc356bb552363429e6ccbabbfeccb6d3cb7ec101e73b
    [Clean]

    Install on disk is 350MB in C:\msys64

    https://www.msys2.org/docs/installer/

    C:/msys64/msys2_shell.cmd -defterm -here -no-start -ucrt64 # Do not run elevated (use the unelevated terminal)
    # Windows Terminal prompt changes color

    $ cd /c/msys64/usr/bin
    $ file.exe file.exe
    file.exe: PE32+ executable for MS Windows 5.02 (console), x86-64 (stripped to external PDB), 10 sections
    $ cd /s/disktype
    $ file disktype.exe
    disktype.exe: PE32 executable for MS Windows 4.00 (console), Intel i386, 16 sections # cygwin32 executable?
    # I change directory to the corrupted Sent file and check it with the msys2 version.
    $ file Sent
    Sent: Mailbox text, 1st line "From - Wed Nov 26 06:13:35 2008"
    # I compare to the WSL file command
    $ file Sent
    Sent: Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators # The corruption detection...

    Below is the list of files that I needed in order to run a copy of
    file.exe taken from msys2 on bare Windows:

    Directory of C:\tmp\tst
    12/09/2025  05:00 PM  <DIR>             .
    12/09/2025  04:53 PM  <DIR>             ..
    12/09/2025  04:54 PM            24,225  file.exe
    12/09/2025  05:00 PM        10,357,200  magic.mgc
    12/09/2025  04:57 PM         3,358,337  msys-2.0.dll
    12/09/2025  04:58 PM            67,277  msys-bz2-1.dll
    12/09/2025  04:58 PM           176,762  msys-lzma-5.dll
    12/09/2025  04:57 PM           160,362  msys-magic-1.dll
    12/09/2025  04:59 PM            88,576  msys-z.dll
    12/09/2025  04:58 PM         1,136,580  msys-zstd-1.dll
                   8 File(s)      15,369,319 bytes
                   2 Dir(s)  760,461,594,624 bytes free

    It's still less convenient than running from the msys2 prompt, because
    by default file.exe does not look for magic.mgc in the current
    directory. So I had to run it as
    'file.exe --magic-file magic.mgc my-files'
    This can be "solved" by a small wrapper batch file, unless that creates
    some other inconvenience.

    This tells me the msys2 has an older version of magic determination
    on the file.exe command .

    And for the cygwin64, use the rubber gloves on it.
    It did not work as expected. Use your SafeHex handling
    techniques, until it proves in for you.

    Paul

    I never tried cygwin64. For what I do, the level of compatibility
    provided by msys2 is sufficient.

    I do have the misfortune of using an old cygwin, because that is how
    Altera (then Intel, then Altera again) packages their Nios2 SDK. Over
    the years it (cygwin) has suffered from multiple issues caused by the
    usual malware that our company's IT stubbornly mistakes for anti-malware.
    The most recent example is the Trend Micro virus that they call
    "antivirus", which on a few installations (but not all of them) silently
    deletes some vital components of cygwin.

    Recently I was glad to discover that all components of said SDK that I
    care about don't actually need cygwin. They are either proper Windows
    exes, or bash, perl and python scripts. They work fine from the msys2
    prompt and are actually faster that way than from within the cygwin shell.
    So now I have a grand plan to gradually stop using the old cygwin altogether.
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Tue Dec 9 19:53:38 2025
    From Newsgroup: comp.lang.c

    On Mon, 8 Dec 2025 18:44:33 +0000, bart wrote:

    It's not clear what the actual problem is. What is the use-case for a function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?

    Is the result /meant/ to be fuzzy?

    Hey bart.

    What I mean is that since I have not yet defined a canonical standard
    for my program, the goal here (to determine if my code can parse the file)
    is unclear.

    It means I need to plan much more *before* I write more code, no mean feat
    when one is excited & ready to jump in =)
    --
    :wq
    Mike Sanders
  • From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Tue Dec 9 20:15:56 2025
    From Newsgroup: comp.lang.c

    On 2025-12-09, Richard Harnden <richard.nospam@gmail.invalid> wrote:
    On 09/12/2025 09:43, Richard Heathfield wrote:

    ly  - Lilypond source

    Off topic, but ... Lilypond is a lovely thing :)

    Some fifteen years ago, I banged up this in it:

    https://www.kylheku.com/~kaz/Prelude.pdf

    (Change "pdf" to "mid" for MIDI.)

    I imagine it must have improved quite a bit since then.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Tue Dec 9 12:45:24 2025
    From Newsgroup: comp.lang.c

    On 12/8/2025 10:09 AM, Michael Sanders wrote:
    On Sun, 7 Dec 2025 14:42:39 -0800, Chris M. Thomasson wrote:

    You can return a float from is_binary_file() to show a probability? Not
    exactly sure how you can 100% guarantee it...

    Ha!

    You know, that's a crazy idea but a darn cool idea at the same time!


    ;^)

    It would be funny with a return of .5, lol

    An error can be a negative result.
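
    A minimal sketch of that idea in C, assuming the "probability" is simply
    the fraction of bytes that fail the control-code test from the original
    post. The name binary_score is hypothetical, and the value is a crude
    score rather than a real probability.

```c
#include <stdio.h>

/*
 * Returns the fraction (0.0 .. 1.0) of bytes in the file that look
 * non-textual, or -1.0 if the file cannot be opened or read.
 * An empty file yields 0.0. (binary_score is a hypothetical name.)
 */
double binary_score(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1.0;                /* error reported as a negative result */

    unsigned char buf[4096];
    unsigned long long total = 0, suspect = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            unsigned char c = buf[i];
            /* C0 control codes (including NUL) other than TAB, LF, CR */
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
                suspect++;
        }
        total += n;
    }

    int failed = ferror(f);
    fclose(f);
    if (failed)
        return -1.0;
    return total ? (double)suspect / (double)total : 0.0;
}
```

    The caller then picks its own threshold, which at least makes the
    fuzziness of the answer explicit.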
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Tue Dec 9 16:23:07 2025
    From Newsgroup: comp.lang.c

    On Mon, 8 Dec 2025 18:44:33 +0000, bart wrote:

    It's not clear what the actual problem is. What is the use-case for a function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?

    Is the result /meant/ to be fuzzy?

    The fundamental problem is that no analysis of the contents can give
    you anything other than a fuzzy result. There's nothing more clearly a
    binary file than one that contains an array of binary floating point
    numbers. However, just by chance, the binary numbers it contains could
    happen to be such that every byte of that file can be interpreted as a
    text character. How could an analysis of only the file tell you, with
    certainty, that it wasn't a text file?
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Tue Dec 9 16:29:39 2025
    From Newsgroup: comp.lang.c

    On 2025-12-06 20:37, James Kuyper wrote:
    ...
    "Data read in from a text stream will necessarily compare equal to the
    data that were earlier written out to that stream only if: the data
    consist only of printing characters and the control characters
    horizontal tab and new-line; no new-line character is immediately
    preceded by space characters; and the last character is a new-line character." (7.23.2p2).

    I believe it therefore makes sense to consider something to be a text
    file if it meets those requirements, and otherwise is a binary file.
    Note that the last requirement implies that an empty file cannot qualify
    as text - at a minimum, it must contain a new-line character.

    This implies the use of the isprint() function; the only other
    characters you need to handle specifically are '\t', '\n', and ' '.
    Since the result returned by isprint() is locale-dependent, the program should, at least optionally, use setlocale().

    I just realized an annoying complication. Whatever
    implementation-specific method is used to indicate end-of-line can only
    be portably identified as such by opening the file in text mode and
    looking for the newline characters that it gets converted into. But
    because of 7.23.2p2, text mode cannot be relied upon for precisely the
    files we're trying to identify.
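
    For what it's worth, the conditions quoted from 7.23.2p2 can be checked
    on an in-memory buffer roughly as follows. This is a sketch only: it
    assumes the buffer already holds the characters as a text stream would
    deliver them (end-of-line as '\n'), which is exactly the part that
    cannot be done portably, and the name meets_text_stream_rules is made up.
    isprint() follows the current locale, so the caller decides whether to
    call setlocale() first.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Returns 1 if buf[0..n-1] meets the round-trip conditions quoted above:
 * only printing characters, '\t' and '\n'; no '\n' immediately
 * preceded by a space; last character is '\n'. Returns 0 otherwise.
 * An empty buffer fails, since it has no final new-line.
 */
int meets_text_stream_rules(const unsigned char *buf, size_t n)
{
    if (n == 0 || buf[n - 1] != '\n')
        return 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c == '\n') {
            /* a new-line must not directly follow a space */
            if (i > 0 && buf[i - 1] == ' ')
                return 0;
        } else if (c != '\t' && !isprint(c)) {
            return 0;
        }
    }
    return 1;
}
```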
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Tue Dec 9 21:38:57 2025
    From Newsgroup: comp.lang.c

    On Mon, 08 Dec 2025 14:43:58 -0800, Keith Thompson wrote:

    For yet another set of unreliable heuristics for guessing whether a file
    is text or binary, you can take a look at Perl's built-in "-T" and "-B" operators.

    I guess the key finding in all of these cases really is "unreliable".
    Heuristics are the only 'constant', i.e. an educated guess.

    I won't win this battle, I can see it coming & then as James pointed
    out, the ambiguous stuff with unicode...
    --
    :wq
    Mike Sanders
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Tue Dec 9 21:49:12 2025
    From Newsgroup: comp.lang.c

    On Mon, 8 Dec 2025 19:26:07 -0000 (UTC), Kaz Kylheku wrote:

    At last, someone seems to have gotten the joke.

    I had originally intended to reply (without the hints):

    c: 01100011
    h: 01101000
    u: 01110101
    c: 01100011
    k: 01101011
    l: 01101100
    e: 01100101

    But figured it could lead to a shellacking...
    --
    :wq
    Mike Sanders
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Tue Dec 9 15:42:59 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    On Mon, 8 Dec 2025 18:44:33 +0000, bart wrote:
    It's not clear what the actual problem is. What is the use-case
    for a function that tells you whether any file /might/ be a
    text-file based on speculative analysis of its contents? Is
    the result /meant/ to be fuzzy?

    Hey bart.

    What I mean is that since I have not yet defined a canonical
    standard for my program, the goal here (to determine if my code
    can parse the file) is unclear.

    It means I need to plan much more *before* I write more code, no
    mean feat when one is excited & ready to jump in =)

    You say you want to parse the file. That implies that you expect
    the file to have a certain format/syntax, and for parsing to fail
    on a file that doesn't satisfy the syntax. In that case, I
    speculate that determining whether the file is text or binary is
    not useful. The way to determine whether you can parse it is
    simply to try to parse it, and see whether that succeeds or fails.

    For example, if I want to parse a file containing a C translation
    unit, I can feed it to a C compiler (or just a parser if I have
    one). If the file contains non-text bytes, that's just a special
    case of a syntactically incorrect input, and the parser will
    detect it. It should work similarly for whatever format you're
    trying to parse. I doubt that you need to distinguish between
    incorrect input that's pure text and incorrect input that's
    "binary".

    If I'm right about this (which is by no means certain), you could
    have saved a lot of time by telling us up front *why* you want to
    distinguish between "text" and "binary" files. On the other hand,
    I've seized on the word "parse", and I may be reading too much
    into it.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Paul@nospam@needed.invalid to comp.lang.c on Tue Dec 9 20:26:47 2025
    From Newsgroup: comp.lang.c

    On Tue, 12/9/2025 6:22 AM, tTh wrote:
    On 12/9/25 09:03, David Brown wrote:

    But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.

       And what about PNM files, which can be pure ASCII encoded
       but are image files?


    teapot.ppm 196,623 bytes

    50 36 0A 32 35 36 20 32 35 36 0A 32 35 35 0A # P6
    # 256 256
    # 255
    13 5C C0 13 5C C0 13 5C C0 13 5C C0 13 5C C0 # binary byte tuples 0x13 0x5C 0xC0

    ******************************************************************

    teapot2.ppm 710,359 bytes

    P3 # P3 is the ASCII format option
    # Created by IrfanView # (How you change storage formats)
    256 256
    255
    19 92 192 19 92 192 19 92 192 19 92 192 19 92 192 # Plain ASCII digits (inefficient)

    PNM supports both ASCII and binary payloads.
    The magic value of P3 or P6 indicates the PPM payload types in the examples.

    ********************************************************************

    $ file *ppm
    teapot.ppm: Netpbm image data, size = 256 x 256, rawbits, pixmap
    teapot2.ppm: Netpbm image data, size = 256 x 256, pixmap, ASCII text
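
    The magic-number check itself is small. A sketch in C, where the
    function name pnm_payload_kind and its return codes are invented for
    this example; the P1/P2/P4/P5 variants come from the Netpbm formats in
    general rather than from the two files shown above. A stricter check
    would also require whitespace after the magic number.

```c
#include <stdio.h>

/*
 * Classifies a Netpbm file by its two-byte magic number.
 * Returns 1 for the plain (ASCII) variants P1..P3, 2 for the raw
 * (binary) variants P4..P6, 0 if the magic is not one of those,
 * and -1 if the file cannot be opened.
 */
int pnm_payload_kind(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    unsigned char magic[2];
    size_t got = fread(magic, 1, sizeof magic, f);
    fclose(f);

    if (got != 2 || magic[0] != 'P')
        return 0;
    if (magic[1] >= '1' && magic[1] <= '3')
        return 1;   /* plain: P1 (PBM), P2 (PGM), P3 (PPM) */
    if (magic[1] >= '4' && magic[1] <= '6')
        return 2;   /* raw:   P4 (PBM), P5 (PGM), P6 (PPM) */
    return 0;
}
```

    Note that a byte-scanning is_binary_file() would call teapot2.ppm text
    and teapot.ppm binary, even though both are the same kind of image.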

    Paul


  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Wed Dec 10 09:18:03 2025
    From Newsgroup: comp.lang.c

    Now I've developed a benchmark which tests the static comparison approach
    vs. the table approach vs. an AVX2 approach vs. an AVX-512 approach. These
    are the results with clang 20 (times in seconds):

        check: 2.17442
        table: 2.00056 (109%)
        AVX-256: 0.183048 (1093%, 1188%)
        AVX-512: 0.0639528 (286%, 3128%, 3400%)

    The numbers in brackets are the speedups relative to each of the preceding
    results. So the AVX-512 solution is 30+ times faster than the byte-wise
    solutions.

    This is the code:

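    #include <iostream>
    #include <filesystem>
    #include <fstream>
    #include <algorithm>
    #include <chrono>
    #include <span>
    #include <intrin.h>
    #include <array>
    #include <functional>
    #include <cstdint>
    #include <limits>
    #include <memory>
    #include <string>
    #include <system_error>
    #include <vector>
    #include "inline.h"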

    using namespace std;
    using namespace filesystem;
    using namespace chrono;

    template<bool Table>
    bool binary( string const &buf );
    template<bool Avx512>
    bool binaryAvx( string const &buf );

    int main()
    {
        ifstream ifs;
        ifs.exceptions( ios_base::failbit | ios_base::badbit );
        ifs.open( "main.cpp", ios_base::binary | ios_base::ate );
        streampos pos = ifs.tellg();
        if( pos > (size_t)-1 )
            throw ios_base::failure( "file too large", error_code( (int)errc::file_too_large, generic_category() ) );
        string buf( (size_t)pos, 0 );
        ifs.seekg( 0 );
        ifs.read( buf.data(), buf.size() );
        array<double, 4> results;
        using test_fn = function<bool ( string const & )>;
        auto bench = [&]( size_t i, char const *what, test_fn const &test ) L_FORCEINLINE -> int
        {
            int ret = 0;
            auto start = high_resolution_clock::now();
    #if defined(NDEBUG)
            constexpr size_t N = 1'000'000;
    #else
            constexpr size_t N = 1'000;
    #endif
            for( size_t r = N; r; --r )
                ret += test( buf );
            double secs = (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 1.0e9;
            cout << what << ": " << secs;
            results[i] = secs;
            if( i )
            {
                cout << " (";
                do
                {
                    cout << (int)(100.0 * results[--i] / secs + 0.5) << "%";
                    if( i )
                        cout << ", ";
                } while( i );
                cout << ")";
            }
            cout << endl;
            return ret;
        };
        struct test { char const *descr; test_fn fn; };
        array<test, 4> tests =
        {
            test( "check", +[]( string const &str ) -> int { return binary<false>( str ); } ),
            test( "table", +[]( string const &str ) -> int { return binary<true>( str ); } ),
            test( "AVX-256", +[]( string const &str ) -> int { return binaryAvx<false>( str ); } ),
            test( "AVX-512", +[]( string const &str ) -> int { return binaryAvx<true>( str ); } )
        };
        int ret = 0;
        for( size_t t = 0; test const &test : tests )
            ret += bench( t++, test.descr, test.fn );
        return ret;
    }

    template<bool Table>
    bool binary( string const &buf )
    {
        // True for C0 control codes other than CR, LF and TAB.
        static auto invalid = []( unsigned char c ) static { return c < 0x20 && c != '\r' && c != '\n' && c != '\t'; };
        if constexpr( Table )
        {
            // One entry per possible byte value (0..255), hence max() + 1.
            static vector<char> invalidTbl = Table ? []()
                {
                    vector<char> ret( (size_t)numeric_limits<unsigned char>::max() + 1 );
                    for( size_t c = ret.size(); c--; )
                        ret[c] = invalid( (unsigned char)c );
                    return ret;
                }() : vector<char>();
            return find_if( buf.begin(), buf.end(), [&]( unsigned char c ) { return invalidTbl[c]; } ) == buf.end();
        }
        else
            return find_if( buf.begin(), buf.end(), invalid ) == buf.end();
    }

    template<bool Avx512>
    bool binaryAvx( string const &buf )
    {
        char const
            *pBegin = buf.data(),
            *pEnd = pBegin + buf.size();
        if constexpr( Avx512 )
        {
            size_t
                head = (size_t)pBegin & 63,
                tail = (size_t)pEnd & 63;
            span<__m512i const> range( (__m512i *)(pBegin - head), (__m512i *)(pEnd - tail + (tail ? 64 : 0)) );
            __m512i const
                printable = _mm512_set1_epi8( (char)0x20 ),
                cr = _mm512_set1_epi8( (char)'\r' ),
                lf = _mm512_set1_epi8( (char)'\n' ),
                tab = _mm512_set1_epi8( (char)'\t' );
            uint64_t mask = (uint64_t)-1ll << head;
            auto cur = range.begin(), end = range.end();
            auto doChunk = [&]() -> bool
            {
                __m512i chunk = _mm512_loadu_epi8( (void *)to_address( cur ) );
                uint64_t
                    spaMask = _mm512_cmpge_epu8_mask( chunk, printable ),
                    crMask = _mm512_cmpeq_epi8_mask( chunk, cr ),
                    lfMask = _mm512_cmpeq_epi8_mask( chunk, lf ),
                    tabMask = _mm512_cmpeq_epi8_mask( chunk, tab );
                return ((spaMask | crMask | lfMask | tabMask) & mask) == mask;
            };
            for( ; cur != end - (bool)tail; ++cur, mask = -1ll )
                if( !doChunk() )
                    return false;
            if( tail )
            {
                mask = ~((uint64_t)-1ll << tail);
                if( !doChunk() )
                    return false;
            }
        }
        else
        {
            size_t
                head = (size_t)pBegin & 31,
                tail = (size_t)pEnd & 31;
            span<__m256i const> range( (__m256i *)(pBegin - head), (__m256i *)(pEnd - tail + (tail ? 32 : 0)) );
            __m256i const
                zero = _mm256_setzero_si256(),
                printable = _mm256_set1_epi8( (char)0xE0 ),
                cr = _mm256_set1_epi8( (char)'\r' ),
                lf = _mm256_set1_epi8( (char)'\n' ),
                tab = _mm256_set1_epi8( (char)'\t' );
            uint32_t mask = (uint32_t)-1 << head;
            auto cur = range.begin(), end = range.end();
            auto doChunk = [&]() -> bool
            {
                // plain AVX load; _mm256_loadu_epi8 requires AVX-512VL/BW
                __m256i chunk = _mm256_loadu_si256( to_address( cur ) );
                uint32_t
                    spaMask = ~_mm256_movemask_epi8( _mm256_cmpeq_epi8( _mm256_and_si256( chunk, printable ), zero ) ),
                    crMask = _mm256_movemask_epi8( _mm256_cmpeq_epi8( chunk, cr ) ),
                    lfMask = _mm256_movemask_epi8( _mm256_cmpeq_epi8( chunk, lf ) ),
                    tabMask = _mm256_movemask_epi8( _mm256_cmpeq_epi8( chunk, tab ) );
                return ((spaMask | crMask | lfMask | tabMask) & mask) == mask;
            };
            for( ; cur != end - (bool)tail; ++cur, mask = -1 )
                if( !doChunk() )
                    return false;
            if( tail )
            {
                mask = ~((uint32_t)-1 << tail);
                if( !doChunk() )
                    return false;
            }
        }
        return true;
    }

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Wed Dec 10 11:21:32 2025
    From Newsgroup: comp.lang.c

    On Tue, 9 Dec 2025 16:29:39 -0500
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-12-06 20:37, James Kuyper wrote:
    ...
    "Data read in from a text stream will necessarily compare equal to
    the data that were earlier written out to that stream only if: the
    data consist only of printing characters and the control characters
    horizontal tab and new-line; no new-line character is immediately
    preceded by space characters; and the last character is a new-line
    character." (7.23.2p2).

    I believe it therefore makes sense to consider something to be a
    text file if it meets those requirements, and otherwise is a binary
    file. Note that the last requirement implies that an empty file
    cannot qualify as text - at a minimum, it must contain a new-line
    character.

    This implies the use of the isprint() function; the only other
    characters you need to handle specifically are '\t', '\n', and ' '.
    Since the result returned by isprint() is locale-dependent, the
    program should, at least optionally, use setlocale().

    I just realized an annoying complication. Whatever
    implementation-specific method is used to indicate end-of-line can
    only be portably identified as such by opening the file in text mode
    and looking for the newline characters that it gets converted into.
    But because of 7.23.2p2, text mode cannot be relied upon for
    precisely the files we're trying to identify.

    Does not sound like a problem. According to my understanding, wide
    portability was never a part of the OP's spec.
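
    The 7.23.2p2 conditions quoted above can be sketched roughly as
    follows. This is only an illustration: the function name
    meets_text_stream_rules() is made up here, the stream is assumed to
    be already open, and isprint() uses whatever locale the caller has
    selected (the "C" locale unless setlocale() was called).

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the stream satisfies the round-trip
 * conditions of 7.23.2p2, 0 if it does not, -1 on a read error. */
int meets_text_stream_rules(FILE *f)
{
    int c, prev = EOF;

    while ((c = getc(f)) != EOF) {
        if (c == '\n') {
            if (prev == ' ')
                return 0;   /* space immediately before a new-line */
        } else if (c != '\t' && !isprint(c)) {
            return 0;       /* not a printing character, tab or new-line */
        }
        prev = c;
    }
    if (ferror(f))
        return -1;
    return prev == '\n';    /* empty file or missing final new-line: 0 */
}
```

    Note that an empty file yields 0 here, matching the observation that
    a qualifying file must contain at least one new-line character.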

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 11:35:48 2025
    From Newsgroup: comp.lang.c

    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    #include <stdio.h> // FILE, fopen, fread, fclose
    #include <stddef.h> // size_t

    // is_text_file()
    // Returns:
    // -1 : could not open file
    // 0 : is NOT a text file (binary indicators found)
    // 1 : is PROBABLY a text file (no strong binary signatures)

    int is_text_file(const char *path) {
    // Try opening the file in binary mode,
    // required so that bytes are read exactly.
    FILE *f = fopen(path, "rb");
    if (!f) return -1; // Could not open file

    unsigned char buf[4096]; // 4KB chunks
    size_t n, i;

    // Read in file until EOF
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
    for (i = 0; i < n; i++) {
    unsigned char c = buf[i];

    // 1. null byte is a very strong indication of binary data.
    // Text files virtually never contain 0x00.
    if (c == 0x00) {
    fclose(f);
    return 0; // Contains binary-only byte: NOT text
    }

    // 2. Check for raw C0 control codes (0x01–0x1F).
    // We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
    // Any other control code is highly suspicious and usually means binary.
    if (c < 0x20) {
    if (c != 0x09 && c != 0x0A && c != 0x0D) {
    fclose(f);
    return 0; // unexpected control character → NOT text
    }
    }

    // 3. NOTE: We intentionally do *not* reject bytes >= 0x80.
    // These occur in UTF-8, extended ASCII, and many local encodings.
    // Rejecting them would treat valid multilingual text as binary.
    // So we treat high bytes as acceptable for "probably text".
    }
    }

    fclose(f);
    return 1; // Probably text (no strong binary signatures found)
    }
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 11:38:47 2025
    From Newsgroup: comp.lang.c

    On Tue, 9 Dec 2025 16:29:39 -0500, James Kuyper wrote:

    [...]

    James, if you can manage a spare moment, see my reply
    to Lew, i.e. is_text_file()

    Would like your critique.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 11:41:00 2025
    From Newsgroup: comp.lang.c

    On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:

    [...]

    Keith if you get a chance see my reply to Lew 'is_text_file()'

    Let me know if I've inched closer a step or two...
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Wed Dec 10 15:07:30 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    The problem with all of your attempts is the performance
    issue. Success requires reading every single byte of the
    file, one byte at a time. The word 'slow' is not sufficient
    to describe how bad the performance will be for a very large
    file.

    At a minimum, dump the stdio double-buffered byte-by-byte
    algorithm and use mmap().

    In reality, I still don't see any benefit to this type of
    heuristic-based approach.
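
    For comparison, an mmap()-based variant of the posted check might
    look like the sketch below. It is POSIX-only, the name
    is_text_file_mmap() is invented for illustration, and it keeps the
    same return convention as the posted function. Whether it is
    actually faster than the stdio version is something to measure
    rather than assume.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical mmap() variant: -1 on failure, 0 if a disallowed control
 * byte is found, 1 otherwise. Assumes an ASCII-compatible encoding. */
int is_text_file_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 1; }  /* mmap rejects length 0 */

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED) return -1;

    int result = 1;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = p[i];
        if (c < ' ' && c != '\t' && c != '\n' && c != '\r') {
            result = 0;             /* NUL or another C0 control code */
            break;
        }
    }
    munmap(p, len);
    return result;
}
```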

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Wed Dec 10 15:58:41 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 11:35:48 +0000, Michael Sanders wrote:

    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    FWIW, my opinion doesn't matter in the measure of whether or not you
    have written a competent is_text_file() function; what matters is that
    it fits (or does not fit) the use-case you wrote it for. If it were me,
    I'd have a hard time writing this function, because I don't know your
    use-case, and I'd try to generalize it. I've worked with text files
    stored in ASCII, and in EBCDIC, and in various Unicode formats, and
    (god help me) in a bunch of other formats as well, and I'd have a hard
    time generalizing all that into a universal is_text_file() function.

    So, my real advice is to pick your battles, and document exactly what
    sort of text file you intend to look for with this function. What
    you've written might suit your needs exactly, without accounting for
    all the variations of what a text file consists of.


    #include <stdio.h> // FILE, fopen, fread, fclose
    #include <stddef.h> // size_t

    // is_text_file()
    // Returns:
    // -1 : could not open file
    // 0 : is NOT a text file (binary indicators found)
    // 1 : is PROBABLY a text file (no strong binary signatures)

    int is_text_file(const char *path) {
    // Try opening the file in binary mode,
    // required so that bytes are read exactly.
    FILE *f = fopen(path, "rb");
    if (!f) return -1; // Could not open file

    unsigned char buf[4096]; // 4KB chunks
    size_t n, i;

    // Read in file until EOF
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
    for (i = 0; i < n; i++) {
    unsigned char c = buf[i];

    // 1. null byte is a very strong indication of binary data.
    // Text files virtually never contain 0x00.

    Except for UTF16 and UTF32 text files, of course.

    So, part of your definition of what constitutes a text file is that
    a text file (at least as far as is_text_file() is concerned) does not
    contain any UTF16 or UTF32 characters.


    if (c == 0x00) {
    fclose(f);
    return 0; // Contains binary-only byte: NOT text
    }

    // 2. Check for raw C0 control codes (0x01–0x1F).
    // We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
    // Any other control code is highly suspicious and usually means binary.
    if (c < 0x20) {
    if (c != 0x09 && c != 0x0A && c != 0x0D) {

    Except for all the flavours of EBCDIC.

    So, another part of your definition of what constitutes a text file is that
    a text file (at least as far as is_text_file() is concerned) does not
    contain EBCDIC.

    fclose(f);
    return 0; // unexpected control character → NOT text
    }
    }

    // 3. NOTE: We intentionally do *not* reject bytes >= 0x80.
    // These occur in UTF-8, extended ASCII, and many local encodings.
    // Rejecting them would treat valid multilingual text as binary.
    // So we treat high bytes as acceptable for "probably text".

    Except for ASCII, which is limited to 7-bit characters between 0x00 and
    0x7f (ignoring, of course, those text files that store ASCII with even
    or odd parity).

    So, another part of your definition of what constitutes a text file is that
    a text file (at least as far as is_text_file() is concerned) may contain
    ASCII, but is not guaranteed to do so.

    }
    }

    fclose(f);
    return 1; // Probably text (no strong binary signatures found)
    }
    --
    Lew Pitcher
    "In Skills We Trust"
    Not LLM output - I'm just like this.
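
    One way to narrow the UTF-16/UTF-32 case is to look for a byte order
    mark before applying the NUL-byte rule. The sketch below is only an
    illustration: detect_bom() and the enum names are invented here, and
    files without a BOM are simply reported as BOM_NONE.

```c
#include <stddef.h>
#include <string.h>

enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE
};

/* Hypothetical helper: classifies the first bytes of a buffer by BOM.
 * A UTF-16LE file that starts with BOM + U+0000 is indistinguishable
 * from UTF-32LE and is reported as UTF-32LE. */
enum bom_kind detect_bom(const unsigned char *buf, size_t n)
{
    /* UTF-32LE must be tested before UTF-16LE: FF FE 00 00 begins FF FE. */
    if (n >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return BOM_UTF32LE;
    if (n >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return BOM_UTF32BE;
    if (n >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return BOM_UTF8;
    if (n >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return BOM_UTF16LE;
    if (n >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return BOM_UTF16BE;
    return BOM_NONE;
}
```

    A caller could run this on the first chunk read and skip the
    NUL-byte test when the result is one of the UTF-16 or UTF-32 kinds.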
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Wed Dec 10 19:00:38 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 15:07:30 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    The problem with all of your attempts is the performance
    issue. Success requires reading every single byte of the
    file, one byte at a time. The word 'slow' is not sufficient
    to describe how bad the performance will be for a very large
    file.

    At a minimum, dump the stdio double-buffered byte-by-byte
    algorithm and use mmap().


    I suggest doing actual speed measurements before making bold
    claims like the above. Don't trust your intuition!

    In reality, I still don't see any benefit to this type of
    heuristic-based approach.


    Neither do I. But OP is not doing it for us, but for himself.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Wed Dec 10 17:18:45 2025
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 10 Dec 2025 15:07:30 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    The problem with all of your attempts is the performance
    issue. Success requires reading every single byte of the
    file, one byte at a time. The word 'slow' is not sufficient
    to describe how bad the performance will be for a very large
    file.

    At a minimum, dump the stdio double-buffered byte-by-byte
    algorithm and use mmap().


    I suggest doing actual speed measurements before making bold
    claims like the above. Don't trust your intuition!

    I have, more than once, done such measurements after mmap()
    was introduced in SVR4 circa 1989 (ported from SunOS).

    On a single-user system, running a single job, the difference
    for smaller files is in the noise. For larger files, or when
    the system is heavily loaded or multiuser, it can be significant.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Wed Dec 10 12:46:36 2025
    From Newsgroup: comp.lang.c

    On 2025-12-10 06:35, Michael Sanders wrote:
    ...
    #include <stdio.h> // FILE, fopen, fread, fclose
    #include <stddef.h> // size_t

    // is_text_file()
    // Returns:
    // -1 : could not open file
    // 0 : is NOT a text file (binary indicators found)
    // 1 : is PROBABLY a text file (no strong binary signatures)

    int is_text_file(const char *path) {
    // Try opening the file in binary mode,
    // required so that bytes are read exactly.
    FILE *f = fopen(path, "rb");
    if (!f) return -1; // Could not open file

    unsigned char buf[4096]; // 4KB chunks
    size_t n, i;

    // Read in file until EOF
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
    for (i = 0; i < n; i++) {
    unsigned char c = buf[i];

    I'd recommend against buffering this; C stdio is already buffered, and
    it just complicates your code to keep track of a second level of
    buffering. Use getc() instead.

    // 1. null byte is a very strong indication of binary data.
    // Text files virtually never contain 0x00.
    if (c == 0x00) {
    fclose(f);
    return 0; // Contains binary-only byte: NOT text
    }

    // 2. Check for raw C0 control codes (0x01–0x1F).
    // We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
    // Any other control code is highly suspicious and usually means binary.
    if (c < 0x20) {
    if (c != 0x09 && c != 0x0A && c != 0x0D) {
    fclose(f);
    return 0; // unexpected control character → NOT text
    }
    }

    I would recommend against use of explicit numerical codes for
    characters. They make your code dependent upon a particular encoding,
    and you're free to make that choice, but for implementations where that
    encoding is the default, the corresponding C escape sequences will have
    precisely the correct value, and make it easier to understand what
    your code is doing:

    0x00 '\0'
    0x09 '\t'
    0x0A '\n'
    0x0D '\r'
    0x20 ' '
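
    Put together, the two suggestions (getc() instead of a second
    buffer, and character constants instead of numeric codes) might look
    like the sketch below. The name stream_looks_like_text() is invented
    for illustration. The comparison c < ' ' still assumes that control
    characters sort below space, which holds for ASCII but not for
    EBCDIC.

```c
#include <stdio.h>

/* Hypothetical helper: reads from the current position to end of stream.
 * Returns 1 if no disallowed control character is seen, 0 if one is,
 * -1 on a read error. */
int stream_looks_like_text(FILE *f)
{
    int c;

    while ((c = getc(f)) != EOF) {
        if (c == '\0')
            return 0;   /* null character */
        if (c < ' ' && c != '\t' && c != '\n' && c != '\r')
            return 0;   /* other control character */
    }
    return ferror(f) ? -1 : 1;
}
```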

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Wed Dec 10 12:48:06 2025
    From Newsgroup: comp.lang.c

    On 2025-12-10 04:21, Michael S wrote:
    On Tue, 9 Dec 2025 16:29:39 -0500
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
    ...
    I just realized an annoying complication. Whatever
    implementation-specific method is used to indicate end-of-line can
    only be portably identified as such by opening the file in text mode
    and looking for the newline characters that it gets converted into.
    But because of 7.23.2p2, text mode cannot be relied upon for
    precisely the files we're trying to identify.

    Does not sound like a problem. According to my understanding, wide
    portability was never a part of the OP's spec.

    His spec was unclear. At least part of my intent in raising these issues
    is to point out issues that he might not want to deal with, and which he
    can justify ignoring by specifying that his routine is not intended to
    deal with them.
    Thinking about this particular problem, I see no way to deal with it in
    general. Had I a need to write such a routine, I'd be happy to restrict
    the validity of my code to platforms where end-of-line is indicated
    by a single new-line character. However, I suspect he might need Windows
    compatibility, and might not need portability to Unix-like systems.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:41:22 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 11:35:48 -0000 (UTC), Michael Sanders wrote:

    Yes. Here's my 2nd attempt...

    [...]

    Last version for me (I have to pivot to other things).

    Main change is a look up table, ought to provide
    optional future extensibility...

    Earnest thanks to each & all =)

    #include <stdio.h> // FILE, fopen, fread, fclose
    #include <stddef.h> // size_t

    // is_text_file()
    // Returns:
    // -1 : could not open file
    // 0 : is NOT a text file (binary indicators found)
    // 1 : is PROBABLY a text file (no strong binary signatures)

    int is_text_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    unsigned char chunk[4096]; // 4KB
    size_t n, i;

    // Look Up Table: 1 = allowed in text, 0 = binary indicator
    // Allows TAB(0x09), LF(0x0A), CR(0x0D), printable ASCII (0x20–0x7E)
    static const unsigned char LUT[128] = {
    0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 0x00–0x0F
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0x10–0x1F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x20–0x2F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x30–0x3F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x40–0x4F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x50–0x5F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x60–0x6F
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 // 0x70–0x7F, last 0 = DEL
    };

    while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
    for (i = 0; i < n; i++) {
    if (chunk[i] < 128 && !LUT[chunk[i]]) {
    fclose(f);
    return 0; // binary indicator found
    }
    // bytes >= 128 are accepted as probably text
    }
    }

    fclose(f);
    return 1; // probably text
    }
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:42:47 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 15:07:30 GMT, Scott Lurndal wrote:

    The problem with all of your attempts is the performance
    issue. Success requires reading every single byte of the
    file, one byte at a time. The word 'slow' is not sufficient
    to describe how bad the performance will be for a very large
    file.

    At a minimum, dump the stdio double-buffered byte-by-byte
    algorithm and use mmap().

    In reality, I still don't see any benefit to this type of
    heuristic-based approach.

    Yeah agreed, it's one of those things...
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:44:15 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 15:58:41 -0000 (UTC), Lew Pitcher wrote:

    [...]

    Thanks Lew. I'm stumped, but learned a lot.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:45:39 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 12:46:36 -0500, James Kuyper wrote:

    I would recommend against use of explicit numerical codes for
    characters. They make your code dependent upon a particular encoding,
    and you're free to make that choice, but for implementations where that
    encoding is the default, the corresponding C escape sequences will have
    precisely the correct value, and make it easier to understand what
    your code is doing:

    0x00 '\0'
    0x09 '\t'
    0x0A '\n'
    0x0D '\r'
    0x20 ' '

    Aye, moving towards that (eventually).

    Thanks for your comments James.
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Wed Dec 10 19:42:24 2025
    From Newsgroup: comp.lang.c

    On 10/12/2025 17:18, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 10 Dec 2025 15:07:30 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    The problem with all of your attempts is the performance
    issue. Success requires reading every single byte of the
    file, one byte at a time. The word 'slow' is not sufficient
    to describe how bad the performance will be for a very large
    file.

    At a minimum, dump the stdio double-buffered byte-by-byte
    algorithm and use mmap().


    I suggest doing actual speed measurements before making bold
    claims like the above. Don't trust your intuition!

    I have, more than once, done such measurements after mmap()
    was introduced in SVR4 circa 1989 (ported from SunOS).

    On a single-user system, running a single job, the difference
    for smaller files is in the noise. For larger files, or when
    the system is heavily loaded or multiuser, it can be significant.

    1989 is 36 years ago. Technology has moved on. If reading your
    file is too slow, get yourself a real computer.

    On my very ordinary desktop machine, I just freq'd[1] a
    7,032,963,565-byte file in 12.256 seconds. That's 573,838,410
    bytes per second. It's a damn sight faster than I could do by hand.

    How, exactly, are you using `slow'?


    [1] Nothing fancy; a getc loop with ++pfm[ch].count written
    entirely in what used to be called clc-conforming code, and I can
    see at least one egregious inefficiency in the code that I can't
    be bothered to fix because half a gig a second is *easily* fast
    enough for my needs.
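
    A byte-frequency loop of the kind described in the footnote could be
    as small as the sketch below. It is not the program that was
    actually timed: count_bytes() is an invented name, a plain array
    stands in for the pfm[ch].count structure, and CHAR_BIT == 8 is
    assumed.

```c
#include <stdio.h>

/* Hypothetical helper: tallies how often each byte value occurs in the
 * rest of the stream and returns the total number of bytes read. */
unsigned long long count_bytes(FILE *f, unsigned long long counts[256])
{
    unsigned long long total = 0;
    int c;

    for (int i = 0; i < 256; i++)
        counts[i] = 0;
    while ((c = getc(f)) != EOF) {
        counts[c]++;        /* getc() yields 0..255 here, never negative */
        total++;
    }
    return total;
}
```

    Wrapping a call in two clock() readings from <time.h> and dividing
    the returned total by the elapsed seconds gives a bytes-per-second
    figure comparable to the one quoted above.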
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 20:57:49 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 18:41:22 -0000 (UTC), Michael Sanders wrote:

    Last version for me (I have to pivot to other things).

    [...]

    smaller look up table still + bit shifting!

    *fastest implementation yet* but virtually unreadable =(

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    // is_text_file()
    // Returns:
    // -1 : could not open file
    // 0 : is NOT a text file (binary indicators found)
    // 1 : is PROBABLY a text file (no strong binary signatures)

    int is_text_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    unsigned char chunk[4096];
    size_t n, i;

    // 128-bit bitmask (16 bytes × 8 bits / byte), 1=allowed, 0=disallowed
    // Allowed bytes: TAB(0x09), LF(0x0A), CR(0x0D), printable ASCII 0x20–0x7E

    static const uint8_t MASK[16] = {
    0x00, 0x26, 0x00, 0x00, // 0x00–0x1F: TAB(09), LF(0A), CR(0D)
    0xFF, 0xFF, 0xFF, 0xFF, // 0x20–0x3F: SPC!"#$%&'()*+,-./0123456789:;<=>?
    0xFF, 0xFF, 0xFF, 0xFF, // 0x40–0x5F: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    0xFF, 0xFF, 0xFF, 0x7F // 0x60–0x7F: `abcdefghijklmnopqrstuvwxyz{|}~ (DEL excluded)
    };

    while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
    for (i = 0; i < n; i++) {
    if (chunk[i] < 128 && !(MASK[chunk[i] >> 3] & (1 << (chunk[i] & 7)))) {
    fclose(f);
    return 0; // binary indicator found
    }
    // bytes >= 128 are accepted as probably text
    }
    }

    fclose(f);
    return 1; // probably text
    }
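
    A packed table like this can also be generated from the rule it
    encodes, which guards against transcription slips in the hex
    literals. A minimal sketch, assuming the same layout (bit c & 7 of
    byte c >> 3); build_text_mask() is an invented name:

```c
#include <stdint.h>

/* Hypothetical helper: sets bit (c & 7) of mask[c >> 3] for every
 * allowed 7-bit value c: TAB, LF, CR and printable ASCII 0x20..0x7E. */
void build_text_mask(uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        mask[i] = 0;
    for (int c = 0; c < 128; c++) {
        int allowed = c == 0x09 || c == 0x0A || c == 0x0D ||
                      (c >= 0x20 && c <= 0x7E);
        if (allowed)
            mask[c >> 3] |= (uint8_t)(1u << (c & 7));
    }
}
```

    With this layout, byte 1 of the generated table is 0x26 (bits 1, 2
    and 5 for TAB, LF and CR), bytes 4 to 14 are 0xFF, and byte 15 is
    0x7F because DEL is excluded.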
    --
    :wq
    Mike Sanders
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Wed Dec 10 22:07:24 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    On Wed, 10 Dec 2025 18:41:22 -0000 (UTC), Michael Sanders wrote:

    Last version for me (I have to pivot to other things).

    [...]

    smaller look up table still + bit shifting!

    *fastest implementation yet* but virtually unreadable =(

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    // is_text_file()
    // Returns:
    // -1 : could not open file
    // 0 : is NOT a text file (binary indicators found)
    // 1 : is PROBABLY a text file (no strong binary signatures)

    int is_text_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    unsigned char chunk[4096];
    size_t n, i;

    // 128-bit bitmask (16 bytes × 8 bits / byte), 1=allowed, 0=disallowed
    // Allowed bytes: TAB(0x09), LF(0x0A), CR(0x0D), printable ASCII 0x20–0x7E

    static const uint8_t MASK[16] = {
    0x00, 0x26, 0x00, 0x00, // 0x00–0x1F: TAB(09), LF(0A), CR(0D)
    0xFF, 0xFF, 0xFF, 0xFF, // 0x20–0x3F: SPC!"#$%&'()*+,-./0123456789:;<=>?
    0xFF, 0xFF, 0xFF, 0xFF, // 0x40–0x5F: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    0xFF, 0xFF, 0xFF, 0x7F // 0x60–0x7F: `abcdefghijklmnopqrstuvwxyz{|}~ (DEL excluded)
    };

    while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
    for (i = 0; i < n; i++) {
    if (chunk[i] < 128 && !(MASK[chunk[i] >> 3] & (1 << (chunk[i] & 7)))) {
    fclose(f);
    return 0; // binary indicator found
    }
    // bytes >= 128 are accepted as probably text

    Typically a soi-disant extended ASCII character set (e.g. ISO-8859-1)
    has the first 32 bytes starting at 128 defined as control characters.

    https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
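
    If the input is known to be ISO-8859-1, the C1 range could be
    flagged with a check like the sketch below; the function name is
    invented for illustration. It is not suitable for UTF-8 input,
    where the same byte values occur as ordinary continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf contains a byte in 0x80..0x9F,
 * the range ISO-8859-1 assigns to C1 control characters, else 0. */
int has_latin1_c1_controls(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            return 1;
    return 0;
}
```

    For example, the Latin-1 bytes for "café" (63 61 66 E9) pass, while
    the UTF-8 encoding of the euro sign (E2 82 AC) is flagged because of
    its 0x82 continuation byte.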
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From bart@bc@freeuk.com to comp.lang.c on Wed Dec 10 22:37:48 2025
    From Newsgroup: comp.lang.c

    On 10/12/2025 19:42, Richard Heathfield wrote:
    On 10/12/2025 17:18, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 10 Dec 2025 15:07:30 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    The problem with all of your attempts is the performance
    issue. Success requires reading every single byte of the
    file, one byte at a time. The word 'slow' is not sufficient
    to describe how bad the performance will be for a very large
    file.

    At a minimum, dump the stdio double-buffered byte-by-byte
    algorithm and use mmap().


    I suggest doing actual speed measurements before making bold
    claims like the above. Don't trust your intuition!

    I have, more than once, done such measurements after mmap()
    was introduced in SVR4 circa 1989 (ported from SunOS).

    On a single-user system, running a single job, the difference
    for smaller files is in the noise. For larger files, or when
    the system is heavily loaded or multiuser, it can be significant.

    1989 is 36 years ago. Technology has moved on. If reading your file is
    too slow, get yourself a real computer.

    On my very ordinary desktop machine, I just freq'd[1] a
    7,032,963,565-byte file in 12.256 seconds. That's 573,838,410 bytes per
    second. It's a damn sight faster than I could do by hand.

    How, exactly, are you using `slow'?


    A getc loop took 4.3 seconds to read a 192MB file from SSD, on my
    Windows PC.

    Under WSL it took 8.4 seconds (8.4/0.5 real/user).

    However reading it all in one go took 0.14 seconds.

    I guess not all 'getc' implementations are the same.
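
    The post does not say how the file was read "in one go". One
    portable way to do it in standard C is sketched below; read_all()
    is an invented name, and the buffer simply doubles until fread()
    comes up short.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the rest of a stream into a malloc'd buffer.
 * Returns the buffer (caller frees) and stores its length in *len;
 * returns NULL on allocation or read failure. */
unsigned char *read_all(FILE *f, size_t *len)
{
    size_t cap = 1u << 16, used = 0;
    unsigned char *buf = malloc(cap);
    if (!buf) return NULL;

    for (;;) {
        used += fread(buf + used, 1, cap - used, f);
        if (used < cap)
            break;                          /* short read: EOF or error */
        if (cap > (size_t)-1 / 2) { free(buf); return NULL; }
        unsigned char *tmp = realloc(buf, cap * 2);
        if (!tmp) { free(buf); return NULL; }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(f)) { free(buf); return NULL; }
    *len = used;
    return buf;
}
```

    Timing this against a getc() loop on the same file, with the file
    already in the OS cache in both runs, gives a fairer comparison
    than a single cold run of each.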


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Wed Dec 10 15:20:19 2025
    From Newsgroup: comp.lang.c

    Michael Sanders <porkchop@invalid.foo> writes:
    On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:

    [...]

    Keith if you get a chance see my reply to Lew 'is_text_file()'

    Let me know if I've inched closer a step or two...

    Closer to what exactly?

    In the parent article, I suggested that you likely don't need to
    determine whether a file is "text" or "binary". You said you want
    to parse a file. An attempt to parse it will fail either if the
    input is binary or if it's text that doesn't match the grammar you
    require. For example, a parser for C source code doesn't need to
    check whether the input is binary or text. Certain input
    characters will simply cause the parse to fail, and a syntax error
    can be reported. Tell us more about how you want to parse files.
    Are you parsing according to a formal grammar? Or is it more
    ad-hoc?
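
    To illustrate the point, here is a sketch of a parser for a made-up
    line-oriented grammar (name=value). The grammar and the name
    parse_config() are assumptions for the example, not the grammar
    under discussion. It needs no separate text/binary check: a NUL or
    other unexpected byte simply fails to match and is reported as a
    syntax error on that line.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical grammar, one entry per line:  name '=' value '\n'
 *   name  : [A-Za-z_][A-Za-z0-9_]*
 *   value : zero or more printable ASCII characters (0x20-0x7E)
 * Returns 0 if the whole buffer parses, else the 1-based number of the
 * first line that does not. */
size_t parse_config(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0, line = 1;

    while (i < len) {
        /* name */
        if (!(isalpha(buf[i]) || buf[i] == '_')) return line;
        while (i < len && (isalnum(buf[i]) || buf[i] == '_')) i++;

        /* '=' */
        if (i == len || buf[i] != '=') return line;
        i++;

        /* value */
        while (i < len && buf[i] >= 0x20 && buf[i] <= 0x7E) i++;

        /* end of line; a NUL or any other stray byte lands here and fails */
        if (i == len || buf[i] != '\n') return line;
        i++;
        line++;
    }
    return 0;
}
```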
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 23:59:44 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 15:20:19 -0800, Keith Thompson wrote:

    Michael Sanders <porkchop@invalid.foo> writes:
    On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:

    [...]

    Keith if you get a chance see my reply to Lew 'is_text_file()'

    Let me know if I've inched closer a step or two...

    Closer to what exactly?

    In the parent article, I suggested that you likely don't need to
    determine whether a file is "text" or "binary". You said you want
    to parse a file. An attempt to parse it will fail either if the
    input is binary or if it's text that doesn't match the grammar you
    require. For example, a parser for C source code doesn't need to
    check whether the input is binary or text. Certain input
    characters will simply cause the parse to fail, and a syntax error
    can be reported. Tell us more about how you want to parse files.
    Are you parsing according to a formal grammar? Or is it more
    ad-hoc?

    Yes I'm parsing a formal grammar (but a *really* small one).

    Yes I can parse binary/text just fine as you guessed.

    The matter at hand:

    I wanted to build a stand alone function that makes a solid guess as
    to whether a file would be considered an average text file or not.

    That's all...

    I've solved the issue to my satisfaction.
    --
    :wq
    Mike Sanders
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Thu Dec 11 01:09:59 2025
    From Newsgroup: comp.lang.c

    On Wed, 10 Dec 2025 22:07:24 GMT, Scott Lurndal wrote:

    Typically a soi-disant extended ASCII character set (e.g. ISO-8859-1)
    has the first 32 codes starting at 128 defined as control characters.

    https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout

    Many thanks Scott. Here's my final stab at the idea.

    Beware word-wrap...

    #include <stdio.h>
    #include <stdint.h>

    /*
    * is_text_file()
    *
    * Determines whether a file is "probably text" or binary, using a heuristic
    * based on mostly printable characters.
    *
    * Detection modes:
    * TEXT_LOOSE - Allows ASCII printable bytes (0x20–0x7E), TAB/LF/CR,
    * and all high-bit bytes (>=128). Tolerant for UTF-8 or
    * ISO-8859-1 text.
    * TEXT_STRICT - Rejects ASCII control characters (0x00–0x08, 0x0B–0x0C,
    * 0x0E–0x1F) and C1 controls (0x80–0x9F). Counts only
    * clearly printable bytes.
    * TEXT_ISO8859_1 - Accepts ASCII printable (0x20–0x7E), ISO-8859-1
    * printable bytes (0xA0–0xFF), and TAB/LF/CR. Rejects
    * C1 controls (0x80–0x9F).
    *
    * Returns:
    * 1 file is probably text (>=90% printable characters)
    * 0 file is probably binary (too many non-printable characters)
    * -1 empty file
    * -2 could not open file
    */

    typedef enum {
        TEXT_LOOSE,    // mostly printable: ASCII + high-bit
        TEXT_STRICT,   // stricter: reject C1 controls
        TEXT_ISO8859_1 // ISO-8859-1 printable (0x20–0x7E + 0xA0–0xFF)
    } text_mode_t;

    static const uint8_t MASK[16] = {
        0x00, 0x26, 0x00, 0x00, // 0x00–0x1F: only TAB(09), LF(0A), CR(0D)
        0xFF, 0xFF, 0xFF, 0xFF, // 0x20–0x3F: SPC !"#$%&'()*+,-./ 0–9 :;<=>?
        0xFF, 0xFF, 0xFF, 0xFF, // 0x40–0x5F: @ A–Z [\]^_
        0xFF, 0xFF, 0xFF, 0x7F  // 0x60–0x7F: ` a–z {|}~ (exclude DEL)
    };

    int is_text_file(const char *path, text_mode_t mode) {
        FILE *f = fopen(path, "rb");
        if (!f) return -2;

        unsigned char chunk[4096];
        uint64_t n, i, good = 0, total = 0;

        while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
            total += n;

            for (i = 0; i < n; i++) {
                unsigned char c = chunk[i];

                switch (mode) {
                case TEXT_LOOSE:
                    if (c >= 128 || (c < 128 && (MASK[c >> 3] & (1 << (c & 7))))) good++;
                    break;

                case TEXT_STRICT: // reject C1 controls 0x80–0x9F
                    if ((c >= 128 && c <= 159) || (c < 128 && !(MASK[c >> 3] & (1 << (c & 7))))) {
                        // bad byte, do not count...
                    } else good++;
                    break;

                case TEXT_ISO8859_1: // accept 0x20–0x7E + 0xA0–0xFF, reject C1 controls
                    if ((c >= 0x20 && c <= 0x7E) || (c >= 0xA0 && c <= 0xFF)
                        || c == 0x09 || c == 0x0A || c == 0x0D) { good++; }
                    break;
                }
            }
        }

        fclose(f);

        if (total == 0) return -1; // empty file

        return (good * 10 >= total * 9) ? 1 : 0; // 90% threshold
    }
    --
    :wq
    Mike Sanders
  • From Paul@nospam@needed.invalid to comp.lang.c on Wed Dec 10 22:35:53 2025
    From Newsgroup: comp.lang.c

    On Wed, 12/10/2025 5:37 PM, bart wrote:
    On 10/12/2025 19:42, Richard Heathfield wrote:
    On 10/12/2025 17:18, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 10 Dec 2025 15:07:30 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael Sanders <porkchop@invalid.foo> writes:
    On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
    I should have added that I feel that you probably haven't really
    defined /what/ "text file" means, and that has interfered with
    the development of this function. As Keith pointed out, the task
    of distinguishing between a "text" file and a "binary" file is not
    easy. I'll add that a lot of the difficulty stems from the fact
    that there are many definitions (some conflicting) of what a "text"
    file actually contains.

    Yes. Here's my 2nd attempt following the template (of thinking)
    you've suggested...

    The problem with all of your attempts is the performance
    issue.  Success requires reading every single byte of the
    file, one byte at a time.   The word 'slow' is not sufficient
    to describe how bad the performance will be for a very large
    file.

    At a minimum, dump the stdio double-buffered byte-by-byte
    algorithm and use mmap().


    I suggest to do actual speed measurements before making bold
    claims like above. Don't trust your intuition!

    I have, more than once, done such measurements after mmap()
    was introduced in SVR4 circa 1989 (ported from SunOS).

    On a single-user system, running a single job, the difference
    for smaller files is in the noise.   For larger files, or when
    the system is heavily loaded or multiuser, it can be significant.

    1989 is 36 years ago. Technology has moved on. If reading your file is too slow, get yourself a real computer.

    On my very ordinary desktop machine, I just freq'd[1] a 7,032,963,565-byte file in 12.256 seconds. That's 573,838,410 bytes per second. It's a damn sight faster than I could do by hand.

    How, exactly, are you using `slow'?


    A getc loop took 4.3 seconds to read a 192MB file from SSD, on my Windows PC.

    Under WSL it took 8.4 seconds (8.4/0.5 real/user).

    However reading it all in one go took 0.14 seconds.

    I guess not all 'getc' implementations are the same.

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    /* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */

    int main(int argc, char **argv)
    {
        FILE* source;

        int c; /* getc holder */
        const int size = 1000*1000*1000;
        char keep[size];
        int i=0;

        printf( "\nWelcome to getcbench.exe\n\n" );

        __int64 time1 = 0, time2 = 0, freq = 0; /* code added for timestamp */

        if (argc != 2) {
            fprintf(stderr, "Usage: %s source_file\n", argv[0]);
            return -1;
        }

        printf( "Array ready, opening file %s\n", argv[1] );

        source = fopen(argv[1], "rb");
        if (!source) {
            fprintf(stderr, "Could not open %s\n", argv[1]);
            return -1;
        }

        QueryPerformanceCounter((LARGE_INTEGER *) &time1); /* clock is running */
        QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
        printf("time1 = %llX freq = %lld \n", time1, freq);

        while ((c = getc(source)) != EOF) {
            keep[i++] = c;
            if (i >= size) break;
        }

        QueryPerformanceCounter((LARGE_INTEGER *) &time2);
        printf("time2 = %llX \n", time2);

        printf("Read %d bytes in %010.6f seconds\n", i, (float)(time2-time1)/freq);
    }

    $ getcbench.exe D:\test.txt # D: is capable of gigabytes per second speeds

    Welcome to getcbench.exe

    Array ready, opening file D:test.txt
    time1 = 3380876B31 freq = 10000000
    time2 = 338D011DCC
    Read 1000000000 bytes in 020.930217 seconds # Process Monitor shows that 4096 byte reads are being done

    $

    ***************************************************************

    This has additional gubbins.

    https://en.cppreference.com/w/c/io/setvbuf

    Add some code after the fopen.

    if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
    {
    fprintf(stderr, "setvbuf() failed\n\n" );
    return -1;
    }

    Process Monitor shows the reads now happen in 65536 chunks.

    But this does not do a thing for performance (with this style of I/O and no optimization).
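
    For reference, a sketch of how the setvbuf() call can be folded into
    a small open helper. open_buffered() is a hypothetical name; the
    only requirement taken from the C standard is that setvbuf() is
    called after the stream is opened and before any other operation
    on it.

```c
#include <assert.h>
#include <stdio.h>

/* Open a file for binary reading with a fully buffered stream of the
 * requested size. setvbuf() must be called before any other operation
 * on the stream. Returns NULL on failure. */
FILE *open_buffered(const char *path, size_t bufsize)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    if (setvbuf(f, NULL, _IOFBF, bufsize) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}
```

    Calling it with different sizes (4096, 65536, ...) makes it easy to
    repeat the comparison on another machine.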

    $ getcbenchbuf.exe D:\test.txt

    Welcome to getcbenchbuf.exe

    Array ready, opening file D:test.txt
    time1 = 37192A7827 freq = 10000000
    time2 = 37256FEAFA
    Read 1000000000 bytes in 020.587797 seconds

    ***************************************************************

    If I do this to the original program (-O2), it still is
    doing 4096 byte reads, but the performance is better.

    $ gcc -O2 -Wl,--stack,1200000000 -o getcbench.exe getcbench.c

    $ getcbench.exe D:\\test2.txt

    Welcome to getcbench.exe

    Array ready, opening file D:\test2.txt
    time1 = 3B4D7C1022 freq = 10000000
    time2 = 3B4E5EB775
    Read 1000000000 bytes in 001.485397 seconds

    Busy sum = FFFFFFFFE216FE9C

    Extra code was added so keep[] was not optimized away.

    for (k = 0; k<i; k++) sum += keep[k];
    printf("Busy sum = %llX\n", sum);

    That's about 673MB/sec.

    The version with the setvbuf, is still reading 65536 byte chunks.

    $ gcc -O2 -Wl,--stack,1200000000 -o getcbenchbuf.exe getcbenchbuf.c

    $ getcbenchbuf.exe D:\\test2.txt

    Welcome to getcbenchbuf.exe

    Array ready, opening file D:\test2.txt
    time1 = 3C1EA5ACDF freq = 10000000
    time2 = 3C1F49CE7D
    Read 1000000000 bytes in 001.075651 seconds

    Busy sum = FFFFFFFFE216FE9C

    That's getting close to a gigabyte per second.

    Summary: The -O2 makes a BIG difference.
    No idea how it is cheating.

    Paul
  • From bart@bc@freeuk.com to comp.lang.c on Thu Dec 11 11:46:19 2025
    From Newsgroup: comp.lang.c

    On 11/12/2025 03:35, Paul wrote:
    On Wed, 12/10/2025 5:37 PM, bart wrote:

    A getc loop took 4.3 seconds to read a 192MB file from SSD, on my Windows PC.

    Under WSL it took 8.4 seconds (8.4/0.5 real/user).

    However reading it all in one go took 0.14 seconds.

    I guess not all 'getc' implementations are the same.

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    /* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */

    int main(int argc, char **argv)
    { FILE* source;

    int c; /* getc holder */
    const int size = 1000*1000*1000;
    char keep[size];

    I didn't see the point of either keeping the array on the stack, or
    using a VLA. I made it static. That also allowed me a choice of
    compilers with no special options needed.

    Add some code after the fopen.

    if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
    {
    fprintf(stderr, "setvbuf() failed\n\n" );
    return -1;
    }

    When I added that, it slowed it down! Maybe it was already using a
    bigger buffer.

    Extra code was added so keep[] was not optimized away.

    My loop didn't store the characters anywhere; it just bumped a count.

    I think it was enough that it was calling an external function, 'getc';
    a compiler can't optimise that away.

    Read 1000000000 bytes in 001.075651 seconds

    Busy sum = FFFFFFFFE216FE9C

    That's getting close to a gigabyte per second.

    Summary: The -O2 makes a BIG difference.
    No idea how it is cheating.

    Have a look at the generated assembly: is it still making an actual call
    to 'getc', or has it been inlined?

    In my case -O2 made little difference, and it was still calling getc().
    -O2 can't affect such a precompiled function, unless getc() is not
    really an external function: either a macro, or a wrapper.

    Also, the generated EXE file actually imports getc from msvcrt.dll,
    which is a library not known to be performant.
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Thu Dec 11 12:53:13 2025
    From Newsgroup: comp.lang.c

    Am 11.12.2025 um 12:46 schrieb bart:
    On 11/12/2025 03:35, Paul wrote:
    On Wed, 12/10/2025 5:37 PM, bart wrote:

    A getc loop took 4.3 seconds to read a 192MB file from SSD, on my
    Windows PC.

    Under WSL it took 8.4 seconds (8.4/0.5 real/user).

    However reading it all in one go took 0.14 seconds.

    I guess not all 'getc' implementations are the same.

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    /* gcc -Wl,--stack,1200000000  -o getcbench.exe getcbench.c */

    int main(int argc, char **argv)
    {  FILE* source;

        int c;                                      /* getc holder */
        const int size = 1000*1000*1000;
        char keep[size];

    I didn't see the point of either keeping the array on the stack, or
    using a VLA. I made it static. That also allowed me a choice of
    compilers with no special options needed.
    Yes. Under Linux/x64 the default stack size is 8 MiB, under Windows/x64
    one MiB.
    That's a stack overflow - or should I call it underflow, since it grows
    downwards - for sure.

    Add some code after the fopen.

        if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
        {
             fprintf(stderr, "setvbuf() failed\n\n" );
             return -1;
        }

    When I added that, it slowed it down! Maybe it was already using a
    bigger buffer.

    Extra code was added so keep[] was not optimized away.

    My loop didn't store the characters anywhere; it just bumped a count.

    I think it was enough that it was calling an external function,
    'getc'; a compiler can't optimise that away.

    Read 1000000000 bytes in 001.075651 seconds

    Busy sum = FFFFFFFFE216FE9C

    That's getting close to a gigabyte per second.

    Summary: The -O2 makes a BIG difference.
              No idea how it is cheating.

    Have a look at the generated assembly: is it still making an actual
    call to 'getc', or has it been inlined?

    In my case -O2 made little difference, and it was still calling
    getc(). -O2 can't affect such a precompiled function, unless getc() is
    not really an external function: either a macro, or a wrapper.

    Also, the generated EXE file actually imports getc from msvcrt.dll,
    which is a library not known to be performant.


  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Thu Dec 11 12:33:14 2025
    From Newsgroup: comp.lang.c

    On Thu, 11 Dec 2025 01:09:59 -0000 (UTC), Michael Sanders wrote:

    [...]

    if (c >= 128 || (c < 128 && (MASK[c >> 3] & (1 << (c & 7))))) good++;

    [...]

    Thinking about it more, the bit-twiddling method, while fast,
    is certainly not very readable/maintainable. Those who might
    want to use any of the variations I've written will be best
    served by the one shown below. It lacks some of the bells & whistles
    of the prior offering, but sometimes that's a good thing.

    Note: If you keep map[] 'out in the open' (globally exposed)
    it's only set up once at runtime instead of on every call...

    Well off to work for me.

    #include <stdio.h>
    #include <stdint.h>

    /*
    * is_text_file()
    *
    * Determines whether a file is 'probably text' based on ISO-8859-1 rules.
    * Uses a precomputed lookup table for fast byte validation.
    *
    * Valid bytes:
    * - ASCII printable: 0x20–0x7E
    * - ISO-8859-1 high printable: 0xA0–0xFF
    * - Whitespace/control: TAB (0x09), LF (0x0A), CR (0x0D)
    *
    * Invalid bytes (binary indicators):
    * - NULL byte (0x00)
    * - C0 controls (0x01–0x08, 0x0B–0x0C, 0x0E–0x1F)
    * - DEL (0x7F)
    * - C1 controls (0x80–0x9F)
    *
    * Returns:
    * 1 - file is considered text
    * 0 - file is considered binary
    * -1 - could not open file
    */

    static const uint8_t map[256] = {
    0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 80
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 90
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
    };

    int is_text_file(const char *path) {

        FILE *f = fopen(path, "rb");
        if (!f) return -1; // could not open file

        // larger chunk size means less 'touching' the drive
        unsigned char chunk[65536];
        size_t n, i;

        while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
            for (i = 0; i < n; i++) {
                if (!map[chunk[i]]) {
                    fclose(f);
                    return 0; // binary detected
                }
            }
        }

        fclose(f);
        return 1; // probably text
    }
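
    A sketch of an alternative to typing the table by hand: fill it from
    the rules listed in the comment. build_map_iso8859_1() and
    is_text_buffer() are hypothetical names, and the buffer-based check
    is only there to make the table easy to try without a file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-entry table with the ISO-8859-1 rules described above:
 * 1 for TAB, LF, CR, 0x20-0x7E and 0xA0-0xFF; 0 for everything else. */
void build_map_iso8859_1(uint8_t map[256])
{
    assert(map != NULL);
    for (unsigned c = 0; c < 256; c++) {
        map[c] = (c == 0x09 || c == 0x0A || c == 0x0D
               || (c >= 0x20 && c <= 0x7E)
               || c >= 0xA0);
    }
}

/* Return 1 if every byte of buf is allowed by map, 0 otherwise. */
int is_text_buffer(const unsigned char *buf, size_t n, const uint8_t map[256])
{
    assert(map != NULL);
    for (size_t i = 0; i < n; i++)
        if (!map[buf[i]]) return 0;
    return 1;
}
```

    The table would be filled once at startup and then passed around,
    at the cost of no longer being const.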

    // eof
    --
    :wq
    Mike Sanders
  • From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Thu Dec 11 17:33:43 2025
    From Newsgroup: comp.lang.c

    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    Hi Michael,

    I contract for the defense industry and badly need this function!

    I am working with proposed code like:

    if (is_binary_file(arg))
    launch_nuclear_strike();

    So I'm really sweating over the implementation, as you can imagine.

    This thread has been very helpful.

    I'm still leaning toward my paranoid function which just checks that
    every bit of every byte is either 0 or 1 to confirm that the binary
    system is used.

    In the I/O error case, I will cautiously return a true value; we would
    not want our side to lose due to a storage hardware issue.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Thu Dec 11 19:10:03 2025
    From Newsgroup: comp.lang.c

    Please take my AVX-512 code.
    It's so fast that your nuclear strike hits first and you won't get hit
    by the enemy.

    Am 11.12.2025 um 18:33 schrieb Kaz Kylheku:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?
    Hi Michael,

    I contract for the defense industry and badly need this function!

    I am working with proposed code like:

    if (is_binary_file(arg))
    launch_nuclear_strike();

    So I'm really sweating over the implementation, as you can imagine.

    This thread has been very helpful.

    I'm still leaning toward my paranoid function which just checks that
    every bit of every byte is either 0 or 1 to confirm that the binary
    system is used.

    In the I/O error case, I will cautiously return a true value; we would
    not want our side to lose due to a storage hardware issue.


  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Thu Dec 11 14:56:34 2025
    From Newsgroup: comp.lang.c

    On 12/11/2025 9:33 AM, Kaz Kylheku wrote:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    Hi Michael,

    I contract for the defense industry and badly need this function!

    I am working with proposed code like:

    if (is_binary_file(arg))
    launch_nuclear_strike();

    any launch_biotoxic_strike(...) in there?

    ;^) rofl.



    So I'm really sweating over the implementation, as you can imagine.

    This thread has been very helpful.

    I'm still leaning toward my paranoid function which just checks that
    every bit of every byte is either 0 or 1 to confirm that the binary
    system is used.

    In the I/O error case, I will cautiously return a true value; we would
    not want our side to lose due to a storage hardware issue.


    oh my! ;^D
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Thu Dec 11 18:15:15 2025
    From Newsgroup: comp.lang.c

    On 2025-12-11 12:33, Kaz Kylheku wrote:
    ...
    I'm still leaning toward my paranoid function which just checks that
    every bit of every byte is either 0 or 1 to confirm that the binary
    system is used.
    I'd be very interested in seeing how you implement that test, and even
    more interested in what the test data looks like that you use to confirm
    that a failure of that test is correctly flagged. :-)

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Fri Dec 12 02:19:17 2025
    From Newsgroup: comp.lang.c

    On 2025-12-11 18:33, Kaz Kylheku wrote:
    On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
    Am I close? Missing anything you'd consider to be (or not) needed?

    Hi Michael,

    I contract for the defense industry and badly need this function!

    I am working with proposed code like:

    if (is_binary_file(arg))
    launch_nuclear_strike();

    else
    negotiate_peace_conditions(arg);

    I think it's a waste of information to identify some 'arg' as text
    and not assume it to be a negotiation proposal for peace treaties!

    Or would that be considered just unnecessary feature creep? - Just
    bloating the code and having negative impact on runtime performance?
    (A few milliseconds could certainly make a difference here between
    victory or defeat!)


    [...]

    In the I/O error case, I will cautiously return a true value; we would
    not want our side to lose due to a storage hardware issue.

    A very considerate decision. Kudos!

    Janis

    LOL - you made my day, Kaz!

  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Fri Dec 12 19:25:41 2025
    From Newsgroup: comp.lang.c

    On Thu, 11 Dec 2025 12:33:14 -0000 (UTC), Michael Sanders wrote:

    static const uint8_t map[256] = {...

    added 'plugin' maps...

    #include <stdio.h>
    #include <stdint.h>

    /*
    * map_strict[]
    *
    * Valid bytes:
    * - ASCII printable: 0x20–0x7E
    * - ISO-8859-1 high printable: 0xA0–0xFF
    * - Whitespace/control: TAB (0x09), LF (0x0A), CR (0x0D)
    *
    * Invalid bytes (binary indicators):
    * - NULL byte (0x00)
    * - C0 controls (0x01–0x08, 0x0B–0x0C, 0x0E–0x1F)
    * - DEL (0x7F)
    * - C1 controls (0x80–0x9F)
    */

    static const uint8_t map_strict[256] = {
    0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 80
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 90
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
    };

    /*
    * map_loose[]
    *
    * Valid bytes:
    * - ASCII printable characters: 0x20–0x7E
    * - Whitespace/control characters: TAB (0x09), LF (0x0A), CR (0x0D)
    * - High bytes: 0x80–0xFF
    *
    * Invalid bytes (binary indicators):
    * - NULL byte: 0x00
    * - C0 control codes: 0x01–0x08, 0x0B–0x0C, 0x0E–0x1F
    * - DEL character: 0x7F
    */

    static const uint8_t map_loose[256] = {
    0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 80
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 90
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
    };

    /*
    * is_text_file()
    *
    * just plug in your own map[]...
    *
    * Returns:
    * 1 - text
    * 0 - binary
    * -1 - could not open
    */

    int is_text_file(const char *path, const uint8_t map[256]) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1; // could not open file

        // 4KB: 4096, 8KB: 8192, 16KB: 16384, 32KB: 32768, 64KB: 65536
        unsigned char buf[65536];
        size_t n, i;

        while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
            for (i = 0; i < n; i++) {
                if (!map[buf[i]]) {
                    fclose(f);
                    return 0; // not text (binary indicators)
                }
            }
        }

        fclose(f);
        return 1; // probably text
    }
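
    A sketch of how a custom map might be plugged in, for example to
    tolerate form feed (0x0C) and ESC (0x1B). is_text_stream() and
    map_allow() are hypothetical helpers, not part of the code above;
    the stream variant performs the same scan but leaves opening and
    closing the file to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stream-based variant of the same scan, so the caller owns the FILE. */
int is_text_stream(FILE *f, const uint8_t map[256])
{
    assert(f != NULL && map != NULL);
    unsigned char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++)
            if (!map[buf[i]]) return 0;
    return 1;
}

/* Copy a base map and additionally allow the listed bytes. */
void map_allow(uint8_t dst[256], const uint8_t base[256],
               const unsigned char *extra, size_t count)
{
    memmove(dst, base, 256);
    for (size_t i = 0; i < count; i++) dst[extra[i]] = 1;
}
```

    Starting from a copy of an existing map keeps the custom variant to
    a couple of lines instead of another 256-entry table.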

    // eof
    --
    :wq
    Mike Sanders
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Fri Dec 12 22:54:58 2025
    From Newsgroup: comp.lang.c

    On Fri, 12 Dec 2025 19:25:41 -0000 (UTC), Michael Sanders wrote:

    [...]

    Done.

    Features...

    - plugin maps
    - follows symlinks
    - rejects directories, devices, sockets

    #include <stdio.h>
    #include <stdint.h>
    #include <sys/stat.h>

    /*
    * map_strict[]
    *
    * Valid bytes:
    * - ASCII printable: 0x20–0x7E
    * - ISO-8859-1 high printable: 0xA0–0xFF
    * - Whitespace/control: TAB (0x09), LF (0x0A), CR (0x0D)
    *
    * Invalid bytes (binary indicators):
    * - NULL byte (0x00)
    * - C0 controls (0x01–0x08, 0x0B–0x0C, 0x0E–0x1F)
    * - DEL (0x7F)
    * - C1 controls (0x80–0x9F)
    */

    static const uint8_t map_strict[256] = {
    0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 80
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 90
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
    };

    /*
    * map_loose[]
    *
    * Valid bytes:
    * - ASCII printable characters: 0x20–0x7E
    * - Whitespace/control characters: TAB (0x09), LF (0x0A), CR (0x0D)
    * - High bytes: 0x80–0xFF
    *
    * Invalid bytes (binary indicators):
    * - NULL byte: 0x00
    * - C0 control codes: 0x01–0x08, 0x0B–0x0C, 0x0E–0x1F
    * - DEL character: 0x7F
    */

    static const uint8_t map_loose[256] = {
    0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 80
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 90
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
    };

    /*
    * is_text_file()
    *
    * just plug in your own map[]...
    *
    * Returns:
    * 1 - text
    * 0 - binary indicator
    * -1 - could not open
    */

    int is_text_file(const char *path, const uint8_t map[256]) {

        // now we follow symlinks...
        struct stat st;
        if (stat(path, &st) != 0) return -1; // cannot access file
        if (!S_ISREG(st.st_mode)) return -1; // reject: directories/devices/sockets

        FILE *f = fopen(path, "rb");
        if (!f) return -1; // could not open file

        // 4KB: 4096, 8KB: 8192, 16KB: 16384, 32KB: 32768, 64KB: 65536
        unsigned char buf[16384];
        size_t n, i;

        while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
            for (i = 0; i < n; i++) {
                if (!map[buf[i]]) {
                    fclose(f);
                    return 0; // not text (binary indicator detected)
                }
            }
        }

        fclose(f);
        return 1; // probably text
    }
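
    One possible refinement, sketched under the assumption of a POSIX
    system: check the file type on the stream that was actually opened,
    using fstat() on fileno(f), rather than calling stat() on the path
    first. open_regular() is a hypothetical helper name. Symlinks are
    still followed, because fopen() follows them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Open path for binary reading only if it is a regular file.
 * The type check runs on the already-open descriptor, so the file
 * cannot be swapped between the check and the open. Returns NULL
 * if the file cannot be opened or is not a regular file. */
FILE *open_regular(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    struct stat st;
    if (fstat(fileno(f), &st) != 0 || !S_ISREG(st.st_mode)) {
        fclose(f);
        return NULL;
    }
    return f;
}
```

    One caveat: opening a FIFO this way can block until a writer
    appears, so the stat()-first version rejects FIFOs sooner.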

    // eof
    --
    :wq
    Mike Sanders
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Fri Dec 12 15:33:01 2025
    From Newsgroup: comp.lang.c

    On 12/12/2025 2:54 PM, Michael Sanders wrote:
    [...]
    fclose(f);
    return 1; // probably text
    }

    define the probability? Say in 0...1?

    [...]

  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Sat Dec 13 00:20:48 2025
    From Newsgroup: comp.lang.c

    On Fri, 12 Dec 2025 15:33:01 -0800, Chris M. Thomasson wrote:

    On 12/12/2025 2:54 PM, Michael Sanders wrote:
    [...]
    fclose(f);
    return 1; // probably text
    }

    define the probability? Say in 0...1?

    [...]

    Add it Chris & I'll roll it in =)

    Me? I'd go with steps of say, 10% just to
    make it human-friendly, but that's just me.
    --
    :wq
    Mike Sanders
  • From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Sat Dec 13 02:32:57 2025
    From Newsgroup: comp.lang.c

    On Sat, 13 Dec 2025 00:20:48 -0000 (UTC), Michael Sanders wrote:

    On Fri, 12 Dec 2025 15:33:01 -0800, Chris M. Thomasson wrote:

    On 12/12/2025 2:54 PM, Michael Sanders wrote:
    [...]
    fclose(f);
    return 1; // probably text
    }

    define the probability? Say in 0...1?

    [...]

    Add it Chris & I'll roll it in =)

    Me? I'd go with steps of say, 10% just to
    make it human-friendly, but that's just me.

    just thinking out loud about probabilities...

    int is_text_file(const char *path, const uint8_t map[256], int probability)
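
    A rough sketch of what a probability parameter could look like,
    working on a buffer to keep it short. text_percent() and
    is_text_buffer_pct() are hypothetical names, and treating
    'probability' as a whole-number percentage of accepted bytes is an
    assumption; steps of 10 would fit the human-friendly idea above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Percentage (0-100, rounded down) of bytes in buf that map accepts,
 * or -1 for an empty buffer. */
int text_percent(const unsigned char *buf, size_t n, const uint8_t map[256])
{
    assert(map != NULL);
    if (n == 0) return -1;

    size_t good = 0;
    for (size_t i = 0; i < n; i++)
        if (map[buf[i]]) good++;

    return (int)(good * 100 / n);
}

/* 1 if at least `probability` percent of the bytes are accepted,
 * 0 if not, -1 for an empty buffer. */
int is_text_buffer_pct(const unsigned char *buf, size_t n,
                       const uint8_t map[256], int probability)
{
    int pct = text_percent(buf, n, map);
    if (pct < 0) return -1;
    return pct >= probability;
}
```

    Passing 100 reproduces the all-or-nothing behaviour of the map
    versions; lower values tolerate the odd stray byte.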
    --
    :wq
    Mike Sanders