Forum: War Ensemble BBS

is_binary_file()

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Sat Dec 6 01:05:44 2025

From Newsgroup: comp.lang.c

Am I close? Missing anything you'd consider to be (or not) needed?

#include <stdio.h>

/*
 * Checks if a file is likely a binary by examining its content
 * for NULL bytes (0x00) or unusual control characters.
 * Returns 0 if text, 1 if binary or file open failure.
 */

int is_binary_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 1; // cannot open file, treat as error/fail check

    unsigned char buf[65536];
    size_t n, i;

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        for (i = 0; i < n; i++) {
            unsigned char c = buf[i];

            // 1. check for the NULL byte (strong indicator of binary data)
            if (c == 0x00) {
                fclose(f);
                return 1; // IS binary
            }

            // 2. check for C0 control codes (0x01-0x1F), excluding known
            // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
            if (c < 0x20) {
                if (c != 0x09 && c != 0x0A && c != 0x0D) {
                    fclose(f);
                    return 1; // IS binary (contains unexpected control code)
                }
            }
        }
    }

    fclose(f);
    return 0; // NOT binary
}
--
:wq
Mike Sanders
--- Synchronet 3.21a-Linux NewsLink 1.2

From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 01:41:28 2025

From Newsgroup: comp.lang.c

On Sat, 06 Dec 2025 01:05:44 +0000, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

First off, until we get computers that store file data in formats
other than binary, /all/ files (text or not) are "binary" files
(meaning that an is_binary_file() function should always return true).
OTOH, "text files" are a distinguishable subset of binary files.
I suggest that this makes an "is_text_file()" function more valuable
and more fitting than an "is_binary_file()" function.

Secondly, ISTM that the function should return a unique failure value
rather than overload the "is binary" return value. After all, you
actually have three return values: is_text, is_not_text, and
is_indeterminate (because of file access failure).

Thirdly, your determination of whether or not the file contains text
seemingly depends only on the existence or absence of certain control
characters. But text isn't just control characters; so you need a test
for invalid non-control characters as well. And, IIRC, not all control
characters occupy the ASCII/Unicode C0 band, so you might have to expand
your "acceptable control character" test to include some of those other
control codes.

Finally, you've hardcoded the binary values for certain acceptable
ASCII/Unicode control characters. However, not all platforms use ASCII
or Unicode, and these tests would fail to test the corresponding character
value correctly (I think here of EBCDIC, where "Line Feed" doesn't exist
but its equivalent "NewLine" is 0x15 and Horizontal Tab is 0x05). Better
here to use the C equivalent escape characters '\n' and '\t' instead.
You may also consider expanding the control-character test to include
other line-formatting characters (at least as far as C will allow):
Vertical Tab ('\v'), Form Feed ('\f'), Carriage Return ('\r') and
Backspace ('\b').
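
As a rough sketch of that last suggestion, the acceptable formatting
characters can be tested by their C escape sequences rather than by
hard-coded values. The function name is hypothetical, and which
characters belong in the set is a choice, not a rule.

```c
#include <stdbool.h>

/* Hypothetical helper: true if c is one of the formatting control
 * characters that C can name portably with an escape sequence.
 * The compiler supplies the right value for the execution character
 * set, whether that is ASCII or EBCDIC. */
bool is_text_format_char(int c)
{
    switch (c) {
    case '\t':   /* horizontal tab */
    case '\n':   /* new-line */
    case '\v':   /* vertical tab */
    case '\f':   /* form feed */
    case '\r':   /* carriage return */
    case '\b':   /* backspace */
        return true;
    default:
        return false;
    }
}
```

A caller would treat any other control character as a sign that the
data is not text.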

int is_binary_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return 1; // cannot open file, treat as error/fail check

unsigned char buf[65536];
size_t n, i;

while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
for (i = 0; i < n; i++) {
unsigned char c = buf[i];

// 1. check for the NULL byte (strong indicator of binary data)
if (c == 0x00) {
fclose(f);
return 1; // IS binary
}

// 2. check for C0 control codes (0x01-0x1F), excluding known
// text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {
fclose(f);
return 1; // IS binary (contains unexpected control code)
}
}
}
}

fclose(f);
return 0; // NOT binary
}

--
Lew Pitcher
"In Skills We Trust"
Not LLM output - I'm just like this.

From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Fri Dec 5 17:42:30 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:

Am I close? Missing anything you'd consider to be (or not) needed?

There is no completely reliable way to do this, but you might be
able to make a reasonable guess. A binary file might happen to
contain only byte values that represent printable characters.

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

Please use the term "null bytes", not "NULL bytes". NULL is a standard
macro that expands to a null pointer constant.

int is_binary_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return 1; // cannot open file, treat as error/fail check

It seems odd to say that a file is assumed to be binary if you can't
open it. I suggest having the function return more than two distinct
values:

- File seems to be binary
- File seems to be text
- Could be either
- Something went wrong

An enum is probably a good choice.

unsigned char buf[65536];
size_t n, i;

while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {

Since you're only looking at individual characters, you might as well
read one character at a time. The stdio functions will buffer the input
for you, so there won't be much loss of performance.

for (i = 0; i < n; i++) {
unsigned char c = buf[i];

// 1. check for the NULL byte (strong indicator of binary data)

"null byte", not "NULL byte".

if (c == 0x00) {
fclose(f);
return 1; // IS binary
}

// 2. check for C0 control codes (0x01-0x1F), excluding known
// text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {

This test will detect '\0' bytes, making your first check redundant.

fclose(f);
return 1; // IS binary (contains unexpected control code)
}

You're assuming an ASCII-based character set, which is very
probably a safe assumption. But I'd suggest replacing most of
the hex constants with character constants. Aside from being more
portable (realistically EBCDIC systems are the only case where it
will matter), it makes the code more readable. And things like
UTF-8 and UTF-16 make things a lot more complicated.

0x00 -> '\0'
0x20 -> ' '
0x09 -> '\t'
0x0A -> '\n'
0x0D -> '\r'

}
}
}

fclose(f);

fclose(f) can fail. That's not likely, but you should check.

return 0; // NOT binary
}

You treat an empty file as text. That's not entirely unreasonable,
but you should at least document it.

You assume that a binary file is one that contains any byte values
in the range 0..31 other than '\t', '\n', and '\r'. So a "text"
file can't contain formfeed characters (debatable), but it can
contain DEL characters and anything above 127.

For Latin-1, values from 0xa0 to 0xff are printable (0xa0 is
NO-BREAK SPACE, so that might be debatable). For UTF-8, bytes with
values 0x80 and higher can be valid, but only in certain contexts.
And so on.
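
To make the "only in certain contexts" point concrete, here is a
minimal sketch of a UTF-8 well-formedness check over a buffer. The
function name is hypothetical. It only checks the encoding, so it
still accepts control characters and zero bytes below 0x80. The hex
constants are deliberate here, because UTF-8 is defined in terms of
byte values.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..len) is well-formed UTF-8.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * forms, surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;         /* continuation bytes expected */
        unsigned long cp;     /* decoded code point */
        unsigned long min;    /* smallest value allowed for this length */

        if (c < 0x80) {                    /* single-byte (ASCII range) */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation byte or invalid lead */
        }

        if (len - i <= extra)
            return false;   /* sequence runs past the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;   /* not a 10xxxxxx continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

A lone Latin-1 byte such as 0xE9 fails this check, because in UTF-8
it announces a three-byte sequence that never arrives.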

Depending on how far you want to get into it, distinguishing between
text and binary files is anywhere from difficult to literally
impossible.

Take a look at the "file" command.
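
Pulling together the suggestions above (a multi-valued result,
reading one character at a time, character constants, and checking
fclose), a minimal sketch might look like this. The enum and function
names are hypothetical, and the `c < ' '` comparison still assumes an
ASCII-based character set.

```c
#include <stdio.h>

/* Hypothetical result type; the names are illustrative only. */
enum file_kind {
    FILE_KIND_TEXT,     /* nothing suspicious was seen */
    FILE_KIND_BINARY,   /* an unexpected control character was seen */
    FILE_KIND_EMPTY,    /* no data at all: could be either */
    FILE_KIND_ERROR     /* open, read, or close failed */
};

/* Classify an already-open stream, one character at a time.
 * stdio buffers the input, so getc() is not a per-byte system call. */
enum file_kind classify_stream(FILE *f)
{
    int c;
    int seen_any = 0;

    while ((c = getc(f)) != EOF) {
        seen_any = 1;
        /* Assumes an ASCII-based character set; also catches '\0'. */
        if (c < ' ' && c != '\t' && c != '\n' && c != '\r')
            return FILE_KIND_BINARY;
    }
    if (ferror(f))
        return FILE_KIND_ERROR;
    return seen_any ? FILE_KIND_TEXT : FILE_KIND_EMPTY;
}

/* Open the file, classify it, and report a failed fclose() as an error. */
enum file_kind classify_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    enum file_kind kind;

    if (f == NULL)
        return FILE_KIND_ERROR;
    kind = classify_stream(f);
    if (fclose(f) != 0)
        kind = FILE_KIND_ERROR;
    return kind;
}
```

Like the original, this treats DEL and every byte above 127 as text,
so it is still only a guess.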
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 02:00:22 2025

From Newsgroup: comp.lang.c

On Sat, 06 Dec 2025 01:41:28 +0000, Lew Pitcher wrote:

On Sat, 06 Dec 2025 01:05:44 +0000, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

First off, until we get computers that store file data in formats
other than binary, /all/ files (text or not) are "binary" files
(meaning that an is_binary_file() function should always return true).
OTOH, "text files" are a distinguishable subset of binary files.
I suggest that this makes an "is_text_file()" function more valuable
and more fitting than an "is_binary_file()" function.

Secondly, ISTM that the function should return a unique failure value
rather than overload the "is binary" return value. After all, you
actually have three return values: is_text, is_not_text, and
is_indeterminate (because of file access failure).

[snip]

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

The best advice I can give here is that you should pick a definition
of what a text file consists of, document /that/ definition, and
use /that/ documentation to build your code. If you say that, for
instance, EBCDIC is out of scope, then your code does not have to
handle EBCDIC (but if you /don't/ say that, then you leave your code
open to the ambiguity of whether or not it will work with EBCDIC).
Likewise for ASCII or "Extended ASCII" (sic) or Unicode (or 6Bit
(multiple different choices here) or Baudot or even Morse).

With suitable definitions beforehand, you can write an acceptable
"is_text_file()" function and/or a passable "is_binary_file()"
function.

HTH
--
Lew Pitcher
"In Skills We Trust"
Not LLM output - I'm just like this.

From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Sat Dec 6 02:42:39 2025

From Newsgroup: comp.lang.c

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {

[ ... ]

fclose(f);
return 0; // NOT binary
}

How about:

#include <limits.h>
#include <stdio.h>

int is_binary_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    int yes = 0;

    if (f) {
        int ch;

        while ((ch = getc(f)) != EOF) {
            for (int i = 0; i < CHAR_BIT; i++, ch >>= 1) {
                switch ((ch & 1)) {
                case 0:
                case 1:
                    break;
                default:
                    goto out;
                }
            }
        }

        // TODO: distinguish feof/ferror
        yes = 1;
    out:

        fclose(f);
    }

    return yes;
}
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

From Paul@nospam@needed.invalid to comp.lang.c on Sat Dec 6 03:14:55 2025

From Newsgroup: comp.lang.c

On Fri, 12/5/2025 8:05 PM, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return 1; // cannot open file, treat as error/fail check

unsigned char buf[65536];
size_t n, i;

while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
for (i = 0; i < n; i++) {
unsigned char c = buf[i];

// 1. check for the NULL byte (strong indicator of binary data)
if (c == 0x00) {
fclose(f);
return 1; // IS binary
}

// 2. check for C0 control codes (0x01-0x1F), excluding known
// text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {
fclose(f);
return 1; // IS binary (contains unexpected control code)
}
}
}
}

fclose(f);
return 0; // NOT binary
}

It is the year 2025.

How many times do you suppose someone has considered this question ?

I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.

There has to be a reason for doing this, and a damn good reason.

*******

There is the "file" command.

It was invented in 1973.

https://en.wikipedia.org/wiki/File_%28command%29

The beauty of this command, is it has some sort of ordered
approach to file determination.

Originally, as I understand it (I don't see it in the Wiki), it
was not supposed to read more than 1024 bytes of the file. This
was because the command was intended to settle file determinations
for "ordered types". For example, an MSWord doc, might have four
unique bytes near the beginning of the file. The designers felt
they could quickly "sort" or "determine" what kind of highly
stylized file they were dealing with.

But the results I got one day a couple years ago, suggests
they have strayed from that. I got around 100 different text
file declarations. For example, a text file with a binary block
in it as a "corruption", it is declared as a text file, but
the word "ISO something or other" is part of the file type
determination. Thus, when I see a certain file on my computer
is no longer a plain text file, but contains the word ISO,
then I must scroll through it with a hex editor and see what
the hell has triggered this determination.

The experience suggested the entire text file was being read.
I did not craft any tests to see if that was true.

Some file types receive very little differentiation. There is
only the one detection for them, the detection offers no help
for technical people.

That's an exemplar of a still-supported effort to identify files.
The "file" command. It does not rely upon, or use, the extension.

And those people are wizards. You can't expect to just read their
source and make some instant discovery. Sometimes, when someone
asks for a new detection, the wizards know of some dependencies
in the detection tree that prevent the craftsmanship necessary.
Mere mortals need not apply while this is going on.

To find 100 different text file types, I un-tarred the Firefox
source tarball and scanned it, then used AWK to total the
various detections and print them out. I only used the AWK
code, after being shocked to find what a shithole the tarball was.
I had originally intended to run UNIX2DOS over the thing, but
that was entirely out of the question when the detections
came in. In fact, there is just one source file in the Firefox
tree, that you MUST NOT alter. It breaks the build, if you do
ANYTHING to it. Good times. I could not figure out why gcc
had such a problem with the file. Could not root cause it.

*******

As a little example, I will scan the Sent file of my News Client,
which I happen to know is corrupted, but I haven't bothered to
fix it yet. And how I detected the corruption in the first place,
was by running this!

$ file Sent
Sent: Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators

That is a corrupt one.

$ file Trash
Trash: ASCII text, with CRLF line terminators

That is not corrupt.

$ dd if=/dev/urandom of=big.bin bs=1048576 count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.44362 s, 144 MB/s

$ file big.bin
big.bin: data <=== Not definitive, as even trivially distorted files do this.
This file just happens to be "perfectly undetectable".

A file full of zeros, is also "data". There is no special detection for it.

Paul

From bart@bc@freeuk.com to comp.lang.c on Sat Dec 6 12:42:53 2025

From Newsgroup: comp.lang.c

On 06/12/2025 02:42, Kaz Kylheku wrote:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {

[ ... ]

fclose(f);
return 0; // NOT binary
}

How about:

int is_binary_file(const char *path)
{
FILE *f = fopen(path);
int yes = 0;

if (f) {
int ch;

while ((ch == getc(f)) != EOF) {
for (int i = 0; i < CHAR_BIT; i++, ch >>= 1) {
switch ((ch & 1)) {
case 0:
case 1:
break;
default:

If this is supposed to detect files which don't consist of binary
characters (for example each ch has CHAR_BIT quaternary digits) then I
don't believe this will detect that.

Assuming that 'ch & 1' is equivalent to 'ch % 2' in that case.


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 17:33:08 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:

Am I close? Missing anything you'd consider to be (or not) needed?

Technically, there is no such thing as a "binary" file. All files
are simply sequences of bytes with no format implied. Interpretation
of the file content is purely application dependent.

C-based applications have certain restrictions on text format
due to the use of the ASCII NUL code as a string terminator, but
that's C. The content of a text file processed by a different
language, or by C using application-defined string containers
can easily contain a NUL byte yet still be considered "text"
if that distinction is necessary.

Because of C/C++, a valid UTF-8 encoding will not include
the NUL byte.

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 17:37:11 2025

From Newsgroup: comp.lang.c

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Michael Sanders <porkchop@invalid.foo> writes:

Am I close? Missing anything you'd consider to be (or not) needed?

There is no completely reliable way to do this, but you might be
able to make a reasonable guess. A binary file might happen to
contain only byte values that represent printable characters.

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

Please use the term "null bytes", not "NULL bytes". NULL is a standard
macro that expands to a null pointer constant.

The proper term IMO is 'NUL' byte as defined by ASCII.

Some older operating systems actually stored the file type in
metadata (like the unix inode). The Burroughs MCP filesystems
included a file-type field in the metadata for a file; the CANDE editor
would use this to determine the programming language (and the associated
language formatting rules a la COBOL or FORTRAN vis-a-vis column
assignments for the sequence number, program verbs, etc.).

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 17:40:18 2025

From Newsgroup: comp.lang.c

Kaz Kylheku <046-301-5902@kylheku.com> writes:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {

[ ... ]

fclose(f);
return 0; // NOT binary
}

How about:

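#include <ctype.h>
#include <stdio.h>

int is_binary_file(const char *path)
{
    FILE *f = fopen(path, "rb");

    if (f) {
        int binary;

        while (isprint(getc(f))) {}
        binary = !feof(f);   /* stopped before reaching end of file */
        fclose(f);
        return binary;
    }
    return 0;
}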

From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 18:04:00 2025

From Newsgroup: comp.lang.c

On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:

Kaz Kylheku <046-301-5902@kylheku.com> writes:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {

[ ... ]

fclose(f);
return 0; // NOT binary
}

How about:

int is_binary_file(const char *path)
{
FILE *f = fopen(path);

if (f) {

while (isprint(getc(f)) {}

The isprint function tests for any member of a locale-specific
set of characters (each of which occupies one printing position
on a display device) including space (' ').

It effectively evaluates whether or not a given value is a
"printing character" in the execution characterset, not whether
or not a given value (from an outside file) is a text character.

I'd use this function cautiously, as it will produce false
results when the characterset of the source data is not the
execution characterset (think a Unicode UTF16 encoded text
file, and an ASCII execution characterset).

return (!feof(f));

}
return 0;
}
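
To illustrate the caution about isprint above: the short sketch below
counts the bytes that isprint() rejects in a buffer holding "Hi"
encoded as UTF-16LE. The helper name and the sample array are
hypothetical, and the default "C" locale is assumed.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes in buf that isprint() rejects.
 * The argument is an unsigned char value, as isprint() requires. */
size_t count_nonprinting(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++)
        if (!isprint(buf[i]))
            count++;
    return count;
}

/* "Hi" encoded as UTF-16LE (values given in ASCII/Unicode):
 * every second byte is zero. */
const unsigned char hi_utf16le[] = { 0x48, 0x00, 0x69, 0x00 };
```

Half of the bytes in this perfectly ordinary text are non-printing as
far as isprint() is concerned, so a byte-wise isprint() loop would
call the file binary.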

--
Lew Pitcher
"In Skills We Trust"
Not LLM output - I'm just like this.

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sat Dec 6 19:06:14 2025

From Newsgroup: comp.lang.c

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:

Kaz Kylheku <046-301-5902@kylheku.com> writes:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {

[ ... ]

fclose(f);
return 0; // NOT binary
}

How about:

int is_binary_file(const char *path)
{
FILE *f = fopen(path);

if (f) {

while (isprint(getc(f)) {}

The isprint function tests for any member of a locale-specific
set of characters (each of which occupies one printing position
on a display device) including space (' ').

It effectively evaluates whether or not a given value is a
"printing character" in the execution characterset, not whether
or not a given value (from an outside file) is a text character.

What is your definition of a "text" character?

I'd use this function cautiously, as it will produce false
results when the characterset of the source data is not the
execution characterset (think a Unicode UTF16 encoded text
file, and an ASCII execution characterset).

return (!feof(f));

}
return 0;
}

--
Lew Pitcher
"In Skills We Trust"
Not LLM output - I'm just like this.


From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Sat Dec 6 21:16:02 2025

From Newsgroup: comp.lang.c

On Sat, 06 Dec 2025 19:06:14 +0000, Scott Lurndal wrote:

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:

Kaz Kylheku <046-301-5902@kylheku.com> writes:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {

[ ... ]

fclose(f);
return 0; // NOT binary
}

How about:

int is_binary_file(const char *path)
{
FILE *f = fopen(path);

if (f) {

while (isprint(getc(f)) {}

The isprint function tests for any member of a locale-specific
set of characters (each of which occupies one printing position
on a display device) including space (' ').

It effectively evaluates whether or not a given value is a
"printing character" in the execution characterset, not whether
or not a given value (from an outside file) is a text character.

What is your definition of a "text" character?

I have none, for this case. However, the OP /might/ have one, given
that his code was an attempt to discern "text" files from "binary"
files.

I'd use this function cautiously, as it will produce false
results when the characterset of the source data is not the
execution characterset (think a Unicode UTF16 encoded text
file, and an ASCII execution characterset).

return (!feof(f));

}
return 0;
}

--
Lew Pitcher
"In Skills We Trust"
Not LLM output - I'm just like this.

From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Sat Dec 6 16:05:45 2025

From Newsgroup: comp.lang.c

scott@slp53.sl.home (Scott Lurndal) writes:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

[...]

Please use the term "null bytes", not "NULL bytes". NULL is a standard
macro that expands to a null pointer constant.

The proper term IMO is 'NUL' byte as defined by ASCII.

That's *a* proper term. It's not the only one.

Both ASCII and EBCDIC use the term "NUL" for the character value
with all bits set to zero, but C doesn't assume either ASCII or
EBCDIC and doesn't use the name "NUL". The standard uses the term
"null character", which is technically correct but might not be
ideal to refer to a byte in a file whose contents aren't intended
to represent characters.

I have no problem with the term "NUL", "NUL byte", or "NUL
character", but personally I tend to prefer "null byte", "zero byte",
or '\0'.

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Sat Dec 6 20:37:22 2025

From Newsgroup: comp.lang.c

On 2025-12-05 20:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.

NULL is a macro that expands to a null pointer constant. I think you
mean "null character". This isn't just nit-picking - C is a
case-sensitive language, so it's essential to pay attention to case.

* Returns 0 if text, 1 if binary or file open failure.
*/

You should return a distinct value for file open failure - a file that
cannot be opened cannot be determined to be either a text or a binary file.

You really cannot distinguish with certainty whether a file is a text
file or a binary file based solely upon the contents. A file whose
format is an array of two-byte 2's complement little-endian integers
would normally be considered binary, yet it might happen to contain
integers whose bytes all happen to be printable characters.

The standard does not define what a "binary file" is. However, it does
provide a promise that applies only to streams in text mode, which
depends upon what was written to that file:

"Data read in from a text stream will necessarily compare equal to the
data that were earlier written out to that stream only if: the data
consist only of printing characters and the control characters
horizontal tab and new-line; no new-line character is immediately
preceded by space characters; and the last character is a new-line
character." (7.23.2p2).

I believe it therefore makes sense to consider something to be a text
file if it meets those requirements, and otherwise is a binary file.
Note that the last requirement implies that an empty file cannot qualify
as text - at a minimum, it must contain a new-line character.

This implies the use of the isprint() function; the only other
characters you need to handle specifically are '\t', '\n', and ' '.
Since the result returned by isprint() is locale-dependent, the program
should, at least optionally, use setlocale().
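
A minimal sketch of that definition, applied to a buffer already read
from the file. The function name is hypothetical, and "space
characters" is taken here to mean just ' '.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..len) meets the conditions quoted
 * above: only printing characters, horizontal tab and new-line; no
 * new-line immediately preceded by a space; last character a new-line. */
bool is_portable_text(const unsigned char *buf, size_t len)
{
    /* An empty buffer has no final new-line, so it does not qualify. */
    if (len == 0 || buf[len - 1] != '\n')
        return false;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == '\n') {
            if (i > 0 && buf[i - 1] == ' ')
                return false;   /* space immediately before new-line */
        } else if (c != '\t' && !isprint(c)) {
            return false;       /* any other non-printing character */
        }
    }
    return true;
}
```

A caller that wants locale-aware results could call
setlocale(LC_CTYPE, "") from <locale.h> before using it. Without
that call, isprint() follows the "C" locale.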

From antispam@antispam@fricas.org (Waldek Hebisch) to comp.lang.c on Sun Dec 7 03:43:58 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

You are missing a definition: you should first decide what you consider
to be a binary file (this is the hard part). You may wish to consider
my experience from many years ago: I looked at problem reports about
SUN OS. Those were considered text files, about 160 MB in total.
For my purposes it would have been convenient to find a character code
_not_ appearing in those files. But checking showed that the only code
which did not appear was 0. The reports were mostly in English,
but there were non-English pieces contributing international
characters. There were a handful of box-drawing characters.
There were (I think stray) control codes.

You can take from this that a zero code is a strong indicator of a
non-text file. But do you consider UTF-16 encoded text to be binary?
Note that such text is likely to contain a lot of zero bytes.
Any byte value other than zero will appear in a file considered by
its author to be a text file, as long as you take a large enough
sample.
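
A check like the one described above (which byte values never occur
in a body of data) can be sketched with a simple tally. The function
names are hypothetical, and the caller is assumed to zero the counts
array first.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: add the bytes of buf[0..len) to a running tally.
 * counts must have UCHAR_MAX + 1 elements, zeroed before the first call. */
void tally_bytes(const unsigned char *buf, size_t len,
                 size_t counts[UCHAR_MAX + 1])
{
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Return the lowest byte value that never occurred, or -1 if every
 * value was seen at least once. */
int first_unused_byte(const size_t counts[UCHAR_MAX + 1])
{
    for (int v = 0; v <= UCHAR_MAX; v++)
        if (counts[v] == 0)
            return v;
    return -1;
}
```

Calling tally_bytes() on each block read from each file and then
first_unused_byte() once at the end gives the kind of answer
described: for that sample, only 0 was unused.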

If you have a few hundred characters from a file, you can apply
a reasonably simple statistical test to decide whether the text came
from one of the popular human languages, and if so the test will tell
you the language.

For security purposes you may wish to check whether a file only contains
safe codes. But the definition of "safe" depends on the application.
In a US context you could decide that anything outside printable
ASCII + newline is unsafe. Or you may add to this some selected
control codes like tabs. In an international context you probably
need to allow the relevant national character codes, which depend
on the specific environment.
--
Waldek Hebisch

From Louis Krupp@lkrupp@invalid.pssw.com.invalid to comp.lang.c on Sun Dec 7 03:43:40 2025

From Newsgroup: comp.lang.c

On 12/6/2025 10:37 AM, Scott Lurndal wrote:

<snip>

Some older operating systems actually stored the file type in
metadata (like the unix inode). The Burroughs MCP filesystems
included a file-type field in the metadata for a file; the CANDE editor
would use this to determine the programming language (and the associated
language formatting rules a la COBOL or FORTRAN vis-a-vis column
assignments for the sequence number, program verbs, etc.).

The Burroughs file attribute name was "FILEKIND," and it took values
like ALGOLSYMBOL (for an ALGOL source file) and ALGOLCODE (for an
executable compiled with ALGOL). Other file attributes included maximum
record length, character encoding (e.g. ASCII or EBCDIC), and lots more.

This brings back memories, most of them fond.

As far as I can tell, UNISYS MCP systems still have all that:

https://public.support.unisys.com/aseries/docs/ClearPath-MCP-19.0/86000064-520/86000064-520/chapter-000002094.html

Louis


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sun Dec 7 16:47:19 2025

From Newsgroup: comp.lang.c

Louis Krupp <lkrupp@invalid.pssw.com.invalid> writes:

On 12/6/2025 10:37 AM, Scott Lurndal wrote:

<snip>

Some older operating systems actually stored the file type in
metadata (like the unix inode). The Burroughs MCP filesystems
included a file-type field in the metadata for a file; the CANDE editor
would use this to determine the programming language (and the associated
language formatting rules a la COBOL or FORTRAN vis-a-vis column
assignments for the sequence number, program verbs, etc.).

The Burroughs file attribute name was "FILEKIND," and it took values
like ALGOLSYMBOL (for an ALGOL source file) and ALGOLCODE (for an
executable compiled with ALGOL). Other file attributes included maximum
record length, character encoding (e.g. ASCII or EBCDIC), and lots more.

This brings back memories, most of them fond.

As far as I can tell, UNISYS MCP systems still have all that:

https://public.support.unisys.com/aseries/docs/ClearPath-MCP-19.0/86000064-520/86000064-520/chapter-000002094.html

Yes the A-series (Large Systems) emulated systems still have
all that.

The V-series (long defunct) also supported a file kind attribute
for CANDE files.

--------------------------------------------------------------------------------
CAT
C A T A L O G
Usercode: 9895 Filetitle: ====qn on HOME As of 12/07/25 08:35:22 Pg 01

gemcqn SYS Record-size = 600 RPB = 1 Areas 0 EOF 716 LOCALSPO
w15eqn SYS Record-size = 160 RPB = 90 Areas 0 EOF 322 LURNDAL
AAAAqn BPL 09/28/89 10:18:13 5Rec(s) Pub IO 9895
ADDMqn BPL 04/14/87 19:46:19 10Rec(s) Pri IO 9895
ADDUqn BPL 03/03/89 17:04:42 21Rec(s) Pub IO 9895
ADSSqn BPL 06/28/89 14:15:24 7999Rec(s) Pub IO 9895
AHWAqn SPRITE 10/09/89 17:53:23 23Rec(s) Pub IO 9895
AIFAqn BPL 10/12/89 15:15:29 8Rec(s) Pub IO 9895
AIVAqn BPL 10/20/89 16:21:33 4Rec(s) Pub IO 9895
APBPqn BPL 01/11/89 18:05:38 92Rec(s) Pub IO 9895
ARCVqn BPL 04/10/89 10:47:00 376Rec(s) Grd IO 9895
BACKqn BPL 09/09/89 03:55:03 576Rec(s) Pub IO 9895
BBBBqn SPRITE 01/25/89 12:47:45 1Rec(s) Pub IO 9895
BFILqn BINDER 07/05/88 16:27:44 9Rec(s) Pub IO 9895
BLESqn BPL 11/18/88 14:37:32 34Rec(s) Pub IO 9895
BLOAqn BINDER 08/01/87 16:02:54 49Rec(s) Pub IO 9895
BNAGqn DATA 02/08/88 15:07:06 41Rec(s) Pub IO 9999
BNAUqn BPL 06/06/88 15:02:14 104Rec(s) Pri IO 9895
BNAVqn DATA 03/03/89 16:15:00 58Rec(s) Pri IO 9895
BSKLqn BPL 08/11/89 14:24:18 536Rec(s) Pub IO 9895
Transmit space for next page..

BPL - Burroughs Programming Language (low-level systems programming)
SPRITE - Modula-like OS implementation language
BINDER - linker instructions.

(ADSSqn is the BPL source for the document formatting utility)

Four-letter file names were a bit of a pain (the system had
six-character names, but for CANDE the last two characters
stored the usercode; 9895 in EBCDIC is 'qn').
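
To see where 'qn' comes from, here is a small sketch. It assumes the usercode was packed two decimal digits per byte (the post does not say how it was stored), so 9895 becomes the bytes 0x98 0x95, which EBCDIC reads as 'q' and 'n'. Both function names are made up for illustration.

```c
/* Hypothetical helper: map an EBCDIC lowercase-letter code to its ASCII
 * letter. EBCDIC puts a-i at 0x81-0x89, j-r at 0x91-0x99 and s-z at
 * 0xA2-0xA9. Returns '?' for any other code. */
static char ebcdic_lower_to_ascii(unsigned char e)
{
    static const char a_i[] = "abcdefghi";
    static const char j_r[] = "jklmnopqr";
    static const char s_z[] = "stuvwxyz";

    if (e >= 0x81 && e <= 0x89) return a_i[e - 0x81];
    if (e >= 0x91 && e <= 0x99) return j_r[e - 0x91];
    if (e >= 0xA2 && e <= 0xA9) return s_z[e - 0xA2];
    return '?';
}

/* ASSUMPTION: the four-digit usercode is packed two decimal digits per
 * byte (BCD). Writes the two resulting letters plus a terminator. */
static void usercode_suffix(unsigned usercode, char out[3])
{
    unsigned char hi = (unsigned char)((usercode / 1000 % 10) << 4 | (usercode / 100 % 10));
    unsigned char lo = (unsigned char)((usercode / 10 % 10) << 4 | (usercode % 10));

    out[0] = ebcdic_lower_to_ascii(hi);
    out[1] = ebcdic_lower_to_ascii(lo);
    out[2] = '\0';
}
```

Calling usercode_suffix(9895, s) leaves "qn" in s under that packing assumption.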

I wrote the MCP system initialization code; the source file
was SINSqn, the printer banner name was SINEqn. A colleague pointed
out that the latter could be read as sine qua non, which seemed quite
apropos for the system boot code :-).

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Sun Dec 7 19:04:51 2025

From Newsgroup: comp.lang.c

Am 06.12.2025 um 18:33 schrieb Scott Lurndal:

Michael Sanders <porkchop@invalid.foo> writes:

Am I close? Missing anything you'd consider to be (or not) needed?

Technically, there is no such thing as a "binary" file. All files
are simply sequences of bytes with no format implied. Interpretation
of the file content is purely application dependent.

C-based applications have certain restrictions on text format
due to the use of the ASCII NUL code as a string terminator, but
that's C. The content of a text file processed by a different
language, or by C using application-defined string containers
can easily contain a NUL byte yet still be considered "text"
if that distinction is necessary.

Because of C/C++, a valid UTF-8 encoding will not include
the NUL byte.
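
A sketch of the UTF-8 encoding rules may help pin this down. Every byte of a multi-byte sequence has its high bit set, so the only code point whose encoding contains a 0x00 byte is U+0000 itself. The function names are made up for illustration, and surrogate code points are not rejected here.

```c
#include <stddef.h>

/* Encode one code point as UTF-8 into out. Returns the number of bytes
 * written (1-4), or 0 if cp is above U+10FFFF. Surrogates are not checked. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if the UTF-8 encoding of cp contains a zero byte, else 0. */
int utf8_has_zero_byte(unsigned long cp)
{
    unsigned char b[4];
    size_t n = utf8_encode(cp, b);

    for (size_t i = 0; i < n; i++)
        if (b[i] == 0)
            return 1;
    return 0;
}
```

So a null-byte test only misfires on UTF-8 text that actually contains U+0000. Text in other encodings, such as UTF-16, is a different matter.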

You're a philosopher of language because you can't handle ambiguity. But
C is ambiguous at this point.


From Richard Harnden@richard.nospam@gmail.invalid to comp.lang.c on Sun Dec 7 19:01:02 2025

From Newsgroup: comp.lang.c

On 06/12/2025 01:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

if ( (c = fgetc(f)) == '\n' )
printf("Text\n");
else
printf("Binary\n");

fclose(f);

Be aware of false positives/negatives, because I'm sure there will be
plenty :)

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return 1; // cannot open file, treat as error/fail check

unsigned char buf[65536];
size_t n, i;

while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
for (i = 0; i < n; i++) {
unsigned char c = buf[i];

// 1. check for the NULL byte (strong indicator of binary data)
if (c == 0x00) {
fclose(f);
return 1; // IS binary
}

// 2. check for C0 control codes (0x01-0x1F), excluding known
// text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {
fclose(f);
return 1; // IS binary (contains unexpected control code)
}
}
}
}

fclose(f);
return 0; // NOT binary
}


From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Sun Dec 7 21:51:36 2025

From Newsgroup: comp.lang.c

On 07/12/2025 19:01, Richard Harnden wrote:

On 06/12/2025 01:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not)
needed?

A text file is supposed to end with a '\n' (M$, of course,
largely ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

Not guaranteed to work with binary files...

7.19.9.2(3)

A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.

...or text files.

7.19.9.2(4)

For a text stream, either offset shall be zero, or offset shall
be a value returned by an earlier successful call to the ftell
function on a stream associated with the same file and whence
shall be SEEK_SET.
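
One way to sidestep both restrictions, at the cost of reading the whole file, is to skip fseek altogether and simply remember the last byte seen. A minimal sketch (last_byte is a made-up name):

```c
#include <stdio.h>

/* Reads the stream to its end and returns the last byte read (0-255).
 * Returns EOF if the stream is empty or a read error occurred. */
int last_byte(FILE *f)
{
    int c;
    int last = EOF;

    while ((c = getc(f)) != EOF)
        last = c;

    return ferror(f) ? EOF : last;
}
```

The caller can then compare the result with '\n'. That does not make the trailing-newline test any more meaningful as a text/binary check, it only removes the reliance on SEEK_END.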
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Sun Dec 7 14:42:39 2025

From Newsgroup: comp.lang.c

On 12/5/2025 5:05 PM, Michael Sanders wrote:

int is_binary_file(const char *path) {

[...]

You can return a float from is_binary_file() to show a probability? Not exactly sure how you can 100% guarantee it...
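
As a rough sketch of that idea, the function below returns the fraction of bytes in a buffer that the original function would have rejected. It is a score rather than a real probability, and the name suspicious_fraction is made up for illustration.

```c
#include <stddef.h>

/* Returns the fraction (0.0 to 1.0) of bytes in buf that are null bytes
 * or C0 control codes other than tab, LF and CR. An empty buffer gives 0.0. */
double suspicious_fraction(const unsigned char *buf, size_t n)
{
    size_t bad = 0;

    if (n == 0)
        return 0.0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
            bad++;
    }
    return (double)bad / (double)n;
}
```

A caller would still have to pick a threshold, which is where the guesswork moves to.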

From Richard Harnden@richard.nospam@gmail.invalid to comp.lang.c on Sun Dec 7 22:49:52 2025

From Newsgroup: comp.lang.c

On 07/12/2025 21:51, Richard Heathfield wrote:

On 07/12/2025 19:01, Richard Harnden wrote:

On 06/12/2025 01:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not)
needed?

A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

Not guaranteed to work with binary files...

7.19.9.2(3)

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

...or text files.

7.19.9.2(4)

For a text stream, either offset shall be zero, or offset shall
be a value returned by an earlier successful call to the ftell function
on a stream associated with the same file and whence shall be SEEK_SET.

Ah, okay. Thanks.


From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 13:51:49 2025

From Newsgroup: comp.lang.c

Am 07.12.2025 um 22:51 schrieb Richard Heathfield:

On 07/12/2025 19:01, Richard Harnden wrote:

On 06/12/2025 01:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not)
needed?

A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

Not guaranteed to work with binary files...

7.19.9.2(3)

A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.

From the glibc Reference Manual:

“The distinction between text and binary streams is only meaningful on systems where text files
have a different internal representation. On Unix systems, there is no difference between the
two; the ‘b’ is accepted but ignored.”

...or text files.

7.19.9.2(4)

For a text stream, either offset shall be zero, or offset shall
be a value returned by an earlier successful call to the ftell
function on a stream associated with the same file and whence shall be SEEK_SET.


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Mon Dec 8 16:02:51 2025

From Newsgroup: comp.lang.c

Richard Heathfield <rjh@cpax.org.uk> writes:

On 07/12/2025 19:01, Richard Harnden wrote:

On 06/12/2025 01:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not)
needed?

A text file is supposed to end with a '\n' (M$, of course,
largely ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

Not guaranteed to work with binary files...

7.19.9.2(3)

A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.

Not to mention that the ASCII LF character _is_ a valid binary
character, so the presence or absence of an LF as the last byte of a file doesn't indicate anything useful.


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Mon Dec 8 16:04:11 2025

From Newsgroup: comp.lang.c

Bonita Montero <Bonita.Montero@gmail.com> writes:

Am 07.12.2025 um 22:51 schrieb Richard Heathfield:

On 07/12/2025 19:01, Richard Harnden wrote:

On 06/12/2025 01:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not)
needed?

A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

Not guaranteed to work with binary files...

7.19.9.2(3)

A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.

From the glibc Reference Manual:

Has nothing to do with glibc. Dates back to the earliest
days of unix, and is codified by POSIX/SUS.

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:40:59 2025

From Newsgroup: comp.lang.c

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

HTH

Yes sir it really does. I'll study your post closely &
don't think because my reply is brief that I'm not
considering your words.

Thank you Lew.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:46:22 2025

From Newsgroup: comp.lang.c

On Fri, 05 Dec 2025 17:42:30 -0800, Keith Thompson wrote:

There is no completely reliable way to do this, but you might be
able to make a reasonable guess. A binary file might happen to
contain only byte values that represent printable characters.

I suspected this was going to be the case actually.

Please use the term "null bytes", not "NULL bytes". NULL is a standard
macro that expands to a null pointer constant.

Okay, will do.

It seems odd to say that a file is assumed to be binary if you can't
open it. I suggest having the function return more than two distinct
values:

- File seems to be binary
- File seems to be text
- Could be either
- Something went wrong

An enum is probably a good choice.

Aye, that's an interesting way to look at it.
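
A minimal sketch of what the enum version could look like. The enumerator and function names are made up, and mapping "could be either" to an empty file is purely an assumption made for illustration.

```c
#include <stdio.h>

enum file_guess {
    GUESS_TEXT,     /* file seems to be text */
    GUESS_BINARY,   /* file seems to be binary */
    GUESS_UNKNOWN,  /* could be either (ASSUMPTION: used for empty files) */
    GUESS_ERROR     /* something went wrong (open or read failure) */
};

enum file_guess guess_file_kind(const char *path)
{
    FILE *f = fopen(path, "rb");
    enum file_guess result = GUESS_UNKNOWN;
    unsigned long seen = 0;
    int c;

    if (!f)
        return GUESS_ERROR;

    while ((c = getc(f)) != EOF) {
        seen++;
        /* null byte or control code other than tab, LF, CR */
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
            result = GUESS_BINARY;
            break;
        }
    }

    if (result != GUESS_BINARY) {
        if (ferror(f))
            result = GUESS_ERROR;
        else if (seen > 0)
            result = GUESS_TEXT;
    }

    fclose(f);
    return result;
}
```

The caller can then switch on the result instead of guessing what a return value of 1 was supposed to mean.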

0x00 -> '\0'
0x20 -> ' '
0x09 -> '\t'
0x0A -> '\n'
0x0D -> '\r'

Well, I got too fancy there...

Depending on how far you want to get into it, distinguishing between
text and binary files is anywhere from difficult to literally
impossible.

Thanks for your expertise Keith, I appreciate your insight.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:48:13 2025

From Newsgroup: comp.lang.c

On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:

How about:

[...]

You sir are an OCD coder =)
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 17:56:26 2025

From Newsgroup: comp.lang.c

On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

It is the year 2025.

How many times do you suppose someone has considered this question ?

I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.

I get it Paul, but as with all things, there's lots of opinions on this.

There has to be a reason for doing this, and a damn good reason.

*******

There is the "file" command.

It was invented in 1973.

https://en.wikipedia.org/wiki/File_%28command%29

The beauty of this command, is it has some sort of ordered
approach to file determination.

And... is not generally available on Windows & causes a 3rd party
dependency. Not to say that you're not correct in your thinking
but I want portability. And there are lots of things I want that
don't always happen either...

[...]

Thanks Paul, actually I do appreciate your rant & the detailed examples
you cite. I'm in the same place with my project, it can be very frustrating.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:02:26 2025

From Newsgroup: comp.lang.c

On Sat, 6 Dec 2025 20:37:22 -0500, James Kuyper wrote:

NULL is a macro that expands to a null pointer constant. I think you
mean "null character". This isn't just nit-picking - C is a
case-sensitive language, so it's essential to pay attention to case.

Oh yeah. I'm at the stage of simultaneously getting a lot wrong,
a lot right, & that makes my code dangerous at times. I'm slowly
getting there.

You should return a distinct value for file open failure - a file that
cannot be opened cannot be determined to be either a text or a binary file.

Noted.

You really cannot distinguish with certainty whether a file is a text
file or a binary file based solely upon the contents. A file whose
format is an array of two-byte 2's complement little-endian integers
would normally be considered binary, yet it might happen to contain
integers whose bytes all happen to be printable characters.

Ah, I want it to be simple, but that's not the case.
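
A small illustration of that point, with made-up helper names: three 16-bit integers (25928, 27756 and 8559), stored little-endian, produce six bytes that are all printable ASCII and happen to spell "Hello!".

```c
#include <stddef.h>
#include <stdint.h>

/* Store n 16-bit values as little-endian byte pairs in out (2 * n bytes). */
void put_le16(const uint16_t *v, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(v[i] & 0xFF);  /* low byte first */
        out[2 * i + 1] = (unsigned char)(v[i] >> 8);
    }
}

/* Returns 1 if every byte lies in the printable ASCII range 0x20-0x7E. */
int all_printable_ascii(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] < 0x20 || buf[i] > 0x7E)
            return 0;
    return 1;
}
```

Any content-based check would call those six bytes text, even though the file was written as an array of integers.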

This implies the use of the isprint() function; the only other
characters you need to handle specifically are '\t', '\n', and ' '.
Since the result returned by isprint() is locale-dependent, the program should, at least optionally, use setlocale().

Hmm, now that's a curve-ball I did not see coming! I've got to think
about this...
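
A minimal sketch of the isprint() approach, assuming single-byte characters. The function names are made up, and accepting '\r' alongside '\t' and '\n' is an addition for CRLF files rather than part of the suggestion above.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Optional: adopt the user's locale so that isprint() follows it.
 * Without this call the program stays in the default "C" locale. */
void use_user_locale(void)
{
    setlocale(LC_CTYPE, "");
}

/* Returns 1 if every byte is printable according to isprint() in the
 * current locale, or is one of '\t', '\n', '\r'. Returns 0 otherwise. */
int looks_like_text(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (!isprint(c) && c != '\t' && c != '\n' && c != '\r')
            return 0;
    }
    return 1;
}
```

Because buf holds unsigned char values, passing them to isprint() is well defined. In a multibyte locale such as UTF-8 this byte-wise test is too crude, so treat it as a starting point only.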

Paul, thank you for sharing your knowledge, I appreciate your help sir.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:04:51 2025

From Newsgroup: comp.lang.c

On Sun, 7 Dec 2025 03:43:58 -0000 (UTC), Waldek Hebisch wrote:

You miss definition: you should first decide what you consider to
be a binary file (this is hard part).

Yes. This is it - everything right here Waldek, that is my entire
problem.

Thank you for your post, it is interesting reading.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:07:26 2025

From Newsgroup: comp.lang.c

On Sun, 7 Dec 2025 19:01:02 +0000, Richard Harnden wrote:

A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

if ( (c = fgetc(f)) == '\n' )
printf("Text\n");
else
printf("Binary\n");

fclose(f);

Be aware of false positives/negatives, because I'm sure there will be
plenty :)

Thank you Richard. Interesting thoughts.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Mon Dec 8 18:09:08 2025

From Newsgroup: comp.lang.c

On Sun, 7 Dec 2025 14:42:39 -0800, Chris M. Thomasson wrote:

You can return a float from is_binary_file() to show a probability? Not exactly sure how you can 100% guarantee it...

Ha!

You know, that's a crazy idea but a darn cool idea at the same time!
--
:wq
Mike Sanders

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 19:27:25 2025

From Newsgroup: comp.lang.c

Am 08.12.2025 um 17:04 schrieb Scott Lurndal:

Bonita Montero <Bonita.Montero@gmail.com> writes:

Am 07.12.2025 um 22:51 schrieb Richard Heathfield:

On 07/12/2025 19:01, Richard Harnden wrote:

On 06/12/2025 01:05, Michael Sanders wrote:

Am I close? Missing anything you'd consider to be (or not)
needed?

A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:

f = fopen(path, "rb");

fseek(f, -1, SEEK_END);

Not guaranteed to work with binary files...

7.19.9.2(3)

A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.

From the glibc Reference Manual:

Has nothing to do with glibc. Dates back to the earliest
days of unix, and is codified by POSIX/SUS.

Where did I say that this is true for glibc only?


From bart@bc@freeuk.com to comp.lang.c on Mon Dec 8 18:44:33 2025

From Newsgroup: comp.lang.c

On 08/12/2025 18:04, Michael Sanders wrote:

On Sun, 7 Dec 2025 03:43:58 -0000 (UTC), Waldek Hebisch wrote:

You miss definition: you should first decide what you consider to
be a binary file (this is hard part).

Yes. This is it - everything right here Waldek, that is my entire
problem.

It's not clear what the actual problem is. What is the use-case for a
function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?

Is the result /meant/ to be fuzzy?

From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Mon Dec 8 19:26:07 2025

From Newsgroup: comp.lang.c

On 2025-12-08, Michael Sanders <porkchop@invalid.foo> wrote:

On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:

How about:

[...]

You sir are an OCD coder =)

At last, someone seems to have gotten the joke.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 20:36:17 2025

From Newsgroup: comp.lang.c

Am 06.12.2025 um 02:05 schrieb Michael Sanders:

Am I close? Missing anything you'd consider to be (or not) needed?

<stdio.h>

/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/

int is_binary_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return 1; // cannot open file, treat as error/fail check

unsigned char buf[65536];
size_t n, i;

while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
for (i = 0; i < n; i++) {
unsigned char c = buf[i];

// 1. check for the NULL byte (strong indicator of binary data)
if (c == 0x00) {
fclose(f);
return 1; // IS binary
}

// 2. check for C0 control codes (0x01-0x1F), excluding known
// text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {
fclose(f);
return 1; // IS binary (contains unexpected control code)
}
}
}
}

fclose(f);
return 0; // NOT binary
}

Much smaller and with error handling for free:

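#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

using namespace std;
using namespace filesystem;

// Returns true if the file contains a control code other than CR, LF or TAB.
bool binary( path pth )
{
    ifstream ifs;
    // failbit as well as badbit, so that a failed open() throws
    ifs.exceptions( ios_base::failbit | ios_base::badbit );
    ifs.open( pth, ios_base::binary | ios_base::ate );
    streampos pos = ifs.tellg();
    if( pos > (size_t)-1 ) // for 32 bit platforms with large files
        throw ios_base::failure( "file too large",
            error_code( (int)errc::file_too_large, generic_category() ) );
    string buf( (size_t)pos, 0 );
    ifs.seekg( 0 );
    ifs.read( buf.data(), buf.size() );
    auto check = []( unsigned char c ) { return c < 0x20 && c != '\r'
        && c != '\n' && c != '\t'; };
    // binary if at least one such byte was found
    return find_if( buf.begin(), buf.end(), check ) != buf.end();
}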


From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Mon Dec 8 19:42:47 2025

From Newsgroup: comp.lang.c

On 08/12/2025 19:26, Kaz Kylheku wrote:

On 2025-12-08, Michael Sanders <porkchop@invalid.foo> wrote:

On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:

How about:

[...]

You sir are an OCD coder =)

At last, someone seems to have gotten the joke.

An OCD coder would have remembered that fopen takes two
parameters. :-o
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Mon Dec 8 20:50:06 2025

From Newsgroup: comp.lang.c

And if you like it fast:

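#include <algorithm>
#include <filesystem>
#include <fstream>
#include <limits>
#include <string>
#include <system_error>
#include <vector>

using namespace std;
using namespace filesystem;

// Returns true if the file contains a control code other than CR, LF or TAB.
bool binary( path pth )
{
    static vector<bool> valid = []()
    {
        // one entry per possible byte value, i.e. max() + 1 entries
        vector<bool> ret( numeric_limits<unsigned char>::max() + 1 );
        for( size_t c = ret.size(); c--; )
            ret[c] = c >= 0x20 || c == '\r' || c == '\n' || c == '\t';
        return ret;
    }();
    ifstream ifs;
    ifs.exceptions( ios_base::failbit | ios_base::badbit );
    ifs.open( pth, ios_base::binary | ios_base::ate );
    streampos pos = ifs.tellg();
    if( pos > (size_t)-1 )
        throw ios_base::failure( "file too large",
            error_code( (int)errc::file_too_large, generic_category() ) );
    string buf( (size_t)pos, 0 );
    ifs.seekg( 0 );
    ifs.read( buf.data(), buf.size() );
    // binary if at least one byte is not in the valid table
    return find_if( buf.begin(), buf.end(), []( unsigned char c ) {
        return !valid[c]; } ) != buf.end();
}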

The cool thing about that is that the array valid is initialized only
once, and in a thread-safe way.
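
For comparison, the same table-lookup idea can be sketched in C with a constant array. It is filled in at compile time, so the question of thread-safe initialization does not come up. The names is_bad_ctrl and has_bad_ctrl are made up for illustration.

```c
#include <stddef.h>

/* 1 marks a byte value treated as an unexpected control code.
 * Only the first 32 entries are listed; the remaining 224 entries
 * are zero-initialized, meaning "acceptable". */
static const unsigned char is_bad_ctrl[256] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1,          /* 0x00-0x08 */
    0, 0,                               /* 0x09 tab, 0x0A LF */
    1, 1,                               /* 0x0B, 0x0C */
    0,                                  /* 0x0D CR */
    1, 1, 1, 1, 1, 1, 1, 1, 1,          /* 0x0E-0x16 */
    1, 1, 1, 1, 1, 1, 1, 1, 1           /* 0x17-0x1F */
};

/* Returns 1 if buf contains at least one byte flagged in the table. */
int has_bad_ctrl(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (is_bad_ctrl[buf[i]])
            return 1;
    return 0;
}
```

Whether the table beats the chained comparisons is something to measure rather than assume.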

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Mon Dec 8 20:16:52 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

It is the year 2025.

How many times do you suppose someone has considered this question ?

I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.

I get it Paul, but as with all things, there's lots of opinions on this.

There has to be a reason for doing this, and a damn good reason.

*******

There is the "file" command.

It was invented in 1973.

https://en.wikipedia.org/wiki/File_%28command%29

The beauty of this command, is it has some sort of ordered
approach to file determination.

And... is not generally available on Windows

It is open source and could be built for windows.

It's also included in any linux distribution running
under WSL.


From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Mon Dec 8 14:43:58 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:
[...]

For yet another set of unreliable heuristics for guessing whether a file
is text or binary, you can take a look at Perl's built-in "-T" and "-B"
operators.

The "-T" and "-B" tests work as follows. The first block
or so of the file is examined to see if it is valid
UTF-8 that includes non-ASCII characters. If so, it's a
"-T" file. Otherwise, that same portion of the file is
examined for odd characters such as strange control codes
or characters with the high bit set. If more than a third
of the characters are strange, it's a "-B" file; otherwise
it's a "-T" file. Also, any file containing a zero byte
in the examined portion is considered a binary file. (If
executed within the scope of a use locale which includes
"LC_CTYPE", odd characters are anything that isn't a
printable nor space in the current locale.) If "-T" or
"-B" is used on a filehandle, the current IO buffer is
examined rather than the first block. Both "-T" and "-B"
return true on an empty file, or a file at EOF when testing
a filehandle. Because you have to read a file to do the "-T"
test, on most occasions you want to use a "-f" against the
file first, as in "next unless -f $file && -T $file".

It's not clear how big a "block" is. For an empty file, both -T
and -B are true. I don't know whether there are other cases where
both are true, or where both are false.
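
A rough C approximation of the rule as quoted, applied to one block that has already been read. It leaves out the UTF-8 check and the locale handling, and the exact set of "odd" characters (anything with the high bit set, plus control codes other than tab, LF and CR) is an assumption rather than Perl's actual list. The function name is made up.

```c
#include <stddef.h>

/* Returns 1 ("binary") if the block contains a zero byte or if more than
 * a third of its bytes are odd; returns 0 ("text") otherwise.
 * ASSUMPTION: an empty block is reported as text here, whereas Perl
 * reports an empty file as both -T and -B. */
int looks_binary_perlish(const unsigned char *block, size_t n)
{
    size_t odd = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = block[i];
        if (c == 0)
            return 1;               /* a zero byte settles it */
        if (c >= 0x80 || (c < 0x20 && c != '\t' && c != '\n' && c != '\r'))
            odd++;
    }
    return odd * 3 > n;             /* strictly more than one third */
}
```

Because the UTF-8 step is skipped, this version would misjudge text that is mostly non-ASCII, which is exactly the case Perl checks for first.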
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

From David Brown@david.brown@hesbynett.no to comp.lang.c on Tue Dec 9 09:03:36 2025

From Newsgroup: comp.lang.c

On 08/12/2025 21:16, Scott Lurndal wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

It is the year 2025.

How many times do you suppose someone has considered this question ?

I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.

I get it Paul, but as with all things, there's lots of opinions on this.

There has to be a reason for doing this, and a damn good reason.

*******

There is the "file" command.

It was invented in 1973.

https://en.wikipedia.org/wiki/File_%28command%29

The beauty of this command, is it has some sort of ordered
approach to file determination.

And... is not generally available on Windows

It is open source and could be built for windows.

It's also included in any linux distribution running
under WSL.

It is available anywhere you find Windows ports of common *nix
utilities, such as the msys2 project. (And while an msys2 installation
can be quite large, it's possible to pull out individual utilities if
you need to.) Still, it's fair to say that most Windows installations
don't have it.

But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.


From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Tue Dec 9 09:43:04 2025

From Newsgroup: comp.lang.c

On 09/12/2025 08:03, David Brown wrote:

<snip>

But surely on Windows you can just look at the file extension -
if it is ".txt", it's a text file, otherwise it's a binary file.

It is now almost a decade since I last made (approximately
weekly) use of a Windows system. For the 25 years prior to that I
used a variety of extensions for text filenames, including:

txt - generic textfile
doc - documentation*
c - C source
cpp - C++ source
h - C or C++ header
tex - LaTeX source
ly - Lilypond source
eml - email backup
cfg - configuration files
ini - initialisation files
- Makefiles and READMEs
sh - shell script
asm - assembly language source
i - C preprocessor output
bin - binary (contains only '0', '1', and '\n') - I found less
than a dozen of these, but there they were.

These are, of course, all also binary files. Whether a file that
contains only printable characters is text or binary is really a
matter of perspective more than anything else.
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

From Richard Harnden@richard.nospam@gmail.invalid to comp.lang.c on Tue Dec 9 10:17:25 2025

From Newsgroup: comp.lang.c

On 09/12/2025 09:43, Richard Heathfield wrote:

ly - Lilypond source

Off topic, but ... Lilypond is a lovely thing :)


From tTh@tth@none.invalid to comp.lang.c on Tue Dec 9 12:22:21 2025

From Newsgroup: comp.lang.c

On 12/9/25 09:03, David Brown wrote:

But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.

And what about PNM files, which can be pure ASCII-encoded
but are image files?
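
To make the PNM case concrete: the Netpbm formats start with a two-character magic number, "P1" to "P3" for the plain (ASCII) variants and "P4" to "P6" for the raw ones, followed by whitespace. A sketch with made-up function names:

```c
#include <stddef.h>

/* Returns 1-6 if buf starts with a Netpbm magic number "P1".."P6"
 * followed by a whitespace character, and 0 otherwise. */
int pnm_magic(const unsigned char *buf, size_t n)
{
    if (n < 3 || buf[0] != 'P' || buf[1] < '1' || buf[1] > '6')
        return 0;
    if (buf[2] != ' ' && buf[2] != '\t' && buf[2] != '\n' && buf[2] != '\r')
        return 0;
    return buf[1] - '0';
}

/* P1, P2 and P3 are the plain variants, stored entirely as ASCII text. */
int pnm_is_plain(int magic)
{
    return magic >= 1 && magic <= 3;
}
```

A plain PBM such as "P1\n2 2\n0 1\n1 0\n" passes any printable-characters test, yet it is an image.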
--
** **
* tTh des Bourtoulots *
* http://maison.tth.netlib.re/ *
** **

From Paul@nospam@needed.invalid to comp.lang.c on Tue Dec 9 06:38:47 2025

From Newsgroup: comp.lang.c

On Tue, 12/9/2025 3:03 AM, David Brown wrote:

On 08/12/2025 21:16, Scott Lurndal wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

It is the year 2025.

How many times do you suppose someone has considered this question ?

I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.

I get it Paul, but as with all things, there's lots of opinions on this.

There has to be a reason for doing this, and a damn good reason.

*******

There is the "file" command.

It was invented in 1973.

https://en.wikipedia.org/wiki/File_%28command%29

The beauty of this command, is it has some sort of ordered
approach to file determination.

And... is not generally available on Windows

It is open source and could be built for windows.

It's also included in any linux distribution running
under WSL.

It is available anywhere you find Windows ports of common *nix utilities, such as the msys2 project. (And while an msys2 installation can be quite large, it's possible to pull out individual utilities if you need to.) Still, it's fair to say that most Windows installations don't have it.

But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.

There are a couple ways to get it.

The problem with this one, is /etc/magic is as old as the hills
and does not have nearly as much capability. On the plus side,
it's not going to burn your house down either.

https://gnuwin32.sourceforge.net/packages/file.htm

A second source, is Cygwin, but again, it might depend on
when the port was done. Doing it this way has to be better
than the previous link, just because the previous one is
so old.

https://cygwin.com/packages/summary/file.html

And the Wiki on msys2 says this:

"MSYS2 ("minimal system 2") is a software distribution and a
development platform for Microsoft Windows, based on Mingw-w64 and Cygwin
"

It still means when the release was done, could matter.

I started with Cygwin64. This is an example of an executable, but
it relies on other dependencies.

https://mirror.csclub.uwaterloo.ca/cygwin/x86_64/release/file/file-5.46-1-x86_64.tar.xz

The installer is here.

https://cygwin.com/setup-x86_64.exe

# After installation, I checked the dependencies. This does not
# help you find the /etc/magic file for its usage.

$ cygcheck /usr/bin/file.exe
C:\cygwin64\bin\file.exe
C:\cygwin64\bin\cygmagic-1.dll
C:\cygwin64\bin\cygbz2-1.dll
C:\cygwin64\bin\cygwin1.dll
C:\WINDOWS\system32\KERNEL32.dll
C:\WINDOWS\system32\ntdll.dll
C:\WINDOWS\system32\KERNELBASE.dll
C:\cygwin64\bin\cyglzma-5.dll
C:\cygwin64\bin\cygz.dll
C:\cygwin64\bin\cygzstd-1.dll

Testing did not go well. I tested the "find.exe" in Cygwin64
and it did not finish. I used Process Monitor to see what it
was doing, and there was a lot of registry activity. (There
should not be registry activity by find.exe or file.exe )

I tried the file.exe command and it didn't provide output
and the machine hung. My machine never hangs. It's a model
citizen. Windows Defender did not trip. An offline scan
with Windows Defender did not find anything. This is possibly
Process Monitor using all RAM, but that does not normally
happen until 20 minutes or more have passed, and I was only
running tracing for a minute or two.

Cygwin materials are held on mirror sites, and I was using
a mirror (University of Waterloo). For the time being, I would
recommend some isolation while you test that.

*******

On to msys2.

https://www.msys2.org/

Name: msys2-x86_64-20250830.exe
Size: 93,680,251 bytes (89 MiB)
SHA256: B54705073678D32686A2CC356BB552363429E6CCBABBFECCB6D3CB7EC101E73B

"Last analysis 22 hours ago", so it is likely someone in this thread triggered a retest.

https://www.virustotal.com/gui/file/b54705073678d32686a2cc356bb552363429e6ccbabbfeccb6d3cb7ec101e73b [Clean]

Install on disk is 350MB in C:\msys64

https://www.msys2.org/docs/installer/

C:/msys64/msys2_shell.cmd -defterm -here -no-start -ucrt64 # Do not run elevated (use the unelevated terminal)
# Windows Terminal prompt changes color

$ cd /c/msys64/usr/bin
$ file.exe file.exe
file.exe: PE32+ executable for MS Windows 5.02 (console), x86-64 (stripped to external PDB), 10 sections
$ cd /s/disktype
$ file disktype.exe
disktype.exe: PE32 executable for MS Windows 4.00 (console), Intel i386, 16 sections # cygwin32 executable?
# I change directory to the corrupted Sent file and check it with the msys2 version.
$ file Sent
Sent: Mailbox text, 1st line "From - Wed Nov 26 06:13:35 2008"
# I compare to the WSL file command
$ file Sent
Sent: Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators # The corruption detection...

This tells me msys2 has an older version of the magic determination in its file.exe command.

And for the cygwin64, use the rubber gloves on it.
It did not work as expected. Use your SafeHex handling
techniques, until it proves in for you.

Paul
--- Synchronet 3.21a-Linux NewsLink 1.2

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Tue Dec 9 15:09:11 2025

From Newsgroup: comp.lang.c

I made a little benchmark that compares the table code against the
conventional &&-cascaded code. On my Zen4 PC the table code is about
25% faster with clang. I'm doing an AVX2 and an AVX-512 version now;
I guess it will be about 20 - 30 times faster.

#include <iostream>
#include <filesystem>
#include <fstream>
#include <algorithm>
#include <chrono>
#include <limits>
#include <string>
#include <vector>

using namespace std;
using namespace filesystem;
using namespace chrono;

template<bool Table>
bool binary( string const &buf );

int main()
{
ifstream ifs;
ifs.exceptions( ios_base::failbit | ios_base::badbit );
ifs.open( "main.cpp", ios_base::binary | ios_base::ate );
streampos pos = ifs.tellg();
if( pos > (size_t)-1 )
throw ios_base::failure( "file too large", error_code( (int)errc::file_too_large, generic_category() ) );
string buf( (size_t)pos, 0 );
ifs.seekg( 0 );
ifs.read( buf.data(), buf.size() );
binary<true>( buf );
auto bench = [&]<bool Table>( bool_constant<Table> ) -> int
{
int ret = 0;
auto start = high_resolution_clock::now();
for( size_t r = 1'000'000; r; --r )
ret += binary<Table>( buf );
double secs = (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 1.0e9;
cout << (Table ? "table" : "check") << ": " << secs << endl;
return ret;
};
int ret = bench( false_type() );
ret += bench( true_type() );
return ret;
}

template<bool Table>
bool binary( string const &buf )
{
static auto invalid = []( unsigned char c ) static { return c < 0x20 && c != '\r' && c != '\n' && c != '\t'; };
if constexpr( Table )
{
static vector<char> invalidTbl = Table ? []()
{
vector<char> ret( (size_t)numeric_limits<unsigned char>::max() + 1 ); // one entry per byte value, 0..255
for( size_t c = ret.size(); c--; )
ret[c] = invalid( (unsigned char)c );
return ret;
}() : vector<char>();
return find_if( buf.begin(), buf.end(), [&]( unsigned char c )
{ return invalidTbl[c]; } ) == buf.end();
}
else
return find_if( buf.begin(), buf.end(), invalid ) == buf.end();
}
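
For readers following along in C, the two variants being benchmarked above can be sketched roughly as follows. The names (invalid_cascade, invalid_tbl, all_text_bytes) are made up for illustration and do not come from the post, and the table assumes 8-bit bytes so that every unsigned char value is a valid index into 256 entries.

```c
#include <assert.h>
#include <stddef.h>

/* Cascade form: a control byte other than TAB, LF or CR is "invalid". */
static int invalid_cascade(unsigned char c)
{
    return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
}

/* Table form: one flag per possible byte value (assumes CHAR_BIT == 8). */
static unsigned char invalid_tbl[256];
static int tbl_ready = 0;

static void init_invalid_tbl(void)
{
    for (unsigned c = 0; c < 256; c++)
        invalid_tbl[c] = (unsigned char)invalid_cascade((unsigned char)c);
    tbl_ready = 1;
}

/* Returns 1 if no byte in buf[0..n) is flagged by the table, else 0. */
static int all_text_bytes(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (!tbl_ready)
        init_invalid_tbl();
    for (size_t i = 0; i < n; i++)
        if (invalid_tbl[buf[i]])
            return 0;
    return 1;
}
```

Whether the table beats the cascade depends on the compiler and CPU; this sketch only shows the shape of the two approaches, not a measurement.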
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.lang.c on Tue Dec 9 17:31:09 2025

From Newsgroup: comp.lang.c

On Tue, 9 Dec 2025 06:38:47 -0500
Paul <nospam@needed.invalid> wrote:

On Tue, 12/9/2025 3:03 AM, David Brown wrote:

On 08/12/2025 21:16, Scott Lurndal wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:

It is the year 2025.

How many times do you suppose someone has considered this
question ?

I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.

I get it Paul, but as with all things, there's lots of opinions
on this.

There has to be a reason for doing this, and a damn good reason.

*******

There is the "file" command.

It was invented in 1973.

https://en.wikipedia.org/wiki/File_%28command%29

The beauty of this command, is it has some sort of ordered
approach to file determination.

And... is not generally available on Windows

It is open source and could be built for windows.

It's also included in any linux distribution running
under WSL.

It is available anywhere you find Windows ports of common *nix
utilities, such as the msys2 project. (And while an msys2
installation can be quite large, it's possible to pull out
individual utilities if you need to.) Still, it's fair to say that
most Windows installations don't have it.

But surely on Windows you can just look at the file extension - if
it is ".txt", it's a text file, otherwise it's a binary file.

There are a couple ways to get it.

The problem with this one, is /etc/magic is as old as the hills
and does not have nearly as much capability. On the plus side,
it's not going to burn your house down either.

https://gnuwin32.sourceforge.net/packages/file.htm

A second source, is Cygwin, but again, it might depend on
when the port was done. Doing it this way has to be better
than the previous link, just because the previous one is
so old.

https://cygwin.com/packages/summary/file.html

And the Wiki on msys2 says this:

"MSYS2 ("minimal system 2") is a software distribution and a
development platform for Microsoft Windows, based on Mingw-w64
and Cygwin "

That still means the date of the release could matter.

I started with Cygwin64. This is an example of an executable, but
it relies on other dependencies.

https://mirror.csclub.uwaterloo.ca/cygwin/x86_64/release/file/file-5.46-1-x86_64.tar.xz

The installer is here.

https://cygwin.com/setup-x86_64.exe

# After installation, I checked the dependencies. This does not
# help you find the /etc/magic file for its usage.

$ cygcheck /usr/bin/file.exe
C:\cygwin64\bin\file.exe
C:\cygwin64\bin\cygmagic-1.dll
C:\cygwin64\bin\cygbz2-1.dll
C:\cygwin64\bin\cygwin1.dll
C:\WINDOWS\system32\KERNEL32.dll
C:\WINDOWS\system32\ntdll.dll
C:\WINDOWS\system32\KERNELBASE.dll
C:\cygwin64\bin\cyglzma-5.dll
C:\cygwin64\bin\cygz.dll
C:\cygwin64\bin\cygzstd-1.dll

Testing did not go well. I tested the "find.exe" in Cygwin64
and it did not finish. I used Process Monitor to see what it
was doing, and there was a lot of registry activity. (There
should not be registry activity by find.exe or file.exe )

I tried the file.exe command and it didn't provide output
and the machine hung. My machine never hangs. It's a model
citizen. Windows Defender did not trip. An offline scan
with Windows Defender did not find anything. This is possibly
Process Monitor using all RAM, but that does not normally
happen until 20 minutes or more have passed, and I was only
running tracing for a minute or two.

Cygwin materials are held on mirror sites, and I was using
a mirror (University of Waterloo). For the time being, I would
recommend some isolation while you test that.

*******

On to msys2.

https://www.msys2.org/

Name: msys2-x86_64-20250830.exe
Size: 93,680,251 bytes (89 MiB)
SHA256: B54705073678D32686A2CC356BB552363429E6CCBABBFECCB6D3CB7EC101E73B

"Last analysis 22 hours ago", so it is likely someone in this thread triggered a retest.

https://www.virustotal.com/gui/file/b54705073678d32686a2cc356bb552363429e6ccbabbfeccb6d3cb7ec101e73b [Clean]

Install on disk is 350MB in C:\msys64

https://www.msys2.org/docs/installer/

C:/msys64/msys2_shell.cmd -defterm -here -no-start -ucrt64 # Do not run elevated (use the unelevated terminal)
# Windows Terminal prompt changes color

$ cd /c/msys64/usr/bin
$ file.exe file.exe
file.exe: PE32+ executable for MS Windows 5.02 (console), x86-64 (stripped to external PDB), 10 sections
$ cd /s/disktype
$ file disktype.exe
disktype.exe: PE32 executable for MS Windows 4.00 (console), Intel i386, 16 sections # cygwin32 executable?
# I change directory to the corrupted Sent file and check it with the msys2 version.
$ file Sent
Sent: Mailbox text, 1st line "From - Wed Nov 26 06:13:35 2008"
# I compare to the WSL file command
$ file Sent
Sent: Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators # The corruption detection...

Below is the list of files that I needed in order to run a copy of
file.exe taken from msys2 on bare Windows:

 Directory of C:\tmp\tst

12/09/2025  05:00 PM    <DIR>          .
12/09/2025  04:53 PM    <DIR>          ..
12/09/2025  04:54 PM            24,225 file.exe
12/09/2025  05:00 PM        10,357,200 magic.mgc
12/09/2025  04:57 PM         3,358,337 msys-2.0.dll
12/09/2025  04:58 PM            67,277 msys-bz2-1.dll
12/09/2025  04:58 PM           176,762 msys-lzma-5.dll
12/09/2025  04:57 PM           160,362 msys-magic-1.dll
12/09/2025  04:59 PM            88,576 msys-z.dll
12/09/2025  04:58 PM         1,136,580 msys-zstd-1.dll
               8 File(s)     15,369,319 bytes
               2 Dir(s)  760,461,594,624 bytes free

It's still less convenient than running from the msys2 prompt, because
by default file.exe does not look for magic.mgc in the current
directory. So I had to run it as
'file.exe --magic-file magic.mgc my-files'
That can be "solved" with a small wrapper batch file, unless it creates
some other inconvenience.

This tells me msys2 has an older version of the magic determination
in its file.exe command.

And for the cygwin64, use the rubber gloves on it.
It did not work as expected. Use your SafeHex handling
techniques, until it proves in for you.

Paul

I never tried cygwin64. For what I do, the level of compatibility
provided by msys2 is sufficient.
I do have the misfortune of using an old cygwin, because that's how
Altera (then Intel, then Altera again) packages their Nios2 SDK. Over
the years it (cygwin) has suffered from multiple issues caused by the
usual malware that our company's IT stubbornly mistakes for anti-malware.
The most recent example is the Trend Micro virus that they call
"antivirus", which on a few installations (but not on all of them)
silently deletes some vital components of cygwin.
Recently I was glad to discover that all the components of said SDK that
I care about don't actually need cygwin. They are either proper Windows
exes, or bash, perl and python scripts. They work fine from the msys2
prompt and are actually faster that way than from within the cygwin shell.
So now I have a grand plan to gradually stop using old cygwin altogether.
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Tue Dec 9 19:53:38 2025

From Newsgroup: comp.lang.c

On Mon, 8 Dec 2025 18:44:33 +0000, bart wrote:

It's not clear what the actual problem is. What is the use-case for a function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?

Is the result /meant/ to be fuzzy?

Hey bart.

What I mean is that since I have not yet defined a canonical standard
for my program, the goal here (to determine if my code can parse the file)
is unclear.

It means I need to plan much more *before* I write more code, no mean feat
when one is excited & ready to jump in =)
--
:wq
Mike Sanders
--- Synchronet 3.21a-Linux NewsLink 1.2

From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Tue Dec 9 20:15:56 2025

From Newsgroup: comp.lang.c

On 2025-12-09, Richard Harnden <richard.nospam@gmail.invalid> wrote:

On 09/12/2025 09:43, Richard Heathfield wrote:

ly - Lilypond source

Off topic, but ... Lilypond is a lovely thing :)

Some fifteen years ago, I banged up this in it:

https://www.kylheku.com/~kaz/Prelude.pdf

(Change "pdf" to "mid" for MIDI.)

I imagine it must have improved quite a bit since then.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
--- Synchronet 3.21a-Linux NewsLink 1.2

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Tue Dec 9 12:45:24 2025

From Newsgroup: comp.lang.c

On 12/8/2025 10:09 AM, Michael Sanders wrote:

On Sun, 7 Dec 2025 14:42:39 -0800, Chris M. Thomasson wrote:

You can return a float from is_binary_file() to show a probability? Not
exactly sure how you can 100% guarantee it...

Ha!

You know, that's a crazy idea but a darn cool idea at the same time!

;^)

It would be funny with a return of .5, lol

An error can be a negative result.
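
A minimal sketch of that idea, under made-up assumptions: the name text_score and the scoring rule (fraction of bytes that are not suspicious control codes) are hypothetical and not anything proposed in the thread. A negative return value signals an error, as suggested above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical scoring function: returns the fraction (0.0 to 1.0) of bytes
 * that look like text, or -1.0 when there is nothing to score.
 */
static double text_score(const unsigned char *buf, size_t n)
{
    if (buf == NULL || n == 0)
        return -1.0;                 /* negative result signals an error */

    size_t texty = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c >= 0x20 || c == '\t' || c == '\n' || c == '\r')
            texty++;
    }
    return (double)texty / (double)n;
}
```

A buffer in which half the bytes are stray control codes would score 0.5; what threshold counts as "text" is left to the caller.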
--- Synchronet 3.21a-Linux NewsLink 1.2

From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Tue Dec 9 16:23:07 2025

From Newsgroup: comp.lang.c

On Mon, 8 Dec 2025 18:44:33 +0000, bart wrote:

It's not clear what the actual problem is. What is the use-case for a function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?

Is the result /meant/ to be fuzzy?

The fundamental problem is that no analysis of the contents can give you anything other than a fuzzy result. There's nothing more clearly a
binary file than one that contains an array of binary floating point
numbers. However, just by chance, the binary numbers it contains could
happen to be such that every byte of that file can be interpreted as a
text character. How could an analysis of only the file tell you, with certainty, that it wasn't a text file?
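
A small sketch of that point, assuming an 8-byte double: eight printable ASCII bytes are copied into a double, and the resulting object representation passes the same control-code test used elsewhere in this thread. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same byte test as the code under discussion: 1 if every byte passes. */
static int looks_like_text(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
            return 0;
    }
    return 1;
}

/*
 * Builds one double from eight printable ASCII bytes and reports whether
 * its object representation passes the text check.
 * Assumes sizeof(double) == 8.
 */
static int double_passes_as_text(double *out)
{
    unsigned char raw[sizeof(double)];
    assert(sizeof(double) == 8);
    memcpy(raw, "ABCDEFGH", sizeof raw);   /* eight printable bytes */
    memcpy(out, raw, sizeof raw);          /* reinterpret them as a double */
    return looks_like_text(raw, sizeof raw);
}
```

A file holding an array of such values is binary data by intent, yet no byte-level check can tell it apart from text.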
--- Synchronet 3.21a-Linux NewsLink 1.2

From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Tue Dec 9 16:29:39 2025

From Newsgroup: comp.lang.c

On 2025-12-06 20:37, James Kuyper wrote:
...

"Data read in from a text stream will necessarily compare equal to the
data that were earlier written out to that stream only if: the data
consist only of printing characters and the control characters
horizontal tab and new-line; no new-line character is immediately
preceded by space characters; and the last character is a new-line character." (7.23.2p2).

I believe it therefore makes sense to consider something to be a text
file if it meets those requirements, and otherwise is a binary file.
Note that the last requirement implies that an empty file cannot qualify
as text - at a minimum, it must contain a new-line character.

This implies the use of the isprint() function; the only other
characters you need to handle specifically are '\t', '\n', and ' '.
Since the result returned by isprint() is locale-dependent, the program should, at least optionally, use setlocale().

I just realized an annoying complication. Whatever
implementation-specific method is used to indicate end-of-line can only
be portably identified as such by opening the file in text mode and
looking for the newline characters that it gets converted into. But
because of 7.23.2p2, text mode cannot be relied upon for precisely the
files we're trying to identify.
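
The three conditions quoted from 7.23.2p2 can be sketched as a check over bytes already in memory. This sidesteps the complication above only by assumption: it takes '\n' in the buffer to be the end-of-line marker, which is not portable. The function name is hypothetical, "space characters" is read narrowly as ' ', and isprint() follows the current locale (the "C" locale unless setlocale() has been called).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Returns 1 if buf[0..n) satisfies the three conditions quoted from the
 * C standard, else 0. Assumes '\n' in the buffer already marks end-of-line.
 */
static int meets_text_stream_rules(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (n == 0 || buf[n - 1] != '\n')
        return 0;                          /* last character must be new-line */

    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c == '\n') {
            if (i > 0 && buf[i - 1] == ' ')
                return 0;                  /* space immediately before new-line */
        } else if (c != '\t' && !isprint(c)) {
            return 0;                      /* neither printing, tab nor new-line */
        }
    }
    return 1;
}
```

Note that an empty buffer is rejected, matching the observation that an empty file cannot qualify as text under this definition.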
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Tue Dec 9 21:38:57 2025

From Newsgroup: comp.lang.c

On Mon, 08 Dec 2025 14:43:58 -0800, Keith Thompson wrote:

For yet another set of unreliable heuristics for guessing whether a file
is text or binary, you can take a look at Perl's built-in "-T" and "-B" operators.

I guess the key finding in all of these cases really is "unreliable".
Heuristics are the only 'constant', i.e. an educated guess.

I won't win this battle, I can see it coming & then as James pointed
out, the ambiguous stuff with Unicode...
--
:wq
Mike Sanders
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Tue Dec 9 21:49:12 2025

From Newsgroup: comp.lang.c

On Mon, 8 Dec 2025 19:26:07 -0000 (UTC), Kaz Kylheku wrote:

At last, someone seems to have gotten the joke.

I had originally intended to reply (without the hints):

c: 01100011
h: 01101000
u: 01110101
c: 01100011
k: 01101011
l: 01101100
e: 01100101

But figured it could lead to a shellacking...
--
:wq
Mike Sanders
--- Synchronet 3.21a-Linux NewsLink 1.2

From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Tue Dec 9 15:42:59 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:

On Mon, 8 Dec 2025 18:44:33 +0000, bart wrote:

It's not clear what the actual problem is. What is the use-case
for a function that tells you whether any file /might/ be a
text-file based on speculative analysis of its contents? Is
the result /meant/ to be fuzzy?

Hey bart.

What I mean is that since I have not yet defined a canonical
standard for my program, the goal here (to determine if my code
can parse the file) is unclear.

It means I need to plan much more *before* I write more code, no
mean feat when one is excited & ready to jump in =)

You say you want to parse the file. That implies that you expect
the file to have a certain format/syntax, and for parsing to fail
on a file that doesn't satisfy the syntax. In that case, I
speculate that determining whether the file is text or binary is
not useful. The way to determine whether you can parse it is
simply to try to parse it, and see whether that succeeds or fails.
For example, if I want to parse a file containing a C translation
unit, I can feed it to a C compiler (or just a parser if I have
one). If the file contains non-text bytes, that's just a special
case of a syntactically incorrect input, and the parser will
detect it. It should work similarly for whatever format you're
trying to parse. I doubt that you need to distinguish between
incorrect input that's pure text and incorrect input that's
"binary". If I'm right about this (which is by no means
certain), you could have saved a lot of time by telling us up
front *why* you want to distinguish between "text" and "binary"
files. On the other hand, I've seized on the word "parse", and I
may be reading too much into it.
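
To illustrate the "just try to parse it" approach, here is a sketch for an invented format (lines of key=value). The format and the function name are assumptions for illustration only; the point is that a stray non-text byte is rejected by the same path as any other syntax error, with no separate text/binary test.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Parses a made-up format: each line is "key=value\n", key is one or more
 * alphanumerics or '_', value is zero or more printing characters.
 * Returns 0 on success, otherwise the 1-based number of the first bad line.
 */
static size_t parse_key_value(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    size_t line = 1, i = 0;

    while (i < n) {
        size_t key_len = 0;
        while (i < n && (isalnum(buf[i]) || buf[i] == '_')) {
            i++;
            key_len++;
        }
        if (key_len == 0 || i == n || buf[i] != '=')
            return line;               /* missing key or '=' */
        i++;
        while (i < n && buf[i] != '\n') {
            if (!isprint(buf[i]))
                return line;           /* stray byte is just another syntax error */
            i++;
        }
        if (i == n)
            return line;               /* final line lacks its new-line */
        i++;                           /* consume '\n' */
        line++;
    }
    return 0;
}
```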
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
--- Synchronet 3.21a-Linux NewsLink 1.2

From Paul@nospam@needed.invalid to comp.lang.c on Tue Dec 9 20:26:47 2025

From Newsgroup: comp.lang.c

On Tue, 12/9/2025 6:22 AM, tTh wrote:

On 12/9/25 09:03, David Brown wrote:

But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.

And what about PNM files, which can be pure ASCII encoded
but are image files?

teapot.ppm 196,623 bytes

50 36 0A                                       # P6
32 35 36 20 32 35 36 0A                        # 256 256
32 35 35 0A                                    # 255
13 5C C0 13 5C C0 13 5C C0 13 5C C0 13 5C C0   # binary byte tuples 0x13 0x5C 0xC0

******************************************************************

teapot2.ppm 710,359 bytes

P3 # P3 is the ASCII format option
# Created by IrfanView # (How you change storage formats)
256 256
255
19 92 192 19 92 192 19 92 192 19 92 192 19 92 192 # Plain ASCII digits (inefficient)

PNM supports both ASCII and binary payloads.
The magic value of P3 or P6 indicates the PPM payload types in the examples.

********************************************************************

$ file *ppm
teapot.ppm: Netpbm image data, size = 256 x 256, rawbits, pixmap
teapot2.ppm: Netpbm image data, size = 256 x 256, pixmap, ASCII text
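
A rough sketch of the magic-number check for the two PPM variants shown above. The enum and function names are invented for illustration; only P3 and P6 are handled, and the real file command consults a far larger magic database.

```c
#include <assert.h>
#include <stddef.h>

enum ppm_kind { PPM_NOT_PPM = 0, PPM_ASCII = 3, PPM_RAW = 6 };

/*
 * Classifies a buffer by its two-byte Netpbm magic followed by whitespace.
 * Only the PPM pair from the examples (P3 and P6) is recognised here.
 */
static enum ppm_kind ppm_kind_of(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (n < 3 || buf[0] != 'P')
        return PPM_NOT_PPM;
    if (buf[2] != ' ' && buf[2] != '\t' && buf[2] != '\n' && buf[2] != '\r')
        return PPM_NOT_PPM;
    if (buf[1] == '3')
        return PPM_ASCII;              /* payload is plain ASCII digits */
    if (buf[1] == '6')
        return PPM_RAW;                /* payload is raw binary samples */
    return PPM_NOT_PPM;
}
```

Both results describe an image file, even though only one of them would fail a control-code scan.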

Paul

--- Synchronet 3.21a-Linux NewsLink 1.2

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Wed Dec 10 09:18:03 2025

From Newsgroup: comp.lang.c

Now I've developed a benchmark which tests the static comparison approach
vs. the table approach vs. an AVX2 approach vs. an AVX-512 approach. These
are the results with clang 20:

check: 2.17442
table: 2.00056 (109%)
AVX-256: 0.183048 (1093%, 1188%)
AVX-512: 0.0639528 (286%, 3128%, 3400%)

The numbers in brackets are the speedups relative to each of the preceding
results (nearest first). So the AVX-512 solution is 30+ times faster than
the byte-wise solutions.

This is the code:

#include <iostream>
#include <filesystem>
#include <fstream>
#include <algorithm>
#include <chrono>
#include <span>
#include <intrin.h>
#include <array>
#include <functional>
#include <cstdint>
#include <limits>
#include <memory>
#include <string>
#include <vector>
#include "inline.h"

using namespace std;
using namespace filesystem;
using namespace chrono;

template<bool Table>
bool binary( string const &buf );
template<bool Avx512>
bool binaryAvx( string const &buf );

int main()
{
ifstream ifs;
ifs.exceptions( ios_base::failbit | ios_base::badbit );
ifs.open( "main.cpp", ios_base::binary | ios_base::ate );
streampos pos = ifs.tellg();
if( pos > (size_t)-1 )
throw ios_base::failure( "file too large", error_code( (int)errc::file_too_large, generic_category() ) );
string buf( (size_t)pos, 0 );
ifs.seekg( 0 );
ifs.read( buf.data(), buf.size() );
array<double, 4> results;
using test_fn = function<bool ( string const & )>;
auto bench = [&]( size_t i, char const *what, test_fn const &test ) L_FORCEINLINE -> int
{
int ret = 0;
auto start = high_resolution_clock::now();
#if defined(NDEBUG)
constexpr size_t N = 1'000'000;
#else
constexpr size_t N = 1'000;
#endif
for( size_t r = N; r; --r )
ret += test( buf );
double secs = (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 1.0e9;
cout << what << ": " << secs;
results[i] = secs;
if( i )
{
cout << " (";
do
{
cout << (int)(100.0 * results[--i] / secs + 0.5) << "%";
if( i )
cout << ", ";
} while( i );
cout << ")";
}
cout << endl;
return ret;
};
struct test { char const *descr; test_fn fn; };
array<test, 4> tests =
{
test( "check", +[]( string const &str ) -> int { return binary<false>( str ); } ),
test( "table", +[]( string const &str ) -> int { return binary<true>( str ); } ),
test( "AVX-256", +[]( string const &str ) -> int { return binaryAvx<false>( str ); } ),
test( "AVX-512", +[]( string const &str ) -> int { return binaryAvx<true>( str ); } )
};
int ret = 0;
for( size_t t = 0; test const &test : tests )
ret += bench( t++, test.descr, test.fn );
return ret;
}

template<bool Table>
bool binary( string const &buf )
{
static auto invalid = []( unsigned char c ) static { return c < 0x20 && c != '\r' && c != '\n' && c != '\t'; };
if constexpr( Table )
{
static vector<char> invalidTbl = Table ? []()
{
vector<char> ret( (size_t)numeric_limits<unsigned char>::max() + 1 ); // one entry per byte value, 0..255
for( size_t c = ret.size(); c--; )
ret[c] = invalid( (unsigned char)c );
return ret;
}() : vector<char>();
return find_if( buf.begin(), buf.end(), [&]( unsigned char c )
{ return invalidTbl[c]; } ) == buf.end();
}
else
return find_if( buf.begin(), buf.end(), invalid ) == buf.end();
}

template<bool Avx512>
bool binaryAvx( string const &buf )
{
char const
*pBegin = buf.data(),
*pEnd = pBegin + buf.size();
if constexpr( Avx512 )
{
size_t
head = (size_t)pBegin & 63,
tail = (size_t)pEnd & 63;
span<__m512i const> range( (__m512i *)(pBegin - head), (__m512i *)(pEnd - tail + (tail ? 64 : 0)) );
__m512i const
printable = _mm512_set1_epi8( (char)0x20 ),
cr = _mm512_set1_epi8( (char)'\r' ),
lf = _mm512_set1_epi8( (char)'\n' ),
tab = _mm512_set1_epi8( (char)'\t' );
uint64_t mask = (uint64_t)-1ll << head;
auto cur = range.begin(), end = range.end();
auto doChunk = [&]() -> bool
{
__m512i chunk = _mm512_loadu_epi8( (void *)to_address( cur ) );
uint64_t
spaMask = _mm512_cmpge_epu8_mask( chunk, printable ),
crMask = _mm512_cmpeq_epi8_mask( chunk, cr ),
lfMask = _mm512_cmpeq_epi8_mask( chunk, lf ),
tabMask = _mm512_cmpeq_epi8_mask( chunk, tab );
return ((spaMask | crMask | lfMask | tabMask) & mask) == mask;
};
for( ; cur != end - (bool)tail; ++cur, mask = -1ll )
if( !doChunk() )
return false;
if( tail )
{
mask = ~((uint64_t)-1ll << tail);
if( !doChunk() )
return false;
}
}
else
{
size_t
head = (size_t)pBegin & 31,
tail = (size_t)pEnd & 31;
span<__m256i const> range( (__m256i *)(pBegin - head), (__m256i *)(pEnd - tail + (tail ? 32 : 0)) );
__m256i const
zero = _mm256_setzero_si256(),
printable = _mm256_set1_epi8( (char)0xE0 ),
cr = _mm256_set1_epi8( (char)'\r' ),
lf = _mm256_set1_epi8( (char)'\n' ),
tab = _mm256_set1_epi8( (char)'\t' );
uint32_t mask = (uint32_t)-1 << head;
auto cur = range.begin(), end = range.end();
auto doChunk = [&]() -> bool
{
__m256i chunk = _mm256_loadu_si256( (__m256i const *)to_address( cur ) ); // plain AVX load; _mm256_loadu_epi8 needs AVX-512VL/BW
uint32_t
spaMask = ~_mm256_movemask_epi8( _mm256_cmpeq_epi8( _mm256_and_si256( chunk, printable ), zero ) ),
crMask = _mm256_movemask_epi8( _mm256_cmpeq_epi8( chunk, cr ) ),
lfMask = _mm256_movemask_epi8( _mm256_cmpeq_epi8( chunk, lf ) ),
tabMask = _mm256_movemask_epi8 (_mm256_cmpeq_epi8( chunk, tab ) );
return ((spaMask | crMask | lfMask | tabMask) & mask) == mask;
};
for( ; cur != end - (bool)tail; ++cur, mask = -1 )
if( !doChunk() )
return false;
if( tail )
{
mask = ~((uint32_t)-1 << tail);
if( !doChunk() )
return false;
}
}
return true;
}

--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.lang.c on Wed Dec 10 11:21:32 2025

From Newsgroup: comp.lang.c

On Tue, 9 Dec 2025 16:29:39 -0500
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

On 2025-12-06 20:37, James Kuyper wrote:
...

"Data read in from a text stream will necessarily compare equal to
the data that were earlier written out to that stream only if: the
data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately
preceded by space characters; and the last character is a new-line character." (7.23.2p2).

I believe it therefore makes sense to consider something to be a
text file if it meets those requirements, and otherwise is a binary
file. Note that the last requirement implies that an empty file
cannot qualify as text - at a minimum, it must contain a new-line character.

This implies the use of the isprint() function; the only other
characters you need to handle specifically are '\t', '\n', and ' '.
Since the result returned by isprint() is locale-dependent, the
program should, at least optionally, use setlocale().

I just realized an annoying complication. Whatever
implementation-specific method is used to indicate end-of-line can
only be portably identified as such by opening the file in text mode
and looking for the newline characters that it gets converted into.
But because of 7.23.2p2, text mode cannot be relied upon for
precisely the files we're trying to identify.

Does not sound like a problem. According to my understanding, wide
portability was never a part of the OP's spec.

--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 11:35:48 2025

From Newsgroup: comp.lang.c

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

#include <stdio.h> // FILE, fopen, fread, fclose
#include <stddef.h> // size_t

// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)

int is_text_file(const char *path) {
// Try opening the file in binary mode,
// required so that bytes are read exact.
FILE *f = fopen(path, "rb");
if (!f) return -1; // Could not open file

unsigned char buf[4096]; // 4KB chunks
size_t n, i;

// Read in file until EOF
while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
for (i = 0; i < n; i++) {
unsigned char c = buf[i];

// 1. null byte is a very strong indication of binary data.
// Text files virtually never contain 0x00.
if (c == 0x00) {
fclose(f);
return 0; // Contains binary-only byte: NOT text
}

// 2. Check for raw C0 control codes (0x01–0x1F).
// We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
// Any other control code is highly suspicious and usually means binary.
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {
fclose(f);
return 0; // unexpected control character → NOT text
}
}

// 3. NOTE: We intentionally do *not* reject bytes >= 0x80.
// These occur in UTF-8, extended ASCII, and many local encodings.
// Rejecting them would treat valid multilingual text as binary.
// So we treat high bytes as acceptable for "probably text".
}
}

fclose(f);
return 1; // Probably text (no strong binary signatures found)
}
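
One gap worth noting in the loop above: fread() returning 0 can mean a read error rather than end-of-file, in which case the function still reports "probably text". A sketch of the same scan with that case separated out follows; it takes an already-open stream, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Same scan as above, but on an already-open stream, and with a read
 * error reported as -1 instead of being mistaken for end-of-file.
 * Returns -1 on read error, 0 for "not text", 1 for "probably text".
 */
static int scan_stream_for_text(FILE *f)
{
    unsigned char buf[4096];
    size_t n;

    assert(f != NULL);
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            unsigned char c = buf[i];
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
                return 0;              /* NUL or unexpected control code */
        }
    }
    return ferror(f) ? -1 : 1;         /* distinguish error from EOF */
}
```

The caller keeps ownership of the stream and is responsible for fclose().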
--
:wq
Mike Sanders
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 11:38:47 2025

From Newsgroup: comp.lang.c

On Tue, 9 Dec 2025 16:29:39 -0500, James Kuyper wrote:

[...]

James if you can manage a spare moment, see my reply
to Lew ie - is_text_file()

Would like your critique.
--
:wq
Mike Sanders
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 11:41:00 2025

From Newsgroup: comp.lang.c

On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:

[...]

Keith if you get a chance see my reply to Lew 'is_text_file()'

Let me know if I've inched closer a step or two...
--
:wq
Mike Sanders
--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Wed Dec 10 15:07:30 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.

At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().

In reality, I still don't see any benefit to this type of
heuristic-based approach.
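
For reference, a POSIX-only sketch of the mmap() variant suggested above. The function name is made up, the byte test is the one from the post being discussed, and an empty file is reported as "probably text" to match that code. No claim is made here about how much faster it is in practice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * POSIX-only sketch: maps the whole file read-only and scans the mapping.
 * Returns -1 on error, 0 if a suspicious control byte is found, 1 otherwise.
 */
static int is_text_file_mmap(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {             /* mmap rejects zero-length mappings */
        close(fd);
        return 1;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    int result = 1;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = p[i];
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
            result = 0;
            break;
        }
    }
    munmap(p, len);
    return result;
}
```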

--- Synchronet 3.21a-Linux NewsLink 1.2

From Lew Pitcher@lew.pitcher@digitalfreehold.ca to comp.lang.c on Wed Dec 10 15:58:41 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 11:35:48 +0000, Michael Sanders wrote:

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

FWIW, my opinion doesn't matter in the measure of whether or not you have written a competent is_text_file() function; what matters is that it
fits (or does not fit) the use-case you wrote it for. If it were me,
I'd have a hard time writing this function, because I don't know your
use-case, and I'd try to generalize it. I've worked with text files
stored in ASCII, and in EBCDIC, and in various Unicode formats, and
(god help me) in a bunch of other formats as well, and I'd have a hard
time generalizing all that into a universal is_text_file() function.

So, my real advice is to pick your battles, and document exactly what
sort of text file you intend to look for with this function. What
you've written might suit your needs exactly, without accounting for
all the variations of what a text file consists of.

#include <stdio.h> // FILE, fopen, fread, fclose
#include <stddef.h> // size_t

// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)

int is_text_file(const char *path) {
// Try opening the file in binary mode,
// required so that bytes are read exact.
FILE *f = fopen(path, "rb");
if (!f) return -1; // Could not open file

unsigned char buf[4096]; // 4KB chunks
size_t n, i;

// Read in file until EOF
while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
for (i = 0; i < n; i++) {
unsigned char c = buf[i];

// 1. null byte is a very strong indication of binary data.
// Text files virtually never contain 0x00.

Except for UTF16 and UTF32 text files, of course.

So, part of your definition of what constitutes a text file is that
a text file (at least as far as is_text_file() is concerned) does not
contain any UTF16 or UTF32 characters.

if (c == 0x00) {
fclose(f);
return 0; // Contains binary-only byte: NOT text
}

// 2. Check for raw C0 control codes (0x01–0x1F).
// We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
// Any other control code is highly suspicious and usually means binary.
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {

Except for all the flavours of EBCDIC.

So, another part of your definition of what constitutes a text file is that
a text file (at least as far as is_text_file() is concerned) does not
contain EBCDIC.

fclose(f);
return 0; // unexpected control character → NOT text
}
}

// 3. NOTE: We intentionally do *not* reject bytes >= 0x80.
// These occur in UTF-8, extended ASCII, and many local encodings.
// Rejecting them would treat valid multilingual text as binary.
// So we treat high bytes as acceptable for "probably text".

Except for ASCII, which is limited to 7-bit characters between 0x00 and
0x7f (ignoring, of course, those text files that store ASCII with even
or odd parity).

So, another part of your definition of what constitutes a text file is that
a text file (at least as far as is_text_file() is concerned) may contain
ASCII, but is not guaranteed to do so.

}
}

fclose(f);
return 1; // Probably text (no strong binary signatures found)
}
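
To illustrate the UTF-16/UTF-32 caveat above: one possible way to keep the
NUL-byte rule while not misclassifying such files is to look for a Unicode
byte-order mark (BOM) at the start of the data first. This is only a sketch;
detect_bom() and the bom_kind enum are hypothetical names, not part of the
posted function, and a file without a BOM would still need some other test.

```c
#include <stddef.h>  // size_t

// Hypothetical helper (not from the thread): classify the first bytes of a
// buffer by Unicode byte-order mark.
enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE
};

enum bom_kind detect_bom(const unsigned char *p, size_t n) {
    // UTF-32 LE is tested before UTF-16 LE because both begin with FF FE.
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return BOM_UTF32_LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return BOM_UTF32_BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16_LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

A caller could read the first four bytes, call detect_bom(), and apply the
"no 0x00 allowed" rule only when the result is BOM_NONE or BOM_UTF8. Note the
inherent ambiguity: UTF-16 LE text whose first character after the BOM is
U+0000 is indistinguishable from a UTF-32 LE BOM.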

--
Lew Pitcher
"In Skills We Trust"
Not LLM output - I'm just like this.

From Michael S@already5chosen@yahoo.com to comp.lang.c on Wed Dec 10 19:00:38 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.

At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().

I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!

In reality, I still don't see any benefit to this type of
heuristic-based approach.

Neither do I. But OP is not doing it for us, but for himself.


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Wed Dec 10 17:18:45 2025

From Newsgroup: comp.lang.c

Michael S <already5chosen@yahoo.com> writes:

On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.

At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().

I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!

I have, more than once, done such measurements after mmap()
was introduced in SVR4 circa 1989 (ported from SunOS).

On a single-user system, running a single job, the difference
for smaller files is in the noise. For larger files, or when
the system is heavily loaded or multiuser, it can be significant.
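
For readers who want to try the comparison themselves, the mmap() approach
mentioned in this exchange might look roughly like the sketch below. It is
POSIX-only (not ISO C), checks just the NUL-byte rule for brevity, and the
name has_no_nul_mmap() is hypothetical. Whether it is actually faster than
fread() depends on the system, so it should be measured rather than assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>     // open, O_RDONLY
#include <stddef.h>    // size_t
#include <string.h>    // memchr
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat, struct stat
#include <unistd.h>    // close

// Hypothetical sketch: returns 1 if the file contains no NUL byte,
// 0 if it contains one, -1 on any error.
int has_no_nul_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 1; }  // zero-length mmap fails

    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (map == MAP_FAILED) return -1;

    int result = (memchr(map, 0, len) == NULL);
    munmap(map, len);
    return result;
}
```

The scan itself is delegated to memchr(), which lets the C library use
whatever wide-word search it provides instead of a hand-written byte loop.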

From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Wed Dec 10 12:46:36 2025

From Newsgroup: comp.lang.c

On 2025-12-10 06:35, Michael Sanders wrote:
...

#include <stdio.h> // FILE, fopen, fread, fclose
#include <stddef.h> // size_t

// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)

int is_text_file(const char *path) {
// Try opening the file in binary mode,
// required so that bytes are read exactly.
FILE *f = fopen(path, "rb");
if (!f) return -1; // Could not open file

unsigned char buf[4096]; // 4KB chunks
size_t n, i;

// Read in file until EOF
while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
for (i = 0; i < n; i++) {
unsigned char c = buf[i];

I'd recommend against buffering this; C stdio is already buffered, and
it just complicates your code to keep track of a second level of
buffering. Use getc() instead.

// 1. null byte is a very strong indication of binary data.
// Text files virtually never contain 0x00.
if (c == 0x00) {
fclose(f);
return 0; // Contains binary-only byte: NOT text
}

// 2. Check for raw C0 control codes (0x01–0x1F).
// We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
// Any other control code is highly suspicious and usually means binary.
if (c < 0x20) {
if (c != 0x09 && c != 0x0A && c != 0x0D) {
fclose(f);
return 0; // unexpected control character → NOT text
}
}

I would recommend against use of explicit numerical codes for
characters. They make your code dependent upon a particular encoding,
and you're free to make that choice, but for implementations where that
encoding is the default, the corresponding C escape sequences will have
precisely the correct value, and make it easier to understand what
your code is doing:

0x00 '\0'
0x09 '\t'
0x0A '\n'
0x0D '\r'
0x20 ' '
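
Putting both recommendations together (getc() instead of a second buffer,
character constants instead of numeric codes), the quoted function might be
reduced to something like the sketch below. The name is_text_file_getc() is
hypothetical, and the comparison c < ' ' still assumes an ASCII-compatible
execution character set in which all control characters sort below space.

```c
#include <stdio.h>  // FILE, fopen, getc, fclose, EOF

// Sketch only: same return values as the quoted function
// (-1 cannot open, 0 not text, 1 probably text).
int is_text_file_getc(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    int c;  // int, not char, so that EOF differs from every byte value
    while ((c = getc(f)) != EOF) {
        // '\0' is also below ' ', but is spelled out to mirror rule 1.
        if (c == '\0' ||
            (c < ' ' && c != '\t' && c != '\n' && c != '\r')) {
            fclose(f);
            return 0;  // binary indicator found
        }
    }

    fclose(f);
    return 1;  // probably text
}
```

This version does not distinguish a read error from end-of-file; a caller
that cares could check ferror(f) before closing the stream.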


From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Wed Dec 10 12:48:06 2025

From Newsgroup: comp.lang.c

On 2025-12-10 04:21, Michael S wrote:

On Tue, 9 Dec 2025 16:29:39 -0500
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

...

I just realized an annoying complication. Whatever
implementation-specific method is used to indicate end-of-line can
only be portably identified as such by opening the file in text mode
and looking for the newline characters that it gets converted into.
But because of 7.23.2p2, text mode cannot be relied upon for
precisely the files we're trying to identify.

Does not sound like a problem. According to my understanding, wide
portability was never a part of the OP's spec.

His spec was unclear. At least part of my intent in raising these issues
is to point out issues that he might not want to deal with, and which he
can justify ignoring by specifying that his routine is not intended to
deal with them.
Thinking about this particular problem, I see no way to deal with it in
general. Had I a need to write such a routine, I'd be happy to restrict
the validity of my code to platforms where end-of-line is indicated
by a single new-line character. However, I suspect he might need Windows
compatibility, and might not need portability to Unix-like systems.


From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:41:22 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 11:35:48 -0000 (UTC), Michael Sanders wrote:

Yes. Here's my 2nd attempt...

[...]

Last version for me (I have to pivot to other things).

Main change is a look up table, ought to provide
optional future extensibility...

Earnest thanks to each & all =)

#include <stdio.h> // FILE, fopen, fread, fclose
#include <stddef.h> // size_t

// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)

int is_text_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return -1;

unsigned char chunk[4096]; // 4KB
size_t n, i;

// Look Up Table: 1 = allowed in text, 0 = binary indicator
// Allows TAB(0x09), LF(0x0A), CR(0x0D), printable ASCII (0x20–0x7E)
static const unsigned char LUT[128] = {
0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 0x00–0x0F
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0x10–0x1F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x20–0x2F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x30–0x3F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x40–0x4F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x50–0x5F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x60–0x6F
1,1,1,1,1,1,1,1,1,1,1,0 // 0x70–0x7F, last 0 = DEL
};

while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
for (i = 0; i < n; i++) {
if (chunk[i] < 128 && !LUT[chunk[i]]) {
fclose(f);
return 0; // binary indicator found
}
// bytes >= 128 are accepted as probably text
}
}

fclose(f);
return 1; // probably text
}
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:42:47 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 15:07:30 GMT, Scott Lurndal wrote:

The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.

At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().

In reality, I still don't see any benefit to this type of
heuristic-based approach.

Yeah agreed, its one of those things...
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:44:15 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 15:58:41 -0000 (UTC), Lew Pitcher wrote:

[...]

Thanks Lew. I'm stumped, but learned a lot.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 18:45:39 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 12:46:36 -0500, James Kuyper wrote:

I would recommend against use of explicit numerical codes for
characters. They make your code dependent upon a particular encoding,
and you're free to make that choice, but for implementations where that
encoding is the default, the corresponding C escape sequences will have
precisely the correct value, and make it easier to understand what
your code is doing:

0x00 '\0'
0x09 '\t'
0x0A '\n'
0x0D '\r'
0x20 ' '

Aye, moving towards that (eventually).

Thanks for your comments James.
--
:wq
Mike Sanders

From Richard Heathfield@rjh@cpax.org.uk to comp.lang.c on Wed Dec 10 19:42:24 2025

From Newsgroup: comp.lang.c

On 10/12/2025 17:18, Scott Lurndal wrote:

Michael S <already5chosen@yahoo.com> writes:

On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.

At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().

I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!

I have, more than once, done such measurements after mmap()
was introduced in SVR4 circa 1989 (ported from SunOS).

On a single-user system, running a single job, the difference
for smaller files is in the noise. For larger files, or when
the system is heavily loaded or multiuser, it can be significant.

1989 is 36 years ago. Technology has moved on. If your file is
too slow to read, get yourself a real computer.

On my very ordinary desktop machine, I just freq'd[1] a
7,032,963,565-byte file in 12.256 seconds. That's 573,838,410
bytes per second. It's a damn sight faster than I could do by hand.

How, exactly, are you using `slow'?

[1] Nothing fancy; a getc loop with ++pfm[ch].count written
entirely in what used to be called clc-conforming code, and I can
see at least one egregious inefficiency in the code that I can't
be bothered to fix because half a gig a second is *easily* fast
enough for my needs.
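
The actual freq program was not posted. A getc loop of the kind described in
the footnote could look like the following sketch; count_bytes() and its
signature are a hypothetical reconstruction, not the author's code.

```c
#include <limits.h>  // UCHAR_MAX
#include <stdio.h>   // FILE, getc, EOF

// Hypothetical sketch: tally how often each byte value occurs in a stream.
// counts must have UCHAR_MAX + 1 elements. Returns the number of bytes read.
unsigned long long count_bytes(FILE *in,
                               unsigned long long counts[UCHAR_MAX + 1]) {
    unsigned long long total = 0;
    int ch;  // getc returns the byte as an unsigned char value, or EOF

    for (int i = 0; i <= UCHAR_MAX; i++) counts[i] = 0;

    while ((ch = getc(in)) != EOF) {
        ++counts[ch];
        ++total;
    }
    return total;
}
```

Dividing the returned total by the elapsed wall-clock time gives a throughput
figure comparable to the one quoted above (7,032,963,565 bytes in 12.256
seconds is roughly 574 million bytes per second).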
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 20:57:49 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 18:41:22 -0000 (UTC), Michael Sanders wrote:

Last version for me (I have to pivot to other things).

[...]

smaller look up table still + bit shifting!

*fastest implementation yet* but virtually unreadable =(

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)

int is_text_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return -1;

unsigned char chunk[4096];
size_t n, i;

// 128-bit bitmask (16 bytes × 8 bits / byte), 1=allowed, 0=disallowed
// Allowed bytes: TAB(0x09), LF(0x0A), CR(0x0D), printable ASCII 0x20–0x7E

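// MASK[c >> 3] holds byte values 8*k .. 8*k+7, so each row below covers
// 32 values. TAB, LF and CR are bits 1, 2 and 5 of MASK[1], i.e. 0x26
// (0x24 would leave TAB disallowed).
static const uint8_t MASK[16] = {
0x00, 0x26, 0x00, 0x00, // 0x00–0x1F: only TAB(09), LF(0A), CR(0D)
0xFF, 0xFF, 0xFF, 0xFF, // 0x20–0x3F: SPC!"#$%&'()*+,-./0123456789:;<=>?
0xFF, 0xFF, 0xFF, 0xFF, // 0x40–0x5F: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
0xFF, 0xFF, 0xFF, 0x7F // 0x60–0x7F: `abcdefghijklmnopqrstuvwxyz{|}~ (DEL excluded)
};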

while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
for (i = 0; i < n; i++) {
if (chunk[i] < 128 && !(MASK[chunk[i] >> 3] & (1 << (chunk[i] & 7)))) {
fclose(f);
return 0; // binary indicator found
}
// bytes >= 128 are accepted as probably text
}
}

fclose(f);
return 1; // probably text
}
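
Hand-written bitmask constants are easy to get wrong, so one option is to
generate the table from the stated rule and compare. The sketch below does
that for the allowed set named in the comment above (TAB, LF, CR, printable
ASCII 0x20–0x7E), using the same layout as the lookup expression:
bit (c & 7) of mask[c >> 3]. build_text_mask() is a hypothetical helper.

```c
#include <stdint.h>  // uint8_t

// Hypothetical helper: fill a 16-byte mask in which bit (c & 7) of
// mask[c >> 3] is set when byte value c (0..127) is allowed in text.
void build_text_mask(uint8_t mask[16]) {
    for (int i = 0; i < 16; i++) mask[i] = 0;

    for (int c = 0; c < 128; c++) {
        int allowed = (c == 0x09 || c == 0x0A || c == 0x0D ||
                       (c >= 0x20 && c <= 0x7E));
        if (allowed) mask[c >> 3] |= (uint8_t)(1u << (c & 7));
    }
}
```

With this layout TAB (0x09), LF (0x0A) and CR (0x0D) set bits 1, 2 and 5 of
mask[1], giving 0x26; a value of 0x24 in that position would leave TAB
disallowed. mask[0], mask[2] and mask[3] are 0x00, mask[4] through mask[14]
are 0xFF, and mask[15] is 0x7F because DEL (0x7F) is excluded.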
--
:wq
Mike Sanders

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Wed Dec 10 22:07:24 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:

On Wed, 10 Dec 2025 18:41:22 -0000 (UTC), Michael Sanders wrote:

Last version for me (I have to pivot to other things).

[...]

smaller look up table still + bit shifting!

*fastest implementation yet* but virtually unreadable =(

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)

int is_text_file(const char *path) {
FILE *f = fopen(path, "rb");
if (!f) return -1;

unsigned char chunk[4096];
size_t n, i;

// 128-bit bitmask (16 bytes × 8 bits / byte), 1=allowed, 0=disallowed
// Allowed bytes: TAB(0x09), LF(0x0A), CR(0x0D), printable ASCII 0x20–0x7E

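// MASK[c >> 3] holds byte values 8*k .. 8*k+7, so each row below covers
// 32 values. TAB, LF and CR are bits 1, 2 and 5 of MASK[1], i.e. 0x26
// (0x24 would leave TAB disallowed).
static const uint8_t MASK[16] = {
0x00, 0x26, 0x00, 0x00, // 0x00–0x1F: only TAB(09), LF(0A), CR(0D)
0xFF, 0xFF, 0xFF, 0xFF, // 0x20–0x3F: SPC!"#$%&'()*+,-./0123456789:;<=>?
0xFF, 0xFF, 0xFF, 0xFF, // 0x40–0x5F: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
0xFF, 0xFF, 0xFF, 0x7F // 0x60–0x7F: `abcdefghijklmnopqrstuvwxyz{|}~ (DEL excluded)
};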

while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
for (i = 0; i < n; i++) {
if (chunk[i] < 128 && !(MASK[chunk[i] >> 3] & (1 << (chunk[i] & 7)))) {
fclose(f);
return 0; // binary indicator found
}
// bytes >= 128 are accepted as probably text

Typically, soi-disant extended ASCII character sets (e.g. ISO-8859-1)
define the first 32 byte values starting at 128 as control characters.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout

From bart@bc@freeuk.com to comp.lang.c on Wed Dec 10 22:37:48 2025

From Newsgroup: comp.lang.c

On 10/12/2025 19:42, Richard Heathfield wrote:

On 10/12/2025 17:18, Scott Lurndal wrote:

Michael S <already5chosen@yahoo.com> writes:

On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.

At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().

I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!

I have, more than once, done such measurements after mmap()
was introduced in SVR4 circa 1989 (ported from SunOS).

On a single-user system, running a single job, the difference
for smaller files is in the noise. For larger files, or when
the system is heavily loaded or multiuser, it can be significant.

1989 is 36 years ago. Technology has moved on. If your file is
too slow to read, get yourself a real computer.

On my very ordinary desktop machine, I just freq'd[1] a
7,032,963,565-byte file in 12.256 seconds. That's 573,838,410 bytes
per second. It's a damn sight faster than I could do by hand.

How, exactly, are you using `slow'?

A getc loop took 4.3 seconds to read a 192MB file from SSD, on my
Windows PC.

Under WSL it took 8.4 seconds (8.4/0.5 real/user).

However reading it all in one go took 0.14 seconds.

I guess not all 'getc' implementations are the same.
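
The "all in one go" code was not shown. A common way to do it is sketched
below; read_whole_file() is a hypothetical name. It relies on
fseek(SEEK_END)/ftell() reporting the size of a binary stream, which works on
mainstream platforms but is not guaranteed by the C standard, and ftell()
returns a long, which limits the file size where long is 32 bits.

```c
#include <stdio.h>   // FILE, fopen, fseek, ftell, rewind, fread, fclose
#include <stdlib.h>  // malloc, free

// Hypothetical sketch: read an entire file with a single fread() call.
// Returns a malloc'd buffer (caller frees) and stores its length in
// *len_out, or returns NULL on any failure.
unsigned char *read_whole_file(const char *path, size_t *len_out) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    // Allocate at least one byte so an empty file still yields non-NULL.
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (!buf) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (got != (size_t)size) { free(buf); return NULL; }

    *len_out = got;
    return buf;
}
```

Once the data is in memory, the text/binary scan becomes a plain loop (or a
memchr() call) over the buffer, with no per-byte library call.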


From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Wed Dec 10 15:20:19 2025

From Newsgroup: comp.lang.c

Michael Sanders <porkchop@invalid.foo> writes:

On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:

[...]

Keith if you get a chance see my reply to Lew 'is_text_file()'

Let me know if I've inched closer a step or two...

Closer to what exactly?

In the parent article, I suggested that you likely don't need to
determine whether a file is "text" or "binary". You said you want
to parse a file. An attempt to parse it will fail either if the
input is binary or if it's text that doesn't match the grammar you
require. For example, a parser for C source code doesn't need to
check whether the input is binary or text. Certain input
characters will simply cause the parse to fail, and a syntax error
can be reported. Tell us more about how you want to parse files.
Are you parsing according to a formal grammar? Or is it more
ad-hoc?
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Wed Dec 10 23:59:44 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 15:20:19 -0800, Keith Thompson wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:

[...]

Keith if you get a chance see my reply to Lew 'is_text_file()'

Let me know if I've inched closer a step or two...

Closer to what exactly?

In the parent article, I suggested that you likely don't need to
determine whether a file is "text" or "binary". You said you want
to parse a file. An attempt to parse it will fail either if the
input is binary or if it's text that doesn't match the grammar you
require. For example, a parser for C source code doesn't need to
check whether the input is binary or text. Certain input
characters will simply cause the parse to fail, and a syntax error
can be reported. Tell us more about how you want to parse files.
Are you parsing according to a formal grammar? Or is it more
ad-hoc?

Yes I'm parsing a formal grammar (but a *really* small one).

Yes I can parse binary/text just fine as you guessed.

The matter at hand:

I wanted to build a stand alone function that makes a solid guess as
to whether a file would be considered an average text file or not.

That's all...

I've solved the issue to my satisfaction.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Thu Dec 11 01:09:59 2025

From Newsgroup: comp.lang.c

On Wed, 10 Dec 2025 22:07:24 GMT, Scott Lurndal wrote:

Typically, soi-disant extended ASCII character sets (e.g. ISO-8859-1)
define the first 32 byte values starting at 128 as control characters.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout

Many thanks Scott. Here's my final stab at the idea.

Beware word-wrap...

#include <stdio.h>
#include <stdint.h>

/*
* is_text_file()
*
* Determines whether a file is "probably text" or binary, using a heuristic
* based on mostly printable characters.
*
* Detection modes:
* TEXT_LOOSE - Allows ASCII printable bytes (0x20–0x7E), TAB/LF/CR,
* and all high-bit bytes (>=128). Tolerant for UTF-8 or
* ISO-8859-1 text.
* TEXT_STRICT - Rejects ASCII control characters (0x00–0x08, 0x0B–0x0C,
* 0x0E–0x1F) and C1 controls (0x80–0x9F). Counts only
* clearly printable bytes.
* TEXT_ISO8859_1 - Accepts ASCII printable (0x20–0x7E), ISO-8859-1
* printable bytes (0xA0–0xFF), and TAB/LF/CR. Rejects
* C1 controls (0x80–0x9F).
*
* Returns:
* 1 file is probably text (>=90% printable characters)
* 0 file is probably binary (too many non-printable characters)
* -1 empty file
* -2 could not open file
*/

typedef enum {
TEXT_LOOSE, // mostly printable: ASCII + high-bit
TEXT_STRICT, // stricter: reject C1 controls
TEXT_ISO8859_1 // ISO-8859-1 printable (0x20–0x7E + 0xA0–0xFF)
} text_mode_t;

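// MASK[c >> 3] holds byte values 8*k .. 8*k+7, so each row below covers
// 32 values. TAB, LF and CR are bits 1, 2 and 5 of MASK[1], i.e. 0x26
// (0x24 would count TAB as a non-printable byte).
static const uint8_t MASK[16] = {
0x00, 0x26, 0x00, 0x00, // 0x00–0x1F: only TAB(09), LF(0A), CR(0D)
0xFF, 0xFF, 0xFF, 0xFF, // 0x20–0x3F: SPC ! " # ... / 0–9 : ; < = > ?
0xFF, 0xFF, 0xFF, 0xFF, // 0x40–0x5F: @ A–Z [ \ ] ^ _
0xFF, 0xFF, 0xFF, 0x7F // 0x60–0x7F: ` a–z { | } ~ (exclude DEL)
};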

int is_text_file(const char *path, text_mode_t mode) {
FILE *f = fopen(path, "rb");
if (!f) return -2;

unsigned char chunk[4096];
uint64_t n, i, good = 0, total = 0;

while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
total += n;

for (i = 0; i < n; i++) {
unsigned char c = chunk[i];

switch (mode) {
case TEXT_LOOSE:
if (c >= 128 || (c < 128 && (MASK[c >> 3] & (1 << (c & 7))))) good++;
break;

case TEXT_STRICT: // reject C1 controls 0x80–0x9F
if ((c >= 128 && c <= 159) || (c < 128 && !(MASK[c >> 3] & (1 << (c & 7))))) {
// bad byte, do not count...
} else good++;
break;

case TEXT_ISO8859_1: // accept 0x20–0x7E + 0xA0–0xFF, reject C1 controls
if ((c >= 0x20 && c <= 0x7E) || (c >= 0xA0 && c <= 0xFF)
|| c == 0x09 || c == 0x0A || c == 0x0D) { good++; }
break;
}
}
}

fclose(f);

if (total == 0) return -1; // empty file

return (good * 10 >= total * 9) ? 1 : 0; // 90% threshold
}
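
The 90% rule changes behaviour compared with the earlier versions in this
thread: a single NUL or control byte no longer rejects the file, only a
sufficiently high proportion of them does. The sketch below applies the same
integer threshold test to an in-memory buffer, using the TEXT_LOOSE rule, so
the effect can be tried without creating files. buffer_is_probably_text() is
a hypothetical name, not part of the posted code.

```c
#include <stddef.h>  // size_t

// Hypothetical in-memory variant of the TEXT_LOOSE mode.
// Returns 1 if at least 90% of the bytes are acceptable, 0 if not,
// and -1 for an empty buffer.
int buffer_is_probably_text(const unsigned char *buf, size_t n) {
    if (n == 0) return -1;

    size_t good = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c >= 0x80 ||                          // high-bit bytes accepted
            c == 0x09 || c == 0x0A || c == 0x0D ||  // TAB, LF, CR
            (c >= 0x20 && c <= 0x7E))             // printable ASCII
            good++;
    }

    // good / n >= 9 / 10, rearranged to avoid floating point.
    return (good * 10 >= n * 9) ? 1 : 0;
}
```

For example, a 10-byte buffer containing nine letters and one NUL byte is
reported as probably text (9 of 10 bytes acceptable), whereas the earlier
versions would have rejected it at the first NUL.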
--
:wq
Mike Sanders

From Paul@nospam@needed.invalid to comp.lang.c on Wed Dec 10 22:35:53 2025

From Newsgroup: comp.lang.c

On Wed, 12/10/2025 5:37 PM, bart wrote:

On 10/12/2025 19:42, Richard Heathfield wrote:

On 10/12/2025 17:18, Scott Lurndal wrote:

Michael S <already5chosen@yahoo.com> writes:

On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael Sanders <porkchop@invalid.foo> writes:

On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:

I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.

Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...

The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.

At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().

I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!

I have, more than once, done such measurements after mmap()
was introduced in SVR4 circa 1989 (ported from SunOS).

On a single-user system, running a single job, the difference
for smaller files is in the noise. For larger files, or when
the system is heavily loaded or multiuser, it can be significant.

1989 is 36 years ago. Technology has moved on. If your file is
too slow to read, get yourself a real computer.

On my very ordinary desktop machine, I just freq'd[1] a
7,032,963,565-byte file in 12.256 seconds. That's 573,838,410 bytes
per second. It's a damn sight faster than I could do by hand.

How, exactly, are you using `slow'?

A getc loop took 4.3 seconds to read a 192MB file from SSD, on my Windows PC.

Under WSL it took 8.4 seconds (8.4/0.5 real/user).

However reading it all in one go took 0.14 seconds.

I guess not all 'getc' implementations are the same.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

/* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */

int main(int argc, char **argv)
{ FILE* source;

int c; /* getc holder */
const int size = 1000*1000*1000;
char keep[size];
int i=0;

printf( "\nWelcome to getcbench.exe\n\n" );

__int64 time1 = 0, time2 = 0, freq = 0; /* code added for timestamp */

if (argc != 2) {
fprintf(stderr, "Usage: %s source_file\n", argv[0]);
return -1;
}

printf( "Array ready, opening file %s\n", argv[1] );

source = fopen(argv[1], "rb");
if (!source) {
fprintf(stderr, "Could not open %s\n", argv[1]);
return -1;
}

QueryPerformanceCounter((LARGE_INTEGER *) &time1); /* clock is running */
QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
printf("time1 = %llX freq = %lld \n", time1, freq);

while ((c = getc(source)) != EOF) {
keep[i++] = c;
if (i >= size) break;
}

QueryPerformanceCounter((LARGE_INTEGER *) &time2);
printf("time2 = %llX \n", time2);

printf("Read %d bytes in %010.6f seconds\n", i, (float)(time2-time1)/freq);
}

$ getcbench.exe D:\test.txt # D: is capable of gigabytes per second speeds

Welcome to getcbench.exe

Array ready, opening file D:test.txt
time1 = 3380876B31 freq = 10000000
time2 = 338D011DCC
Read 1000000000 bytes in 020.930217 seconds # Process Monitor shows that 4096 byte reads are being done

$

***************************************************************

This has additional gubbins.

https://en.cppreference.com/w/c/io/setvbuf

Add some code after the fopen.

if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
{
fprintf(stderr, "setvbuf() failed\n\n" );
return -1;
}

Process Monitor shows the reads now happen in 65536 chunks.

But this does not do a thing for performance (with this style of I/O and no optimization).

$ getcbenchbuf.exe D:\test.txt

Welcome to getcbenchbuf.exe

Array ready, opening file D:test.txt
time1 = 37192A7827 freq = 10000000
time2 = 37256FEAFA
Read 1000000000 bytes in 020.587797 seconds

***************************************************************

If I do this to the original program (-O2), it still is
doing 4096 byte reads, but the performance is better.

$ gcc -O2 -Wl,--stack,1200000000 -o getcbench.exe getcbench.c

$ getcbench.exe D:\\test2.txt

Welcome to getcbench.exe

Array ready, opening file D:\test2.txt
time1 = 3B4D7C1022 freq = 10000000
time2 = 3B4E5EB775
Read 1000000000 bytes in 001.485397 seconds

Busy sum = FFFFFFFFE216FE9C

Extra code was added so keep[] was not optimized away.

for (k = 0; k<i; k++) sum += keep[k];
printf("Busy sum = %llX\n", sum);

That's about 673MB/sec.

The version with the setvbuf, is still reading 65536 byte chunks.

$ gcc -O2 -Wl,--stack,1200000000 -o getcbenchbuf.exe getcbenchbuf.c

$ getcbenchbuf.exe D:\\test2.txt

Welcome to getcbenchbuf.exe

Array ready, opening file D:\test2.txt
time1 = 3C1EA5ACDF freq = 10000000
time2 = 3C1F49CE7D
Read 1000000000 bytes in 001.075651 seconds

Busy sum = FFFFFFFFE216FE9C

That's getting close to a gigabyte per second.

Summary: The -O2 makes a BIG difference.
No idea how it is cheating.

Paul

From bart@bc@freeuk.com to comp.lang.c on Thu Dec 11 11:46:19 2025

From Newsgroup: comp.lang.c

On 11/12/2025 03:35, Paul wrote:

On Wed, 12/10/2025 5:37 PM, bart wrote:

A getc loop took 4.3 seconds to read a 192MB file from SSD, on my Windows PC.

Under WSL it took 8.4 seconds (8.4/0.5 real/user).

However reading it all in one go took 0.14 seconds.

I guess not all 'getc' implementations are the same.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

/* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */

int main(int argc, char **argv)
{ FILE* source;

int c; /* getc holder */
const int size = 1000*1000*1000;
char keep[size];

I didn't see the point of either keeping the array on the stack, or
using a VLA. I made it static. That also allowed me a choice of
compilers with no special options needed.

Add some code after the fopen.

if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
{
fprintf(stderr, "setvbuf() failed\n\n" );
return -1;
}

When I added that, it slowed it down! Maybe it was already using a
bigger buffer.

Extra code was added so keep[] was not optimized away.

My loop didn't store the characters anywhere; it just bumped a count.

I think it was enough that it was calling an external function, 'getc';
a compiler can't optimise that away.

Read 1000000000 bytes in 001.075651 seconds

Busy sum = FFFFFFFFE216FE9C

That's getting close to a gigabyte per second.

Summary: The -O2 makes a BIG difference.
No idea how it is cheating.

Have a look at the generated assembly: is it still making an actual call
to 'getc', or has it been inlined?

In my case -O2 made little difference, and it was still calling getc().
-O2 can't affect such a precompiled function, unless getc() is not
really an external function: either a macro, or a wrapper.

Also, the generated EXE file actually imports getc from msvcrt.dll,
which is a library not known to be performant.
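
The getc-versus-bulk-read gap described above can be reproduced with two small counting routines. This is only a minimal sketch, not the benchmark code from the thread; the names count_bytes_getc() and count_bytes_fread() are made up for illustration, and the 64 KiB chunk size is an arbitrary choice.

```c
#include <stdio.h>

/* Count bytes one at a time with getc(). Every byte goes through the
   library's getc (function or macro), which is where implementations
   can differ a lot in cost. */
size_t count_bytes_getc(FILE *f)
{
    size_t count = 0;
    while (getc(f) != EOF)
        count++;
    return count;
}

/* Count bytes in 64 KiB chunks with fread(): one library call per chunk
   instead of one per byte. */
size_t count_bytes_fread(FILE *f)
{
    static unsigned char buf[65536];   /* static keeps it off the stack */
    size_t count = 0, n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        count += n;
    return count;
}
```

Timing both on the same file (opened in "rb" mode) should show whether the per-byte call overhead is what dominates on a given C library.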

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Thu Dec 11 12:53:13 2025

From Newsgroup: comp.lang.c

On 11.12.2025 at 12:46, bart wrote:

On 11/12/2025 03:35, Paul wrote:

On Wed, 12/10/2025 5:37 PM, bart wrote:

A getc loop took 4.3 seconds to read a 192MB file from SSD, on my
Windows PC.

Under WSL it took 8.4 seconds (8.4/0.5 real/user).

However reading it all in one go took 0.14 seconds.

I guess not all 'getc' implementations are the same.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

/* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */

int main(int argc, char **argv)
{ FILE* source;

    int c;                                      /* getc holder */
    const int size = 1000*1000*1000;
    char keep[size];

I didn't see the point of either keeping the array on the stack, or
using a VLA. I made it static. That also allowed me a choice of
compilers with no special options needed.

Yes. Under Linux/x64 the default stack size is 8 MiB, under Windows/x64
one MiB.
A 1 GB array on the stack is a stack overflow for sure - or should I call
it underflow, since the stack grows downwards?
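
Given those stack limits, one alternative to a static array is to put the buffer on the heap. The following is a rough sketch under that assumption; read_into_heap() is a hypothetical helper and not part of the posted benchmark.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical variant of the benchmark's storage: read up to size bytes
   with getc() into a heap buffer instead of a stack array. Returns the
   buffer (the caller frees it) and stores the byte count in *got, or
   returns NULL if the allocation fails. */
char *read_into_heap(FILE *source, size_t size, size_t *got)
{
    char *keep = malloc(size ? size : 1);
    if (keep == NULL)
        return NULL;

    size_t i = 0;
    int c;
    while (i < size && (c = getc(source)) != EOF)
        keep[i++] = (char)c;

    *got = i;
    return keep;
}
```

With this, no linker option for a larger stack is needed, and an allocation failure is reported instead of crashing.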

Add some code after the fopen.

    if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
    {
         fprintf(stderr, "setvbuf() failed\n\n" );
         return -1;
    }

When I added that, it slowed it down! Maybe it was already using a
bigger buffer.

Extra code was added so keep[] was not optimized away.

My loop didn't store the characters anywhere; it just bumped a count.

I think it was enough that it was calling an external function,
'getc'; a compiler can't optimise that away.

Read 1000000000 bytes in 001.075651 seconds

Busy sum = FFFFFFFFE216FE9C

That's getting close to a gigabyte per second.

Summary: The -O2 makes a BIG difference.
          No idea how it is cheating.

Have a look at the generated assembly: is it still making an actual
call to 'getc', or has it been inlined?

In my case -O2 made little difference, and it was still calling
getc(). -O2 can't affect such a precompiled function, unless getc() is
not really an external function: either a macro, or a wrapper.

Also, the generated EXE file actually imports getc from msvcrt.dll,
which is a library not known to be performant.


From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Thu Dec 11 12:33:14 2025

From Newsgroup: comp.lang.c

On Thu, 11 Dec 2025 01:09:59 -0000 (UTC), Michael Sanders wrote:

[...]

if (c >= 128 || (c < 128 && (MASK[c >> 3] & (1 << (c & 7))))) good++;

[...]

Thinking about it more, the bit-twiddling method, while fast,
is certainly not very readable/maintainable. Those who might
want to use any of the variations I've written will be best
served by the one shown below. It doesn't have all the bells &
whistles of the prior offering, but sometimes that's a good thing.

Note: If you keep map[] 'out in the open' (at file scope),
it's only initialized once instead of on every call...

Well off to work for me.

#include <stdio.h>
#include <stdint.h>

/*
* is_text_file()
*
* Determines whether a file is 'probably text' based on ISO-8859-1 rules.
* Uses a precomputed lookup table for fast byte validation.
*
* Valid bytes:
* - ASCII printable: 0x20–0x7E
* - ISO-8859-1 high printable: 0xA0–0xFF
* - Whitespace/control: TAB (0x09), LF (0x0A), CR (0x0D)
*
* Invalid bytes (binary indicators):
* - NULL byte (0x00)
* - C0 controls (0x01–0x08, 0x0B–0x0C, 0x0E–0x1F)
* - DEL (0x7F)
* - C1 controls (0x80–0x9F)
*
* Returns:
* 1 - file is considered text
* 0 - file is considered binary
* -1 - could not open file
*/

static const uint8_t map[256] = {
0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 80
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 90
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
};

int is_text_file(const char *path) {

    FILE *f = fopen(path, "rb");
    if (!f) return -1; // could not open file

    // larger chunk size means less 'touching' the drive
    unsigned char chunk[65536];
    size_t n, i;

    while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
        for (i = 0; i < n; i++) {
            if (!map[chunk[i]]) {
                fclose(f);
                return 0; // binary detected
            }
        }
    }

    fclose(f);
    return 1; // probably text
}

// eof
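
A hand-typed 256-entry table is easy to get wrong by one digit, so it may be worth cross-checking it against the rules in the header comment. The sketch below restates those rules as a predicate and fills a table from it; is_text_byte_strict() and build_strict_map() are hypothetical names, not part of the code above.

```c
#include <stdint.h>

/* Returns 1 if byte value c counts as text under the stated rules:
   TAB, LF, CR, 0x20-0x7E and 0xA0-0xFF. Everything else (NUL, other C0
   controls, DEL, C1 controls) is rejected. */
int is_text_byte_strict(unsigned c)
{
    if (c == 0x09 || c == 0x0A || c == 0x0D) return 1;
    if (c >= 0x20 && c <= 0x7E) return 1;
    if (c >= 0xA0 && c <= 0xFF) return 1;
    return 0;
}

/* Fill a 256-entry lookup table from the predicate. */
void build_strict_map(uint8_t out[256])
{
    for (unsigned c = 0; c < 256; c++)
        out[c] = (uint8_t)is_text_byte_strict(c);
}
```

Comparing the generated table entry by entry with the hand-written one would flag any slot where the two disagree.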
--
:wq
Mike Sanders

From Kaz Kylheku@046-301-5902@kylheku.com to comp.lang.c on Thu Dec 11 17:33:43 2025

From Newsgroup: comp.lang.c

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

Hi Michael,

I contract for the defense industry and badly need this function!

I am working with proposed code like:

    if (is_binary_file(arg))
        launch_nuclear_strike();

So I'm really sweating over the implementation, as you can imagine.

This thread has been very helpful.

I'm still leaning toward my paranoid function which just checks that
every bit of every byte is either 0 or 1 to confirm that the binary
system is used.

In the I/O error case, I will cautiously return a true value; we would
not want our side to lose due to a storage hardware issue.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c on Thu Dec 11 19:10:03 2025

From Newsgroup: comp.lang.c

Please take my AVX-512 code.
It's so fast that your nuclear strike hits first and you won't get hit
by the enemy.

On 11.12.2025 at 18:33, Kaz Kylheku wrote:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

Hi Michael,

I contract for the defense industry and badly need this function!

I am working with proposed code like:

if (is_binary_file(arg))
launch_nuclear_strike();

So I'm really sweating over the implementation, as you can imagine.

This thread has been very helpful.

I'm still leaning toward my paranoid function which just checks that
every bit of every byte is either 0 or 1 to confirm that the binary
system is used.

In the I/O error case, I will cautiously return a true value; we would
not want our side to lose due to a storage hardware issue.


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Thu Dec 11 14:56:34 2025

From Newsgroup: comp.lang.c

On 12/11/2025 9:33 AM, Kaz Kylheku wrote:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

Hi Michael,

I contract for the defense industry and badly need this function!

I am working with proposed code like:

if (is_binary_file(arg))
launch_nuclear_strike();

any launch_biotoxic_strike(...) in there?

;^) rofl.

So I'm really sweating over the implementation, as you can imagine.

This thread has been very helpful.

I'm still leaning toward my paranoid function which just checks that
every bit of every byte is either 0 or 1 to confirm that the binary
system is used.

In the I/O error case, I will cautiously return a true value; we would
not want our side to lose due to a storage hardware issue.

oh my! ;^D

From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Thu Dec 11 18:15:15 2025

From Newsgroup: comp.lang.c

On 2025-12-11 12:33, Kaz Kylheku wrote:
...

I'm still leaning toward my paranoid function which just checks that
every bit of every byte is either 0 or 1 to confirm that the binary
system is used.

I'd be very interested in seeing how you implement that test, and even
more interested in what the test data looks like that you use to confirm
that a failure of that test is correctly flagged. :-)


From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Fri Dec 12 02:19:17 2025

From Newsgroup: comp.lang.c

On 2025-12-11 18:33, Kaz Kylheku wrote:

On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:

Am I close? Missing anything you'd consider to be (or not) needed?

Hi Michael,

I contract for the defense industry and badly need this function!

I am working with proposed code like:

if (is_binary_file(arg))
launch_nuclear_strike();

else
negotiate_peace_conditions(arg);

I think it's a waste of information to identify some 'arg' as text
and not assume it to be a negotiation proposal for peace treaties!

Or would that be considered just unnecessary feature creep? - Just
bloating the code and having negative impact on runtime performance?
(A few milliseconds could certainly make a difference here between
victory or defeat!)

[...]

In the I/O error case, I will cautiously return a true value; we would
not want our side to lose due to a storage hardware issue.

A very considerate decision. Kudos!

Janis

LOL - you made my day, Kaz!


From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Fri Dec 12 19:25:41 2025

From Newsgroup: comp.lang.c

On Thu, 11 Dec 2025 12:33:14 -0000 (UTC), Michael Sanders wrote:

static const uint8_t map[256] = {...

added 'plugin' maps...

#include <stdio.h>
#include <stdint.h>

/*
* map_strict[]
*
* Valid bytes:
* - ASCII printable: 0x20–0x7E
* - ISO-8859-1 high printable: 0xA0–0xFF
* - Whitespace/control: TAB (0x09), LF (0x0A), CR (0x0D)
*
* Invalid bytes (binary indicators):
* - NULL byte (0x00)
* - C0 controls (0x01–0x08, 0x0B–0x0C, 0x0E–0x1F)
* - DEL (0x7F)
* - C1 controls (0x80–0x9F)
*/

static const uint8_t map_strict[256] = {
0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 80
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 90
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
};

/*
* map_loose[]
*
* Valid bytes:
* - ASCII printable characters: 0x20–0x7E
* - Whitespace/control characters: TAB (0x09), LF (0x0A), CR (0x0D)
* - High bytes: 0x80–0xFF
*
* Invalid bytes (binary indicators):
* - NULL byte: 0x00
* - C0 control codes: 0x01–0x08, 0x0B–0x0C, 0x0E–0x1F
* - DEL character: 0x7F
*/

static const uint8_t map_loose[256] = {
0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 80
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 90
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
};

/*
* is_text_file()
*
* just plug in your own map[]...
*
* Returns:
* 1 - text
* 0 - binary
* -1 - could not open
*/

int is_text_file(const char *path, const uint8_t map[256]) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1; // could not open file

    // 4KB: 4096, 8KB: 8192, 16KB: 16384, 32KB: 32768, 64KB: 65536
    unsigned char buf[65536];
    size_t n, i;

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        for (i = 0; i < n; i++) {
            if (!map[buf[i]]) {
                fclose(f);
                return 0; // not text (binary indicators)
            }
        }
    }

    fclose(f);
    return 1; // probably text
}

// eof
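
To see where the two maps differ in practice, here is a small in-memory sketch. It assumes tables rebuilt from the rules in the two comments (TAB/LF/CR and 0x20-0x7E valid in both, 0xA0-0xFF valid in strict, 0x80-0xFF valid in loose); is_text_buffer() and build_maps() are hypothetical helpers, not part of the posted code.

```c
#include <stddef.h>
#include <stdint.h>

/* In-memory variant of the scan loop: returns 1 if every byte in buf is
   accepted by map, 0 at the first rejected byte. */
int is_text_buffer(const unsigned char *buf, size_t n, const uint8_t map[256])
{
    for (size_t i = 0; i < n; i++)
        if (!map[buf[i]])
            return 0;
    return 1;
}

/* Rebuild both tables from their stated rules. */
void build_maps(uint8_t strict[256], uint8_t loose[256])
{
    for (unsigned c = 0; c < 256; c++) {
        int ascii_text = c == 0x09 || c == 0x0A || c == 0x0D ||
                         (c >= 0x20 && c <= 0x7E);
        strict[c] = (uint8_t)(ascii_text || c >= 0xA0);
        loose[c]  = (uint8_t)(ascii_text || c >= 0x80);
    }
}
```

The UTF-8 encoding of the euro sign is the byte sequence E2 82 AC. The 0x82 byte falls in the C1 range, so a strict table rejects it while a loose table accepts it. That makes the loose map the likelier choice for files that may be UTF-8.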
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Fri Dec 12 22:54:58 2025

From Newsgroup: comp.lang.c

On Fri, 12 Dec 2025 19:25:41 -0000 (UTC), Michael Sanders wrote:

[...]

Done.

Features...

- plugin maps
- follows symlinks
- rejects directories, devices, sockets

#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>

/*
* map_strict[]
*
* Valid bytes:
* - ASCII printable: 0x20–0x7E
* - ISO-8859-1 high printable: 0xA0–0xFF
* - Whitespace/control: TAB (0x09), LF (0x0A), CR (0x0D)
*
* Invalid bytes (binary indicators):
* - NULL byte (0x00)
* - C0 controls (0x01–0x08, 0x0B–0x0C, 0x0E–0x1F)
* - DEL (0x7F)
* - C1 controls (0x80–0x9F)
*/

static const uint8_t map_strict[256] = {
0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 80
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 90
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
};

/*
* map_loose[]
*
* Valid bytes:
* - ASCII printable characters: 0x20–0x7E
* - Whitespace/control characters: TAB (0x09), LF (0x0A), CR (0x0D)
* - High bytes: 0x80–0xFF
*
* Invalid bytes (binary indicators):
* - NULL byte: 0x00
* - C0 control codes: 0x01–0x08, 0x0B–0x0C, 0x0E–0x1F
* - DEL character: 0x7F
*/

static const uint8_t map_loose[256] = {
0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0, // 00
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 10
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 20
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 30
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 40
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 50
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 60
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0, // 70
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 80
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 90
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // B0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // C0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // D0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // E0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 // F0
};

/*
* is_text_file()
*
* just plug in your own map[]...
*
* Returns:
* 1 - text
* 0 - binary indicator
* -1 - could not open
*/

int is_text_file(const char *path, const uint8_t map[256]) {

    // now we follow symlinks...
    struct stat st;
    if (stat(path, &st) != 0) return -1; // cannot access file
    if (!S_ISREG(st.st_mode)) return -1; // reject: directories/devices/sockets

    FILE *f = fopen(path, "rb");
    if (!f) return -1; // could not open file

    // 4KB: 4096, 8KB: 8192, 16KB: 16384, 32KB: 32768, 64KB: 65536
    unsigned char buf[16384];
    size_t n, i;

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        for (i = 0; i < n; i++) {
            if (!map[buf[i]]) {
                fclose(f);
                return 0; // not text (binary indicator detected)
            }
        }
    }

    fclose(f);
    return 1; // probably text
}

// eof
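
One thing to be aware of with stat() followed by fopen(): the path could be swapped for something else between the two calls. A possible variation, sketched here for POSIX systems only, checks the already-open stream with fstat() instead. open_regular() is a hypothetical helper name, not part of the posted code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno() under strict -std modes */
#endif
#include <stdio.h>
#include <sys/stat.h>

/* Open path and return the stream only if it refers to a regular file
   (symlinks are followed by fopen); returns NULL otherwise. The type
   check is done on the open descriptor, so it describes the same object
   that will be read. */
FILE *open_regular(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    struct stat st;
    if (fstat(fileno(f), &st) != 0 || !S_ISREG(st.st_mode)) {
        fclose(f);
        return NULL;
    }
    return f;
}
```

The trade-off is that opening a FIFO this way can block until a writer appears, which the stat()-first check avoids. Doing both (stat() before, fstat() after) is also an option.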
--
:wq
Mike Sanders

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.lang.c on Fri Dec 12 15:33:01 2025

From Newsgroup: comp.lang.c

On 12/12/2025 2:54 PM, Michael Sanders wrote:
[...]

fclose(f);
return 1; // probably text
}

define the probability? Say in 0...1?

[...]


From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Sat Dec 13 00:20:48 2025

From Newsgroup: comp.lang.c

On Fri, 12 Dec 2025 15:33:01 -0800, Chris M. Thomasson wrote:

On 12/12/2025 2:54 PM, Michael Sanders wrote:
[...]

fclose(f);
return 1; // probably text
}

define the probability? Say in 0...1?

[...]

Add it Chris & I'll roll it in =)

Me? I'd go with steps of say, 10% just to
make it human-friendly, but that's just me.
--
:wq
Mike Sanders

From Michael Sanders@porkchop@invalid.foo to comp.lang.c on Sat Dec 13 02:32:57 2025

From Newsgroup: comp.lang.c

On Sat, 13 Dec 2025 00:20:48 -0000 (UTC), Michael Sanders wrote:

On Fri, 12 Dec 2025 15:33:01 -0800, Chris M. Thomasson wrote:

On 12/12/2025 2:54 PM, Michael Sanders wrote:
[...]

fclose(f);
return 1; // probably text
}

define the probability? Say in 0...1?

[...]

Add it Chris & I'll roll it in =)

Me? I'd go with steps of say, 10% just to
make it human-friendly, but that's just me.

just thinking out loud about probabilities...

int is_text_file(const char *path, const uint8_t map[256], int probability)
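
One way to read the probability idea, purely as an assumption since nothing has been settled in the thread, is "the share of bytes the map accepts", reported in 10% steps. The sketch below scores an in-memory buffer that way; text_percent() is a hypothetical name, and how the score would feed into is_text_file() is left open.

```c
#include <stddef.h>
#include <stdint.h>

/* Percentage (0-100) of bytes in buf that the map accepts, rounded down
   to a multiple of 10. An empty buffer scores 100, which matches
   is_text_file() returning 1 for an empty file. */
int text_percent(const unsigned char *buf, size_t n, const uint8_t map[256])
{
    if (n == 0) return 100;

    size_t good = 0;
    for (size_t i = 0; i < n; i++)
        if (map[buf[i]])
            good++;

    int pct = (int)(good * 100 / n);
    return pct - pct % 10;
}
```

A caller could then compare the score against a threshold, for example treating anything at or above 90 as text. Unlike the early-exit loop, this has to scan every byte.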
--
:wq
Mike Sanders