Am I close? Missing anything you'd consider to be (or not) needed?
#include <stdio.h>

/*
 * Checks if a file is likely a binary by examining its content
 * for NULL bytes (0x00) or unusual control characters.
 * Returns 0 if text, 1 if binary or file open failure.
 */
int is_binary_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 1; // cannot open file, treat as error/fail check
    unsigned char buf[65536];
    size_t n, i;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        for (i = 0; i < n; i++) {
            unsigned char c = buf[i];
            // 1. check for the NULL byte (strong indicator of binary data)
            if (c == 0x00) {
                fclose(f);
                return 1; // IS binary
            }
            // 2. check for C0 control codes (0x01-0x1F), excluding known
            // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
            if (c < 0x20) {
                if (c != 0x09 && c != 0x0A && c != 0x0D) {
                    fclose(f);
                    return 1; // IS binary (contains unexpected control code)
                }
            }
        }
    }
    fclose(f);
    return 0; // NOT binary
}
On Sat, 06 Dec 2025 01:05:44 +0000, Michael Sanders wrote:[snip]
Am I close? Missing anything you'd consider to be (or not) needed?
<stdio.h>
/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/
First off, until we get computers that store file data in formats
other than binary, /all/ files (text or not) are "binary" files
(meaning that an is_binary_file() function should always return true).
OTOH, "text files" are a distinguishable subset of binary files.
I suggest that this makes an "is_text_file()" function more valuable
and more fitting than an "is_binary_file()" function.
Secondly, ISTM that the function should return a unique failure value
rather than overload the "is binary" return value. After all, you
actually have three return values: is_text, is_not_text, and is_indeterminate (because of file access failure).
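A minimal sketch of the three-way result described above. The names `text_status`, `TEXT_YES`, `TEXT_NO`, `TEXT_UNKNOWN` and `classify_stream` are made up for illustration, and the byte test is simply the heuristic from the original post (reject NUL and any C0 control other than tab, LF, CR), not a definitive definition of "text".

```c
#include <stdio.h>

/* Hypothetical names, for illustration only. */
enum text_status {
    TEXT_YES,      /* every byte passed the heuristic */
    TEXT_NO,       /* a NUL or unexpected control byte was found */
    TEXT_UNKNOWN   /* the file could not be opened or read */
};

/* Classify an already-open binary stream; does not close it. */
static enum text_status classify_stream(FILE *f)
{
    int ch;

    while ((ch = getc(f)) != EOF) {
        if (ch < 0x20 && ch != '\t' && ch != '\n' && ch != '\r')
            return TEXT_NO;
    }
    return ferror(f) ? TEXT_UNKNOWN : TEXT_YES;
}

/* Open path, classify it, and keep open failure distinct. */
enum text_status is_text_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    enum text_status result;

    if (f == NULL)
        return TEXT_UNKNOWN;
    result = classify_stream(f);
    fclose(f);
    return result;
}
```

A caller can then switch on the result and report an access failure separately instead of treating it as "binary".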
On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
Am I close? Missing anything you'd consider to be (or not) needed?
<stdio.h>
/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/
int is_binary_file(const char *path) {
[ ... ]
fclose(f);
return 0; // NOT binary
}
How about:
int is_binary_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    int yes = 0;
    if (f) {
        int ch;
        while ((ch = getc(f)) != EOF) {
            for (int i = 0; i < CHAR_BIT; i++, ch >>= 1) {
                switch ((ch & 1)) {
                case 0:
                case 1:
                    break;
                default:
Michael Sanders <porkchop@invalid.foo> writes:
Am I close? Missing anything you'd consider to be (or not) needed?
There is no completely reliable way to do this, but you might be
able to make a reasonable guess. A binary file might happen to
contain only byte values that represent printable characters.
<stdio.h>
/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/
Please use the term "null bytes", not "NULL bytes". NULL is a standard
macro that expands to a null pointer constant.
Kaz Kylheku <046-301-5902@kylheku.com> writes:
On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
Am I close? Missing anything you'd consider to be (or not) needed?
<stdio.h>
/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/
int is_binary_file(const char *path) {
[ ... ]
fclose(f);
return 0; // NOT binary
}
How about:
int is_binary_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f) {
        while (isprint(getc(f))) {}
        return (!feof(f));
    }
    return 0;
}
On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:
Kaz Kylheku <046-301-5902@kylheku.com> writes:
On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
Am I close? Missing anything you'd consider to be (or not) needed?
<stdio.h>
/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/
int is_binary_file(const char *path) {
[ ... ]
fclose(f);
return 0; // NOT binary
}
How about:
int is_binary_file(const char *path)
{
FILE *f = fopen(path);
if (f) {
while (isprint(getc(f)) {}
The isprint function tests for any member of a locale-specific
set of characters (each of which occupies one printing position
on a display device) including space (' ').
It effectively evaluates whether or not a given value is a
"printing character" in the execution character set, not whether
or not a given value (from an outside file) is a text character.
I'd use this function cautiously, as it will produce false
results when the character set of the source data is not the
execution character set (think of a Unicode UTF-16 encoded text
file and an ASCII execution character set).
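To illustrate the caveat, here is a small sketch that counts the bytes isprint() rejects in a buffer. `count_nonprinting` and `utf16le_hi` are hypothetical names; the sample is the two-letter text "Hi" encoded as UTF-16LE, and the result assumes the default "C" locale with an ASCII-based execution character set.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count bytes that isprint() rejects. */
size_t count_nonprinting(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if (!isprint(buf[i]))
            count++;
    }
    return count;
}

/* "Hi" encoded as UTF-16LE: each ASCII letter is followed by a zero byte. */
static const unsigned char utf16le_hi[] = { 0x48, 0x00, 0x69, 0x00 };
```

Under those assumptions, half of the bytes of this perfectly ordinary text are reported as non-printing, so a byte-wise isprint() scan would classify the file as binary.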
return (!feof(f));
}
return 0;
}
Lew Pitcher
"In Skills We Trust"
Not LLM output - I'm just like this.
Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
On Sat, 06 Dec 2025 17:40:18 +0000, Scott Lurndal wrote:
Kaz Kylheku <046-301-5902@kylheku.com> writes:
On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
Am I close? Missing anything you'd consider to be (or not) needed?
<stdio.h>
/*
* Checks if a file is likely a binary by examining its content
* for NULL bytes (0x00) or unusual control characters.
* Returns 0 if text, 1 if binary or file open failure.
*/
int is_binary_file(const char *path) {
[ ... ]
fclose(f);
return 0; // NOT binary
}
How about:
int is_binary_file(const char *path)
{
FILE *f = fopen(path);
if (f) {
while (isprint(getc(f)) {}
The isprint function tests for any member of a locale-specific
set of characters (each of which occupies one printing position
on a display device) including space (' ').
It effectively evaluates whether or not a given value is a
"printing character" in the execution character set, not whether
or not a given value (from an outside file) is a text character.
What is your definition of a "text" character?
I'd use this function cautiously, as it will produce false
results when the character set of the source data is not the
execution character set (think of a Unicode UTF-16 encoded text
file and an ASCII execution character set).
return (!feof(f));
}
return 0;
}
Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:[...]
Please use the term "null bytes", not "NULL bytes". NULL is a standard
macro that expands to a null pointer constant.
The proper term IMO is 'NUL' byte as defined by ASCII.
<snip>
Some older operating systems actually stored the file type in
metadata (like the unix inode). The Burroughs MCP filesystems
included a file-type field in the metadata for a file; the CANDE editor
would use this to determine the programming language (and the associated
language formatting rules a la COBOL or FORTRAN vis-a-vis column
assignments for the sequence number, program verbs, etc.).
On 12/6/2025 10:37 AM, Scott Lurndal wrote:
<snip>
Some older operating systems actually stored the file type in
metadata (like the unix inode). The Burroughs MCP filesystems
included a file-type field in the metadata for a file; the CANDE editor
would use this to determine the programming language (and the associated
language formatting rules a la COBOL or FORTRAN vis-a-vis column
assignments for the sequence number, program verbs, etc.
The Burroughs file attribute name was "FILEKIND," and it took values
like ALGOLSYMBOL (for an ALGOL source file) and ALGOLCODE (for an
executable compiled with ALGOL). Other file attributes included maximum
record length, character encoding (e.g. ASCII or EBCDIC), and lots more.
This brings back memories, most of them fond.
As far as I can tell, UNISYS MCP systems still have all that:
https://public.support.unisys.com/aseries/docs/ClearPath-MCP-19.0/86000064-520/86000064-520/chapter-000002094.html
Michael Sanders <porkchop@invalid.foo> writes:
Am I close? Missing anything you'd consider to be (or not) needed?
Technically, there is no such thing as a "binary" file. All files
are simply sequences of bytes with no format implied. Interpretation
of the file content is purely application dependent.
C-based applications have certain restrictions on text format
due to the use of the ASCII NUL code as a string terminator, but
that's C. The content of a text file processed by a different
language, or by C using application-defined string containers
can easily contain a NUL byte yet still be considered "text"
if that distinction is necessary.
By design (and conveniently for C/C++), a valid UTF-8 encoding never
uses the NUL byte except to encode U+0000 itself.
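A small sketch of the point about string containers, using a hypothetical length-carrying type (`counted_text` and the other names here are invented for illustration). The same five bytes are "text" of length 5 to the container, but a NUL-terminated view of them stops after two characters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string container. */
struct counted_text {
    const char *data;
    size_t      len;
};

/* "ab", a NUL byte, then "cd": five bytes of content. */
static const char raw[] = { 'a', 'b', '\0', 'c', 'd' };
static const struct counted_text sample = { raw, sizeof raw };

/* Length a NUL-terminated reader would see for the same bytes. */
size_t c_string_view_length(const struct counted_text *t)
{
    const char *nul = memchr(t->data, '\0', t->len);

    return nul ? (size_t)(nul - t->data) : t->len;
}
```

Here `sample.len` is 5 while `c_string_view_length(&sample)` is 2, which is the restriction that NUL-terminated strings impose.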
On 06/12/2025 01:05, Michael Sanders wrote:
Am I close? Missing anything you'd consider to be (or not)
needed?
A text file is supposed to end with a '\n' (M$, of course,
largely ignores this convention), but a quick test could be:
f = fopen(path, "rb");
fseek(f, -1, SEEK_END);
On 07/12/2025 19:01, Richard Harnden wrote:
On 06/12/2025 01:05, Michael Sanders wrote:
Am I close? Missing anything you'd consider to be (or not)
needed?
A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:
f = fopen(path, "rb");
fseek(f, -1, SEEK_END);
Not guaranteed to work with binary files...
7.19.9.2(3)
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
...or text files.
7.19.9.2(4)
For a text stream, either offset shall be zero, or offset shall
be a value returned by an earlier successful call to the ftell function
on a stream associated with the same file and whence shall be SEEK_SET.
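One way to sidestep the fseek() restrictions quoted above is to read the stream sequentially and remember the last byte. This is only a sketch: `ends_with_newline` is a hypothetical helper, it reads the whole file (so it costs more I/O than a seek would), and it assumes the stream was opened in binary mode on a system where a line ends in a '\n' byte.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the last byte read from the stream
 * is '\n', 0 if it is not (or the stream is empty), -1 on read error.
 * Reads sequentially, so it needs no fseek() support.
 */
int ends_with_newline(FILE *f)
{
    int ch;
    int last = EOF;

    while ((ch = getc(f)) != EOF)
        last = ch;
    if (ferror(f))
        return -1;
    return last == '\n';
}
```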
On 07.12.2025 at 22:51, Richard Heathfield wrote:
On 07/12/2025 19:01, Richard Harnden wrote:
On 06/12/2025 01:05, Michael Sanders wrote:
Am I close? Missing anything you'd consider to be (or not)
needed?
A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:
f = fopen(path, "rb");
fseek(f, -1, SEEK_END);
Not guaranteed to work with binary files...
7.19.9.2(3)
A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.
From the glibc Reference Manual:
HTH
There is no completely reliable way to do this, but you might be
able to make a reasonable guess. A binary file might happen to
contain only byte values that represent printable characters.
Please use the term "null bytes", not "NULL bytes". NULL is a standard
macro that expands to a null pointer constant.
It seems odd to say that a file is assumed to be binary if you can't
open it. I suggest having the function return more than two distinct
values:
- File seems to be binary
- File seems to be text
- Could be either
- Something went wrong
An enum is probably a good choice.
0x00 -> '\0'
0x20 -> ' '
0x09 -> '\t'
0x0A -> '\n'
0x0D -> '\r'
Depending on how far you want to get into it, distinguishing between
text and binary files is anywhere from difficult to literally
impossible.
How about:
[...]
It is the year 2025.
How many times do you suppose someone has considered this question ?
I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.
There has to be a reason for doing this, and a damn good reason.
*******
There is the "file" command.
It was invented in 1973.
https://en.wikipedia.org/wiki/File_%28command%29
The beauty of this command, is it has some sort of ordered
approach to file determination.
[...]
NULL is a macro that expands to a null pointer constant. I think you
mean "null character". This isn't just nit-picking: C is a
case-sensitive language, so it's essential to pay attention to case.
You should return a distinct value for file open failure - a file that
cannot be opened cannot be determined to be either a text or a binary file.
You really cannot distinguish with certainty whether a file is a text
file or a binary file based solely upon the contents. A file whose
format is an array of two-byte 2's complement little-endian integers
would normally be considered binary, yet it might happen to contain
integers whose bytes all happen to be printable characters.
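A concrete instance of that case, as a sketch. The helper names and sample values are invented for illustration, and ASCII is assumed: the two 16-bit integers below, stored little-endian, produce exactly the bytes of the text "Hi!" followed by a newline.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Serialize 16-bit values as little-endian bytes, regardless of host order. */
void store_le16(const int16_t *values, size_t count, unsigned char *out)
{
    for (size_t i = 0; i < count; i++) {
        uint16_t v = (uint16_t)values[i];

        out[2 * i]     = (unsigned char)(v & 0xFF);
        out[2 * i + 1] = (unsigned char)(v >> 8);
    }
}

/* Returns 1 if every byte is a printing character or '\n'. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!isprint(buf[i]) && buf[i] != '\n')
            return 0;
    }
    return 1;
}

/* 26952 = 0x6948 -> 'H','i'; 2593 = 0x0A21 -> '!','\n' (ASCII assumed) */
static const int16_t sample_values[] = { 26952, 2593 };
```

A content-only check has no way to tell this integer array from a one-line text file.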
This implies the use of the isprint() function; the only other
characters you need to handle specifically are '\t', '\n', and ' '.
Since the result returned by isprint() is locale-dependent, the program should, at least optionally, use setlocale().
You are missing a definition: you should first decide what you consider to
be a binary file (this is the hard part).
A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:
    f = fopen(path, "rb");
    fseek(f, -1, SEEK_END);
    if ( (c = fgetc(f)) == '\n' )
        printf("Text\n");
    else
        printf("Binary\n");
    fclose(f);
Be aware of false positives/negatives, because I'm sure there will be
plenty :)
You can return a float from is_binary_file() to show a probability? Not exactly sure how you can 100% guarantee it...
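A rough sketch of what such a score could look like. Note the assumptions: `text_likeness` is a hypothetical function, it reuses the original post's byte heuristic, and the value it returns is just the fraction of text-like bytes, not a calibrated probability.

```c
#include <stdio.h>

/*
 * Hypothetical scoring function: returns the fraction (0.0 to 1.0) of
 * bytes that look like text under the original post's heuristic, or
 * -1.0 if the stream is empty or a read error occurs.
 */
double text_likeness(FILE *f)
{
    unsigned long total = 0, texty = 0;
    int ch;

    while ((ch = getc(f)) != EOF) {
        total++;
        if (ch >= 0x20 || ch == '\t' || ch == '\n' || ch == '\r')
            texty++;
    }
    if (ferror(f) || total == 0)
        return -1.0;
    return (double)texty / (double)total;
}
```

The caller still has to pick a threshold, which is the same definitional problem in a different form.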
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 07.12.2025 at 22:51, Richard Heathfield wrote:
On 07/12/2025 19:01, Richard Harnden wrote:
On 06/12/2025 01:05, Michael Sanders wrote:
Am I close? Missing anything you'd consider to be (or not)
needed?
A text file is supposed to end with a '\n' (M$, of course, largely
ignores this convention), but a quick test could be:
f = fopen(path, "rb");
fseek(f, -1, SEEK_END);
Not guaranteed to work with binary files...
7.19.9.2(3)
A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.
From the glibc Reference Manual:
Has nothing to do with glibc. Dates back to the earliest
days of unix, and is codified by POSIX/SUS.
On Sun, 7 Dec 2025 03:43:58 -0000 (UTC), Waldek Hebisch wrote:
You are missing a definition: you should first decide what you consider to
be a binary file (this is the hard part).
Yes. This is it - everything right here Waldek, that is my entire
problem.
On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:
How about:
[...]
You sir are an OCD coder =)
On 2025-12-08, Michael Sanders <porkchop@invalid.foo> wrote:
On Sat, 6 Dec 2025 02:42:39 -0000 (UTC), Kaz Kylheku wrote:
How about:
[...]
You sir are an OCD coder =)
At last, someone seems to have gotten the joke.
On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:
It is the year 2025.
How many times do you suppose someone has considered this question ?
I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.
I get it Paul, but as with all things, there's lots of opinions on this.
There has to be a reason for doing this, and a damn good reason.
*******
There is the "file" command.
It was invented in 1973.
https://en.wikipedia.org/wiki/File_%28command%29
The beauty of this command, is it has some sort of ordered
approach to file determination.
And... is not generally available on Windows
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:
It is the year 2025.
How many times do you suppose someone has considered this question ?
I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.
I get it Paul, but as with all things, there's lots of opinions on this.
There has to be a reason for doing this, and a damn good reason.
*******
There is the "file" command.
It was invented in 1973.
https://en.wikipedia.org/wiki/File_%28command%29
The beauty of this command, is it has some sort of ordered
approach to file determination.
And... is not generally available on Windows
It is open source and could be built for windows.
It's also included in any linux distribution running
under WSL.
But surely on Windows you can just look at the file extension -
if it is ".txt", it's a text file, otherwise it's a binary file.
ly - Lilypond source
On 08/12/2025 21:16, Scott Lurndal wrote:
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:
It is the year 2025.
How many times do you suppose someone has considered this question ?
I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.
I get it Paul, but as with all things, there's lots of opinions on this.
There has to be a reason for doing this, and a damn good reason.
*******
There is the "file" command.
It was invented in 1973.
https://en.wikipedia.org/wiki/File_%28command%29
The beauty of this command, is it has some sort of ordered
approach to file determination.
And... is not generally available on Windows
It is open source and could be built for windows.
It's also included in any linux distribution running
under WSL.
It is available anywhere you find Windows ports of common *nix utilities, such as the msys2 project. (And while an msys2 installation can be quite large, it's possible to pull out individual utilities if you need to.) Still, it's fair to say that most Windows installations don't have it.
But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.
On Tue, 12/9/2025 3:03 AM, David Brown wrote:
On 08/12/2025 21:16, Scott Lurndal wrote:
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 03:14:55 -0500, Paul wrote:
It is the year 2025.
How many times do you suppose someone has considered this
question ?
I'm not trying to be a smart ass by saying this, just that the
question is bound to be nuanced. You can do a fast and totally
inaccurate determination. You can do a computationally expensive
or I/O expensive determination.
I get it Paul, but as with all things, there's lots of opinions
on this.
There has to be a reason for doing this, and a damn good reason.
*******
There is the "file" command.
It was invented in 1973.
https://en.wikipedia.org/wiki/File_%28command%29
The beauty of this command, is it has some sort of ordered
approach to file determination.
And... is not generally available on Windows
It is open source and could be built for windows.
It's also included in any linux distribution running
under WSL.
It is available anywhere you find Windows ports of common *nix
utilities, such as the msys2 project. (And while an msys2
installation can be quite large, it's possible to pull out
individual utilities if you need to.) Still, it's fair to say that
most Windows installations don't have it.
But surely on Windows you can just look at the file extension - if
it is ".txt", it's a text file, otherwise it's a binary file.
There are a couple ways to get it.
The problem with this one, is /etc/magic is as old as the hills
and does not have nearly as much capability. On the plus side,
it's not going to burn your house down either.
https://gnuwin32.sourceforge.net/packages/file.htm
A second source, is Cygwin, but again, it might depend on
when the port was done. Doing it this way has to be better
than the previous link, just because the previous one is
so old.
https://cygwin.com/packages/summary/file.html
And the Wiki on msys2 says this:
"MSYS2 ("minimal system 2") is a software distribution and a
development platform for Microsoft Windows, based on Mingw-w64
and Cygwin "
It still means when the release was done, could matter.
I started with Cygwin64. This is an example of an executable, but
it relies on other dependencies.
https://mirror.csclub.uwaterloo.ca/cygwin/x86_64/release/file/file-5.46-1-x86_64.tar.xz
The installer is here.
https://cygwin.com/setup-x86_64.exe
# After installation, I checked the dependencies. This does not
# help you find the /etc/magic file for its usage.
$ cygcheck /usr/bin/file.exe
C:\cygwin64\bin\file.exe
C:\cygwin64\bin\cygmagic-1.dll
C:\cygwin64\bin\cygbz2-1.dll
C:\cygwin64\bin\cygwin1.dll
C:\WINDOWS\system32\KERNEL32.dll
C:\WINDOWS\system32\ntdll.dll
C:\WINDOWS\system32\KERNELBASE.dll
C:\cygwin64\bin\cyglzma-5.dll
C:\cygwin64\bin\cygz.dll
C:\cygwin64\bin\cygzstd-1.dll
Testing did not go well. I tested the "find.exe" in Cygwin64
and it did not finish. I used Process Monitor to see what it
was doing, and there was a lot of registry activity. (There
should not be registry activity by find.exe or file.exe )
I tried the file.exe command and it didn't provide output
and the machine hung. My machine never hangs. It's a model
citizen. Windows Defender did not trip. An offline scan
with Windows Defender did not find anything. This is possibly
Process Monitor using all RAM, but that does not normally
happen until 20 minutes or more have passed, and I was only
running tracing for a minute or two.
Cygwin materials are held on mirror sites, and I was using
a mirror (University of Waterloo). For the time being, I would
recommend some isolation while you test that.
*******
On to msys2.
https://www.msys2.org/
Name: msys2-x86_64-20250830.exe
Size: 93,680,251 bytes (89 MiB)
SHA256:
B54705073678D32686A2CC356BB552363429E6CCBABBFECCB6D3CB7EC101E73B
"Last analysis 22 hours ago", so it is likely someone in this thread triggered a retest.
https://www.virustotal.com/gui/file/b54705073678d32686a2cc356bb552363429e6ccbabbfeccb6d3cb7ec101e73b
[Clean]
Install on disk is 350MB in C:\msys64
https://www.msys2.org/docs/installer/
C:/msys64/msys2_shell.cmd -defterm -here -no-start -ucrt64
# Do not run elevated (use the unelevated terminal)
# Windows Terminal prompt changes color
$ cd /c/msys64/usr/bin
$ file.exe file.exe
file.exe: PE32+ executable for MS Windows 5.02 (console), x86-64 (stripped to external PDB), 10 sections
$ cd /s/disktype
$ file disktype.exe
disktype.exe: PE32 executable for MS Windows 4.00 (console), Intel i386, 16 sections
# cygwin32 executable?
# I change directory to the corrupted Sent file and check it with the msys2 version.
$ file Sent
Sent: Mailbox text, 1st line "From - Wed Nov 26 06:13:35 2008"
# I compare to the WSL file command
$ file Sent
Sent: Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators
# The corruption detection...
This tells me the msys2 has an older version of magic determination
on the file.exe command.
I never tried cygwin64. For what I do, the level of compatibility
And for the cygwin64, use the rubber gloves on it.
It did not work as expected. Use your SafeHex handling
techniques, until it proves in for you.
Paul
It's not clear what the actual problem is. What is the use-case for a function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?
Is the result /meant/ to be fuzzy?
On 09/12/2025 09:43, Richard Heathfield wrote:
ly - Lilypond source
Off topic, but ... Lilypond is a lovely thing :)
On Sun, 7 Dec 2025 14:42:39 -0800, Chris M. Thomasson wrote:
You can return a float from is_binary_file() to show a probability? Not
exactly sure how you can 100% guarantee it...
Ha!
You know, that's a crazy idea but a darn cool idea at the same time!
It's not clear what the actual problem is. What is the use-case for a function that tells you whether any file /might/ be a text-file based on speculative analysis of its contents?
Is the result /meant/ to be fuzzy?
The fundamental problem is that no analysis of the contents can give you anything other than a fuzzy result. There's nothing more clearly a
"Data read in from a text stream will necessarily compare equal to the
data that were earlier written out to that stream only if: the data
consist only of printing characters and the control characters
horizontal tab and new-line; no new-line character is immediately
preceded by space characters; and the last character is a new-line character." (7.23.2p2).
I believe it therefore makes sense to consider something to be a text
file if it meets those requirements, and otherwise is a binary file.
Note that the last requirement implies that an empty file cannot qualify
as text - at a minimum, it must contain a new-line character.
This implies the use of the isprint() function; the only other
characters you need to handle specifically are '\t', '\n', and ' '.
Since the result returned by isprint() is locale-dependent, the program should, at least optionally, use setlocale().
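The conditions quoted from 7.23.2p2 can be sketched as a single pass over the stream. This is an illustration under stated assumptions, not a portable solution: `meets_text_stream_rules` is a hypothetical name, the stream is assumed to be opened in binary mode, and end-of-line is assumed to be a single '\n' byte. The result depends on the current locale, so a caller who wants something other than the "C" locale would call setlocale(LC_CTYPE, "") first.

```c
#include <ctype.h>
#include <stdio.h>

/*
 * Hypothetical checker for the 7.23.2p2 conditions. Returns 1 if the
 * stream satisfies them, 0 if not, -1 on read error. Assumes the file
 * marks end-of-line with a single '\n' byte, which is not portable.
 */
int meets_text_stream_rules(FILE *f)
{
    int ch;
    int prev = EOF;   /* EOF here means "no byte read yet" */

    while ((ch = getc(f)) != EOF) {
        if (ch == '\n') {
            if (prev == ' ')
                return 0;           /* new-line preceded by a space */
        } else if (ch != '\t' && !isprint(ch)) {
            return 0;               /* not printing, tab, or new-line */
        }
        prev = ch;
    }
    if (ferror(f))
        return -1;
    return prev == '\n';            /* also rejects an empty file */
}
```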
For yet another set of unreliable heuristics for guessing whether a file
is text or binary, you can take a look at Perl's built-in "-T" and "-B" operators.
On Mon, 8 Dec 2025 18:44:33 +0000, bart wrote:
It's not clear what the actual problem is. What is the use-case
for a function that tells you whether any file /might/ be a
text-file based on speculative analysis of its contents? Is
the result /meant/ to be fuzzy?
Hey bart.
What I mean is that since I have not yet defined a canonical
standard for my program, the goal here (to determine if my code
can parse the file) is unclear.
It means I need to plan much more *before* I write more code, no
mean feat when one is excited & ready to jump in =)
On 12/9/25 09:03, David Brown wrote:
But surely on Windows you can just look at the file extension - if it is ".txt", it's a text file, otherwise it's a binary file.
And what about PNM files, which can be pure ASCII encoded
yet are image files?
On 2025-12-06 20:37, James Kuyper wrote:
...
"Data read in from a text stream will necessarily compare equal to
the data that were earlier written out to that stream only if: the
data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately
preceded by space characters; and the last character is a new-line character." (7.23.2p2).
I believe it therefore makes sense to consider something to be a
text file if it meets those requirements, and otherwise is a binary
file. Note that the last requirement implies that an empty file
cannot qualify as text - at a minimum, it must contain a new-line character.
This implies the use of the isprint() function; the only other
characters you need to handle specifically are '\t', '\n', and ' '.
Since the result returned by isprint() is locale-dependent, the
program should, at least optionally, use setlocale().
I just realized an annoying complication. Whatever
implementation-specific method is used to indicate end-of-line can
only be portably identified as such by opening the file in text mode
and looking for the newline characters that it gets converted into.
But because of 7.23.2p2, text mode cannot be relied upon for
precisely the files we're trying to identify.
I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.
[...]
On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.
Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...
#include <stdio.h> // FILE, fopen, fread, fclose
#include <stddef.h> // size_t
// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)
int is_text_file(const char *path) {
    // Try opening the file in binary mode,
    // required so that bytes are read exact.
    FILE *f = fopen(path, "rb");
    if (!f) return -1; // Could not open file
    unsigned char buf[4096]; // 4KB chunks
    size_t n, i;
    // Read in file until EOF
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        for (i = 0; i < n; i++) {
            unsigned char c = buf[i];
            // 1. null byte is a very strong indication of binary data.
            // Text files virtually never contain 0x00.
            if (c == 0x00) {
                fclose(f);
                return 0; // Contains binary-only byte: NOT text
            }
            // 2. Check for raw C0 control codes (0x01–0x1F).
            // We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
            // Any other control code is highly suspicious and usually means binary.
            if (c < 0x20) {
                if (c != 0x09 && c != 0x0A && c != 0x0D) {
                    fclose(f);
                    return 0; // unexpected control character → NOT text
                }
            }
            // 3. NOTE: We intentionally do *not* reject bytes >= 0x80.
            // These occur in UTF-8, extended ASCII, and many local encodings.
            // Rejecting them would treat valid multilingual text as binary.
            // So we treat high bytes as acceptable for "probably text".
        }
    }
    fclose(f);
    return 1; // Probably text (no strong binary signatures found)
}
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.
Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...
The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.
At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().
In reality, I still don't see any benefit to this type of
heuristic-based approach.
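For anyone who wants to measure the mmap() suggestion rather than argue about it, here is a minimal POSIX-only sketch. The name is_text_file_mmap() is hypothetical, the return convention is copied from is_text_file() above, and treating an empty file as "probably text" is an assumption made only to match what the fread() version does.

```c
#define _POSIX_C_SOURCE 200809L  // request POSIX declarations under strict -std= modes
#include <fcntl.h>     // open
#include <stddef.h>    // size_t
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical name; returns -1 on error, 0 if a binary indicator is found,
// 1 if the file is probably text.
int is_text_file_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 1; }  // mmap() rejects a zero length

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    int result = 1;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = p[i];
        // Reject NUL and other C0 controls, except tab, LF and CR.
        if (c < ' ' && c != '\t' && c != '\n' && c != '\r') { result = 0; break; }
    }
    munmap(p, len);
    return result;
}
```

Every byte is still inspected one at a time; only the way the bytes reach the loop changes, so any difference has to be measured on the system in question.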
On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.
Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...
The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.
At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().
I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!
On Tue, 9 Dec 2025 16:29:39 -0500...
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
I just realized an annoying complication. Whatever
implementation-specific method is used to indicate end-of-line can
only be portably identified as such by opening the file in text mode
and looking for the newline characters that it gets converted into.
But because of 7.23.2p2, text mode cannot be relied upon for
precisely the files we're trying to identify.
Does not sound like a problem. According to my understanding, wide portability was never a part of the OP's spec.
Yes. Here's my 2nd attempt...
[...]
The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.
At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().
In reality, I still don't see any benefit to this type of
heuristic-based approach.
[...]
I would recommend against use of explicit numerical codes for
characters. They make your code dependent upon a particular encoding,
and you're free to make that choice, but for implementations where that
encoding is the default, the corresponding C escape sequences will have
precisely the correct value, and make it easier to understand what
your code is doing:
0x00 '\0'
0x09 '\t'
0x0A '\n'
0x0D '\r'
0x20 ' '
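Applied to the checks in is_text_file(), the substitution might look like the sketch below. The helper name is hypothetical and it works on a buffer rather than a file; the comparison c < ' ' still assumes that the unwanted control characters sort below the space character, as they do in ASCII.

```c
#include <stddef.h>  // size_t

// Hypothetical helper: the same tests as is_text_file(), written with
// character constants instead of numeric codes.
// Returns 0 if a binary indicator is found, 1 otherwise.
int looks_like_text(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c == '\0')
            return 0;  // null byte
        if (c < ' ' && c != '\t' && c != '\n' && c != '\r')
            return 0;  // control character other than tab, LF, CR
    }
    return 1;          // high bytes are accepted, as in the original
}
```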
Michael S <already5chosen@yahoo.com> writes:
On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.
Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...
The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.
At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().
I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!
I have, more than once, done such measurements after mmap()
was introduced in SVR4 circa 1989 (ported from SunOS).
On a single-user system, running a single job, the difference
for smaller files is in the noise. For larger files, or when
the system is heavily loaded or multiuser, it can be significant.
Last version for me (I have to pivot to other things).
[...]
On Wed, 10 Dec 2025 18:41:22 -0000 (UTC), Michael Sanders wrote:
Last version for me (I have to pivot to other things).
[...]
smaller look up table still + bit shifting!
*fastest implementation yet* but virtually unreadable =(
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
// is_text_file()
// Returns:
// -1 : could not open file
// 0 : is NOT a text file (binary indicators found)
// 1 : is PROBABLY a text file (no strong binary signatures)
int is_text_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    unsigned char chunk[4096];
    size_t n, i;
    // 128-bit bitmask (16 bytes × 8 bits / byte), 1=allowed, 0=disallowed
    // Allowed bytes: TAB(0x09), LF(0x0A), CR(0x0D), printable ASCII 0x20–0x7E
    // MASK[k] covers codes 8k..8k+7, least significant bit first, so
    // MASK[1] = 0x26 sets bits 1, 2 and 5: codes 0x09, 0x0A and 0x0D.
    static const uint8_t MASK[16] = {
        0x00, 0x26, 0x00, 0x00, // 0x00–0x1F: only TAB(09), LF(0A), CR(0D)
        0xFF, 0xFF, 0xFF, 0xFF, // 0x20–0x3F: SPC!"#$%&'()*+,-./0123456789:;<=>?
        0xFF, 0xFF, 0xFF, 0xFF, // 0x40–0x5F: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
        0xFF, 0xFF, 0xFF, 0x7F  // 0x60–0x7F: `abcdefghijklmnopqrstuvwxyz{|}~ (DEL excluded)
    };
    while ((n = fread(chunk, 1, sizeof(chunk), f)) > 0) {
        for (i = 0; i < n; i++) {
            if (chunk[i] < 128 && !(MASK[chunk[i] >> 3] & (1 << (chunk[i] & 7)))) {
                fclose(f);
                return 0; // binary indicator found
            }
            // bytes >= 128 are accepted as probably text
        }
    }
    fclose(f);
    return 1; // probably text
}
On 10/12/2025 17:18, Scott Lurndal wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.
Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...
The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.
At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().
I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!
I have, more than once, done such measurements after mmap()
was introduced in SVR4 circa 1989 (ported from SunOS).
On a single-user system, running a single job, the difference
for smaller files is in the noise. For larger files, or when
the system is heavily loaded or multiuser, it can be significant.
1989 is 36 years ago. Technology has moved on. If reading your file is
too slow to read, get yourself a real computer.
On my very ordinary desktop machine, I just freq'd[1] a
7,032,963,565-byte file in 12.256 seconds. That's 573,838,410 bytes
per second. It's a damn sight faster than I could do by hand.
How, exactly, are you using `slow'?
On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:
[...]
Keith if you get a chance see my reply to Lew 'is_text_file()'
Let me know if I've inched closer a step or two...
Michael Sanders <porkchop@invalid.foo> writes:
On Tue, 09 Dec 2025 15:42:59 -0800, Keith Thompson wrote:
[...]
Keith if you get a chance see my reply to Lew 'is_text_file()'
Let me know if I've inched closer a step or two...
Closer to what exactly?
In the parent article, I suggested that you likely don't need to
determine whether a file is "text" or "binary". You said you want
to parse a file. An attempt to parse it will fail either if the
input is binary or if it's text that doesn't match the grammar you
require. For example, a parser for C source code doesn't need to
check whether the input is binary or text. Certain input
characters will simply cause the parse to fail, and a syntax error
can be reported. Tell us more about how you want to parse files.
Are you parsing according to a formal grammar? Or is it more
ad-hoc?
Typically a soi-disant extended ASCII character set (e.g. ISO-8859-1)
has the first 32 codes starting at 128 defined as control characters.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
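If the input is known to be ISO-8859-1, that observation could be folded into the per-byte test roughly as follows. This is a sketch under that assumption only; the function name is hypothetical, and the extra range must not be applied to UTF-8 input, where bytes 0x80-0x9F occur as ordinary continuation bytes.

```c
// Hypothetical predicate, assuming ISO-8859-1 input (an assumption, not
// something the thread settled on). Returns 1 for bytes that would be
// unexpected in text, 0 otherwise.
int latin1_byte_is_suspicious(unsigned char c) {
    if (c == '\t' || c == '\n' || c == '\r') return 0;  // ordinary text formatting
    if (c < 0x20 || c == 0x7F) return 1;                // C0 controls and DEL
    return c >= 0x80 && c <= 0x9F;                      // C1 controls in ISO-8859-1
}
```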
On 10/12/2025 19:42, Richard Heathfield wrote:
On 10/12/2025 17:18, Scott Lurndal wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 10 Dec 2025 15:07:30 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael Sanders <porkchop@invalid.foo> writes:
On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:
I should have added that I feel that you probably haven't really
defined /what/ "text file" means, and that has interfered with
the development of this function. As Keith pointed out, the task
of distinguishing between a "text" file and a "binary" file is not
easy. I'll add that a lot of the difficulty stems from the fact
that there are many definitions (some conflicting) of what a "text"
file actually contains.
Yes. Here's my 2nd attempt following the template (of thinking)
you've suggested...
The problem with all of your attempts is the performance
issue. Success requires reading every single byte of the
file, one byte at a time. The word 'slow' is not sufficient
to describe how bad the performance will be for a very large
file.
At a minimum, dump the stdio double-buffered byte-by-byte
algorithm and use mmap().
I suggest to do actual speed measurements before making bold
claims like above. Don't trust your intuition!
I have, more than once, done such measurements after mmap()
was introduced in SVR4 circa 1989 (ported from SunOS).
On a single-user system, running a single job, the difference
for smaller files is in the noise. For larger files, or when
the system is heavily loaded or multiuser, it can be significant.
1989 is 36 years ago. Technology has moved on. If reading your file is too slow to read, get yourself a real computer.
On my very ordinary desktop machine, I just freq'd[1] a 7,032,963,565-byte file in 12.256 seconds. That's 573,838,410 bytes per second. It's a damn sight faster than I could do by hand.
How, exactly, are you using `slow'?
A getc loop took 4.3 seconds to read a 192MB file from SSD, on my Windows PC.
Under WSL it took 8.4 seconds (8.4/0.5 real/user).
However reading it all in one go took 0.14 seconds.
I guess not all 'getc' implementations are the same.
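A comparison along these lines can be reproduced with something like the sketch below. Both function names are hypothetical, the 64 KiB chunk size is an arbitrary choice, and the second function reads in chunks rather than in a single call, so it is not exactly the "all in one go" case. Timing is left to the caller, for example with clock() around each call.

```c
#include <stdio.h>  // FILE, fopen, getc, fread, fclose

// Hypothetical helpers for a rough comparison; each returns the number of
// bytes read from path, or -1 if the file cannot be opened.
long long count_with_getc(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    long long total = 0;
    while (getc(f) != EOF)  // one library call per byte
        total++;
    fclose(f);
    return total;
}

long long count_with_fread(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    static unsigned char buf[1 << 16];  // 64 KiB per call; arbitrary size
    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        total += (long long)n;
    fclose(f);
    return total;
}
```

Run each function more than once on the same file before comparing, since the first pass may include the cost of bringing the file into the operating system's cache.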
On Wed, 12/10/2025 5:37 PM, bart wrote:
A getc loop took 4.3 seconds to read a 192MB file from SSD, on my Windows PC.
Under WSL it took 8.4 seconds (8.4/0.5 real/user).
However reading it all in one go took 0.14 seconds.
I guess not all 'getc' implementations are the same.
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
/* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */
int main(int argc, char **argv)
{ FILE* source;
int c; /* getc holder */
const int size = 1000*1000*1000;
char keep[size];
Add some code after the fopen.
if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
{
fprintf(stderr, "setvbuf() failed\n\n" );
return -1;
}
Extra code was added so keep[] was not optimized away.
Read 1000000000 bytes in 001.075651 seconds
Busy sum = FFFFFFFFE216FE9C
That's getting close to a gigabyte per second.
Summary: The -O2 makes a BIG difference.
No idea how it is cheating.
On 11/12/2025 03:35, Paul wrote:
Yes. Under Linux/x64 the default stack size is 8MiB, under Windows/x64 [...]
On Wed, 12/10/2025 5:37 PM, bart wrote:
A getc loop took 4.3 seconds to read a 192MB file from SSD, on my
Windows PC.
Under WSL it took 8.4 seconds (8.4/0.5 real/user).
However reading it all in one go took 0.14 seconds.
I guess not all 'getc' implementations are the same.
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
/* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */
int main(int argc, char **argv)
{ FILE* source;
int c; /* getc holder */
const int size = 1000*1000*1000;
char keep[size];
I didn't see the point of either keeping the array on the stack, or
using a VLA. I made it static. That also allowed me a choice of
compilers with no special options needed.
Add some code after the fopen.
if (setvbuf(source, NULL, _IOFBF, 65536) != 0)
{
fprintf(stderr, "setvbuf() failed\n\n" );
return -1;
}
When I added that, it slowed it down! Maybe it was already using a
bigger buffer.
Extra code was added so keep[] was not optimized away.
My loop didn't store the characters anywhere; it just bumped a count.
I think it was enough that it was calling an external function,
'getc'; a compiler can't optimise that away.
Read 1000000000 bytes in 001.075651 seconds
Busy sum = FFFFFFFFE216FE9C
That's getting close to a gigabyte per second.
Summary: The -O2 makes a BIG difference.
No idea how it is cheating.
Have a look at the generated assembly: is it still making an actual
call to 'getc', or has it been inlined?
In my case -O2 made little difference, and it was still calling
getc(). -O2 can't affect such a precompiled function, unless getc() is
not really an external function: either a macro, or a wrapper.
Also, the generated EXE file actually imports getc from msvcrt.dll,
which is a library not known to be performant.
[...]
if (c >= 128 || (c < 128 && (MASK[c >> 3] & (1 << (c & 7))))) good++;
[...]
Am I close? Missing anything you'd consider to be (or not) needed?
On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
Am I close? Missing anything you'd consider to be (or not) needed?
Hi Michael,
I contract for the defense industry and badly need this function!
I am working with proposed code like:
if (is_binary_file(arg))
launch_nuclear_strike();
So I'm really sweating over the implementation, as you can imagine.
This thread has been very helpful.
I'm still leaning toward my paranoid function which just checks that
every bit of every byte is either 0 or 1 to confirm that the binary
system is used.
In the I/O error case, I will cautiously return a true value; we would
not want our side to lose due to a storage hardware issue.
I'm still leaning toward my paranoid function which just checks that
every bit of every byte is either 0 or 1 to confirm that the binary
system is used.
I'd be very interested in seeing how you implement that test, and even [...]
On 2025-12-06, Michael Sanders <porkchop@invalid.foo> wrote:
Am I close? Missing anything you'd consider to be (or not) needed?
Hi Michael,
I contract for the defense industry and badly need this function!
I am working with proposed code like:
if (is_binary_file(arg))
launch_nuclear_strike();
[...]
In the I/O error case, I will cautiously return a true value; we would
not want our side to lose due to a storage hardware issue.
static const uint8_t map[256] = {...
[...]
fclose(f);
return 1; // probably text
}
On 12/12/2025 2:54 PM, Michael Sanders wrote:
[...]
fclose(f);
return 1; // probably text
}
define the probability? Say in 0...1?
[...]
On Fri, 12 Dec 2025 15:33:01 -0800, Chris M. Thomasson wrote:
On 12/12/2025 2:54 PM, Michael Sanders wrote:
[...]
fclose(f);
return 1; // probably text
}
define the probability? Say in 0...1?
[...]
Add it Chris & I'll roll it in =)
Me? I'd go with steps of say, 10% just to
make it human-friendly, but that's just me.
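One way to read the "steps of 10%" idea is to count the bytes that pass the per-byte test and report their share, rounded down to a multiple of ten. The sketch below is only an illustration: the function name is hypothetical, the set of accepted bytes follows the bitmask version (tab, LF, CR, 0x20-0x7E, and anything >= 0x80), and scoring an empty buffer as 100 is an arbitrary choice.

```c
#include <stddef.h>  // size_t

// Hypothetical scoring helper (not from the thread): returns the share of
// bytes in buf that look like text, rounded down to 0, 10, ..., 100.
int text_score_percent(const unsigned char *buf, size_t n) {
    if (n == 0) return 100;  // arbitrary choice for an empty buffer
    size_t good = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c >= ' ' && c != 0x7F)
            good++;          // printable ASCII and bytes >= 0x80
        else if (c == '\t' || c == '\n' || c == '\r')
            good++;          // accepted control characters
    }
    return (int)(good * 10 / n) * 10;
}
```

A caller would still have to pick a threshold, which brings back the question of what "text" is supposed to mean for the program at hand.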