If you manage an improvement, please do post it here in the group
so I can learn more too.
On Sat, 15 Nov 2025 06:24:39 +0100, Bonita Montero wrote:
A little bugfix and a perfect style:
Very nice!
#include <iostream>
#include <bit>
#include <span>
#include <optional>
using namespace std;
optional<size_t> utf8Width( u8string_view str )
{
size_t w = 0;
for( auto it = str.begin(); it != str.end(); ++w ) [[likely]]
if( size_t head = countl_zero( (unsigned char)~*it ); head <= 4
&& (size_t)(str.end() - it) >= head + 1 ) [[likely]]
it += head + 1;
else
return nullopt;
return w;
}
int main()
{
cout << *utf8Width( u8"Hello, 世界!" ) << endl;
}
On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
Could you identify which document guarantees that every Unicode locale
contains "UTF-8"? Do you know what the domain of applicability of that
document is? It apparently does not cover my Ubuntu Linux system. The
command "locale -a" provides a list of all supported locales. Here's
what it says:
[...]
Hi James, umm 'guarantees'? No no... It does NOT verify:
- whether the environment actually supports UTF8 fully
- whether multibyte functions are enabled
- whether the terminal supports UTF8
- whether the C library supports UTF8 normalization
(combining characters, etc. but it seems to work well here)
To be sure: It's not a UTF-8 capability test. It's only a
locale-string check. So it likely misses many valid UTF8
locale variants...
Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
The best I can tell you at this stage is that it works on my end,
not a very satisfying reply I'm sure you'd agree. But till I learn
more about the issue that's the best I can offer.
If you manage an improvement, please do post it here in the group
so I can learn more too.
On 2025-11-18 15:17, Michael Sanders wrote:
On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
Could you identify which document guarantees that every Unicode locale contains "UTF-8"? Do you know what the domain of applicability of that document is? It apparently does not cover my Ubuntu Linux system. The command "locale -a" provides a list of all supported locales. Here's
what it says:
[...]
Hi James, umm 'guarantees'? No no... It does NOT verify:
- whether the environment actually supports UTF8 fully
- whether multibyte functions are enabled
- whether the terminal supports UTF8
- whether the C library supports UTF8 normalization
(combining characters, etc. but it seems to work well here)
To be sure: It's not a UTF-8 capability test. It's only a
locale-string check. So it likely misses many valid UTF8
locale variants...
If intended for use by anyone other than yourself, you should document
its limitations in that regard, either with in-code comments or in user documentation.
Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
The best I can tell you at this stage is that it works on my end,
not a very satisfying reply I'm sure you'd agree. But till I learn
more about the issue that's the best I can offer.
If you manage an improvement, please do post it here in the group
so I can learn more too.
There might be documents specifying locale naming standards, but I'm not aware of any. [...]
If your targets include Linux Mint, there's a chance the locale names
might be similar to those on my Ubuntu Linux system - but I'm no expert
on the differences between Linux distributions. If so, you should make
the "UTF" search case-insensitive, and make the '-' optional, which
would add considerable complexity to what is currently a very simple
routine.
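A minimal sketch of the suggestion above, assuming the check stays a pure string test on the locale name: it matches "UTF" case-insensitively and treats the '-' before the 8 as optional. The function name locale_name_mentions_utf8 is made up for illustration and is not from the posted program. Like the original check, it says nothing about whether the environment actually handles UTF-8.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if a locale name such as "en_US.UTF-8",
   "en_US.utf8" or "C.UTF8" mentions UTF-8, ignoring case and treating
   the '-' as optional. It inspects the string only. */
int locale_name_mentions_utf8(const char *name)
{
    assert(name != NULL);
    for (const char *p = name; *p; p++) {
        /* && short-circuits, so no byte past the terminator is read */
        if (tolower((unsigned char)p[0]) == 'u' &&
            tolower((unsigned char)p[1]) == 't' &&
            tolower((unsigned char)p[2]) == 'f') {
            const char *q = p + 3;
            if (*q == '-')
                q++;                      /* optional hyphen */
            if (*q == '8' && !isdigit((unsigned char)q[1]))
                return 1;                 /* "8" but not e.g. "80" */
        }
    }
    return 0;
}
```

The name would typically come from something like setlocale(LC_CTYPE, NULL) or an environment variable; that part is left to the caller.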
[...]
Even cooler. Now the code accepts usual string_views as well as u8string_views.
And if you supply a boolean template parameter before the ()-parameter which is true, the data is verified to be a valid UTF-8 string. If you supply false or omit the parameter, the string isn't validated.
Hi Bonita! These are nice c++/c examples you've provided.
Thanks for your input, I appreciate your code & remarks.
On 15/11/2025 05:24, Bonita Montero wrote:
A little bugfix and a perfect style:
#include <iostream>
#include <bit>
#include <span>
#include <optional>
using namespace std;
optional<size_t> utf8Width( u8string_view str )
{
size_t w = 0;
for( auto it = str.begin(); it != str.end(); ++w ) [[likely]]
if( size_t head = countl_zero( (unsigned char)~*it ); head
<= 4 && (size_t)(str.end() - it) >= head + 1 ) [[likely]]
it += head + 1;
else
return nullopt;
return w;
}
int main()
{
cout << *utf8Width( u8"Hello, 世界!" ) << endl;
}
The trouble with this is that I haven't a clue how it works or what
those extras do, or how they impact on performance.
A version in C is given below. This is much more straightforward. It
doesn't verify anything, but then I don't know if yours does either.
As for performance: I duplicated that test string to form one 104
times as long, then called that function one million times. Here are
the timings:
C gcc-O2 1.06 seconds
C bcc 1.17 seconds
C tcc 2.81 seconds
C++ g++-O2 4.6 seconds
C++ g++-O0 19 seconds
--------------------------
size_t utf8width(char* s) {
size_t length;
int c, n;
length=0;
while (c=*s) {
if ((c & 0x80) == 0) n = 1;
else if ((c & 0xE0) == 0xC0) n = 2;
else if ((c & 0xF0) == 0xE0) n = 3;
else n = 4;
s += n;
++length;
}
return length;
}
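For anyone wondering what the countl_zero line in the quoted C++ is getting at: the number of leading 1 bits in a lead byte tells you how long the sequence is. Below is a rough C sketch of that idea with basic checking added. It is not the posted code; the names leading_ones and utf8_count_checked are invented here, and it assumes that a lead byte with n >= 2 leading ones starts an n-byte sequence while a byte with none is a single ASCII byte. It does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading 1 bits in a byte (0..8). */
static unsigned leading_ones(unsigned char b)
{
    unsigned n = 0;
    while (n < 8 && (b & (0x80u >> n)))
        n++;
    return n;
}

/* Hypothetical helper: counts code points in s[0..len). Returns 1 and
   stores the count on success; returns 0 and leaves *count untouched
   on a malformed or truncated sequence. */
int utf8_count_checked(const char *s, size_t len, size_t *count)
{
    assert(s != NULL && count != NULL);
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned ones = leading_ones((unsigned char)s[i]);
        size_t seq = ones == 0 ? 1 : ones;   /* 0xxxxxxx is one byte */
        if (ones == 1 || ones > 4 || len - i < seq)
            return 0;   /* stray continuation, bad lead, or truncated */
        for (size_t k = 1; k < seq; k++)
            if (((unsigned char)s[i + k] & 0xC0) != 0x80)
                return 0;                    /* not a continuation byte */
        i += seq;
        n++;
    }
    *count = n;
    return 1;
}
```

The length is passed explicitly so that a truncated final sequence can be detected without reading past the buffer.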
On 21.11.2025 at 18:03, bart wrote:
[...]
Take a string of a number of UTF-8 characters with properly
mixed chunk lengths.
This code with AVX512BW and BMI1 is 13.5 times faster than yours on my Zen4 PC.
size_t utf8Width2( const char *s )
{
__m512i const
ZERO = _mm512_setzero_si512(),
ONE_MASK = _mm512_set1_epi8( (char)0x80 ),
ONE_HEAD = ZERO,
TWO_MASK = _mm512_set1_epi8( (char)0xE0 ),
TWO_HEAD = _mm512_set1_epi8( (char)0xC0 ),
THREE_MASK = _mm512_set1_epi8( (char)0xF0 ),
THREE_HEAD = _mm512_set1_epi8( (char)0xE0 ),
FOUR_MASK = _mm512_set1_epi8( (char)0xF8 ),
FOUR_HEAD = _mm512_set1_epi8( (char)0xF0 );
uintptr_t
begin = (uintptr_t)s,
base = begin & -64;
s = (char *)base;
size_t count = 0;
__m512i chunk;
uint64_t nzMask;
auto doChunk = [&]() L_FORCEINLINE
{
uint64_t
one = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, ONE_MASK ), ONE_HEAD ) & nzMask,
two = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, TWO_MASK ), TWO_HEAD ) & nzMask,
three = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, THREE_MASK ), THREE_HEAD ) & nzMask,
four = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, FOUR_MASK ), FOUR_HEAD ) & nzMask;
count += _mm_popcnt_u64( one ) + _mm_popcnt_u64( two ) + _mm_popcnt_u64( three ) + _mm_popcnt_u64( four );
};
chunk = _mm512_loadu_si512( s );
unsigned head = (unsigned)(begin - base);
nzMask = ~_mm512_cmpeq_epi8_mask( chunk, ZERO ) >> head;
unsigned ones = countr_one( nzMask );
nzMask &= ones < 64 ? (1ull << ones) - 1 : -1;
nzMask <<= head;
doChunk();
if( (int64_t)nzMask >= 0 )
return count;
for( ; ; )
{
s += 64;
chunk = _mm512_loadu_si512( s );
nzMask = ~_mm512_cmpeq_epi8_mask( chunk, ZERO );
ones = countr_one( nzMask );
nzMask = ones < 64 ? (1ull << ones) - 1 : -1;
if( !nzMask )
break;
doChunk();
}
return count;
}
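The intrinsics version above counts lead bytes a chunk at a time. The same idea can be sketched in plain C without any compiler-specific headers by processing 8 bytes per step and counting continuation bytes (10xxxxxx) instead. This is only an illustration under the assumption that the input is valid UTF-8 with a known length; the name utf8_count_swar is made up, and no timing claim is attached to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts code points in s[0..len) as
   "bytes that are not continuation bytes". No validation. */
size_t utf8_count_swar(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    const uint64_t ones = 0x0101010101010101ULL;
    size_t cont = 0, i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t x;
        memcpy(&x, s + i, 8);            /* avoids unaligned access issues */
        /* Per byte: bit 7 set and bit 6 clear means 10xxxxxx. */
        uint64_t m = (x >> 7) & (~x >> 6) & ones;
        /* Each byte of m is 0 or 1; the multiply sums them into the top byte. */
        cont += (size_t)((m * ones) >> 56);
    }
    for (; i < len; i++)                 /* tail, one byte at a time */
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;

    return len - cont;
}
```

On valid input this gives the same count as counting lead bytes. It differs on invalid bytes such as 0xF8..0xFF, which the lead-byte masks above do not match but this sketch counts.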
Doesn't compile, even after I add suitable *intrin headers.
I took out L_FORCEINLINE (not recognised); added std:: to countr_one,
but it still gave me errors like this:
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/popcntintrin.h: In lambda function:
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/popcntintrin.h:42:1: error: inlining failed in call to 'always_inline' 'long long int _mm_popcnt_u64(long long unsigned int)': target specific option mismatch
42 | _mm_popcnt_u64 (unsigned long long __X)
| ^~~~~~~~~~~~~~
You have to give complete compilable code or have only simple
dependencies like stdio.h.
On 22.11.2025 at 14:38, bart wrote:
[...]
Try __attribute__((always_inline)) instead. The code requires AVX-512 to be
enabled at compile time with g++ and an AVX-512-capable CPU (Intel since
the Skylake-X Xeons, AMD since Zen4).
If you want to test for an older CPU you can stick with the below code, which is AVX2.
Still doesn't work. I'm using g++ 14.1.0. It doesn't like 'countr_one'
with or without std::
Use -std=c++20.
Would it hurt to post a complete, compilable program? Plus the
compiler invocation if it needs anything unusual.
I'm using Visual C++ or clang-cl (MSVC-compatible clang).
It only needs a minimal main() routine which I can tweak to my test
input. Unless all it needs to use it is a call to utf8Width("abc")
which returns a simple integer.
It works the same as your code, i.e. it takes a char-pointer.
But ATM my C version is still faster!
For sure not as fast as my AVX (seven times) / AVX-512 (13.5 times)
Take this and compile with -mavx512bw and -std=c++23.
#include <iostream>
#include <string_view>
#include <bit>
#include <algorithm>
#include <random>
#include <array>
#include <span>
#include <chrono>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <x86intrin.h>
#endif
#include "inline.h"
On 22/11/2025 15:05, Bonita Montero wrote:
[...]
I don't have 'inline.h'. If I comment that out, then I get the errors
below from 'g++ -std=c++23 prog.c', also with -Wno-inline.
Your code seems incredibly fragile.
c.cpp: In function 'size_t utf8Width512(const char*)':
c.cpp:72:37: warning: AVX512F vector return without AVX512F enabled
changes the ABI [-Wpsabi]
72 | ZERO = _mm512_setzero_si512(),
| ^
c.cpp: In function 'size_t utf8Width256(const char*)':
c.cpp:123:37: warning: AVX vector return without AVX enabled changes
the ABI [-Wpsabi]
123 | ZERO = _mm256_setzero_si256(),
| ^
In file included from C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/x86gprintrin.h:73,
from C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/x86intrin.h:27,
from c.cpp:13:
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/popcntintrin.h: In lambda function:
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/popcntintrin.h:42:1: error: inlining failed in call to 'always_inline' 'long long int _mm_popcnt_u64(long long unsigned int)': target specific option mismatch
42 | _mm_popcnt_u64 (unsigned long long __X)
| ^~~~~~~~~~~~~~
c.cpp:95:106: note: called from here
95 | count += _mm_popcnt_u64( one ) + _mm_popcnt_u64( two ) + _mm_popcnt_u64( three ) + _mm_popcnt_u64( four );
| ~~~~~~~~~~~~~~^~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/popcntintrin.h:42:1: error: inlining failed in call to 'always_inline' 'long long int _mm_popcnt_u64(long long unsigned int)': target specific option mismatch
42 | _mm_popcnt_u64 (unsigned long long __X)
| ^~~~~~~~~~~~~~
c.cpp:95:80: note: called from here
95 | count += _mm_popcnt_u64( one ) + _mm_popcnt_u64( two ) + _mm_popcnt_u64( three ) + _mm_popcnt_u64( four );
| ~~~~~~~~~~~~~~^~~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/popcntintrin.h:42:1: error: inlining failed in call to 'always_inline' 'long long int _mm_popcnt_u64(long long unsigned int)': target specific option mismatch
42 | _mm_popcnt_u64 (unsigned long long __X)
| ^~~~~~~~~~~~~~
c.cpp:95:56: note: called from here
95 | count += _mm_popcnt_u64( one ) + _mm_popcnt_u64( two ) + _mm_popcnt_u64( three ) + _mm_popcnt_u64( four );
| ~~~~~~~~~~~~~~^~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/popcntintrin.h:42:1: error: inlining failed in call to 'always_inline' 'long long int _mm_popcnt_u64(long long unsigned int)': target specific option mismatch
42 | _mm_popcnt_u64 (unsigned long long __X)
| ^~~~~~~~~~~~~~
c.cpp:95:32: note: called from here
95 | count += _mm_popcnt_u64( one ) + _mm_popcnt_u64( two ) + _mm_popcnt_u64( three ) + _mm_popcnt_u64( four );
| ~~~~~~~~~~~~~~^~~~~~~
In file included from C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/immintrin.h:65,
from C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/x86intrin.h:32:
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512bwintrin.h:1716:1: error: inlining failed in call to 'always_inline' '__mmask64 _mm512_cmpeq_epi8_mask(__m512i, __m512i)': target specific option mismatch
1716 | _mm512_cmpeq_epi8_mask (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~~~~~~~
c.cpp:94:42: note: called from here
94 | four = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, FOUR_MASK ), FOUR_HEAD ) & nzMask;
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/immintrin.h:55:
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512fintrin.h:10651:1: error: inlining failed in call to 'always_inline' '__m512i _mm512_and_si512(__m512i, __m512i)': target specific option mismatch
10651 | _mm512_and_si512 (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~
c.cpp:94:42: note: called from here
94 | four = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, FOUR_MASK ), FOUR_HEAD ) & nzMask;
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512bwintrin.h:1716:1: error: inlining failed in call to 'always_inline' '__mmask64 _mm512_cmpeq_epi8_mask(__m512i, __m512i)': target specific option mismatch
1716 | _mm512_cmpeq_epi8_mask (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~~~~~~~
c.cpp:93:43: note: called from here
93 | three = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, THREE_MASK ), THREE_HEAD ) & nzMask,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512fintrin.h:10651:1: error: inlining failed in call to 'always_inline' '__m512i _mm512_and_si512(__m512i, __m512i)': target specific option mismatch
10651 | _mm512_and_si512 (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~
c.cpp:93:43: note: called from here
93 | three = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, THREE_MASK ), THREE_HEAD ) & nzMask,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512bwintrin.h:1716:1: error: inlining failed in call to 'always_inline' '__mmask64 _mm512_cmpeq_epi8_mask(__m512i, __m512i)': target specific option mismatch
1716 | _mm512_cmpeq_epi8_mask (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~~~~~~~
c.cpp:92:41: note: called from here
92 | two = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, TWO_MASK ), TWO_HEAD ) & nzMask,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512fintrin.h:10651:1: error: inlining failed in call to 'always_inline' '__m512i _mm512_and_si512(__m512i, __m512i)': target specific option mismatch
10651 | _mm512_and_si512 (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~
c.cpp:92:41: note: called from here
92 | two = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, TWO_MASK ), TWO_HEAD ) & nzMask,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512bwintrin.h:1716:1: error: inlining failed in call to 'always_inline' '__mmask64 _mm512_cmpeq_epi8_mask(__m512i, __m512i)': target specific option mismatch
1716 | _mm512_cmpeq_epi8_mask (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~~~~~~~
c.cpp:91:41: note: called from here
91 | one = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, ONE_MASK ), ONE_HEAD ) & nzMask,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/tdm/lib/gcc/x86_64-w64-mingw32/14.1.0/include/avx512fintrin.h:10651:1: error: inlining failed in call to 'always_inline' '__m512i _mm512_and_si512(__m512i, __m512i)': target specific option mismatch
10651 | _mm512_and_si512 (__m512i __A, __m512i __B)
| ^~~~~~~~~~~~~~~~
c.cpp:91:41: note: called from here
91 | one = _mm512_cmpeq_epi8_mask( _mm512_and_si512( chunk, ONE_MASK ), ONE_HEAD ) & nzMask,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can compile the code with -mavx512bw.
This is "inline.h":
On 22/11/2025 17:13, Bonita Montero wrote:
You can compile the code with -mavx512bw.
This is "inline.h":
But I now get, from:
g++ =std=c++23 -mavx512bw -O2 c.cpp
the errors shown below. I tried -fconcepts too.
So, what also do I need? (So far you're not selling C++ very well!)
---------------------------------
c.cpp:33:54: warning: use of C++23 'make_signed_t<size_t>' integer
constant
33 | if( (*it & 0xC0) == 0x80 || width > min( 4Z, rem )
) [[unlikely]]
| ^~
c.cpp:24:5: error: 'requires' does not name a type
24 | requires std::same_as<View, string_view> || std::same_as<View, u8string_view>
| ^~~~~~~~
c.cpp:24:5: note: 'requires' only available with '-std=c++20' or '-fconcepts'
c.cpp: In function 'size_t utf8widthC(const char*)':
c.cpp:52:10: error: 'char8_t' was not declared in this scope; did you
mean 'wchar_t'?
52 | for( char8_t c; (c = *str); ++length )
| ^~~~~~~
| wchar_t
c.cpp:52:22: error: 'c' was not declared in this scope
52 | for( char8_t c; (c = *str); ++length )
| ^
c.cpp: In function 'size_t utf8Width512(const char*)':
c.cpp:99:21: error: 'countr_one' was not declared in this scope
99 | unsigned ones = countr_one( nzMask );
| ^~~~~~~~~~
c.cpp: In function 'size_t utf8Width256(const char*)':
c.cpp:150:21: error: 'countr_one' was not declared in this scope
150 | unsigned ones = countr_one( nzMask );
| ^~~~~~~~~~
c.cpp: In function 'int main()':
c.cpp:192:5: error: 'span' was not declared in this scope
192 | span ranges( rawRanges );
| ^~~~
c.cpp:192:5: note: 'std::span' is only available from C++20 onwards
c.cpp:193:5: error: 'char8_t' was not declared in this scope; did you
mean 'wchar_t'?
193 | char8_t rawTypeHeads[4] { 0, 0xC0, 0xE0, 0xF0 };
| ^~~~~~~
| wchar_t
c.cpp:194:9: error: expected ';' before 'typeHeads'
194 | span typeHeads( rawTypeHeads );
| ^~~~~~~~~~
| ;
c.cpp:196:5: error: 'u8string' was not declared in this scope
196 | u8string u8Str( BUF_MIN + 3, (char8_t)0 );
| ^~~~~~~~
c.cpp:196:5: note: 'std::u8string' is only available from C++20 onwards
c.cpp:197:20: error: 'u8string' does not name a type
197 | using u8s_it = u8string::iterator;
| ^~~~~~~~
c.cpp:198:5: error: 'u8s_it' was not declared in this scope
198 | u8s_it
| ^~~~~~
c.cpp:201:30: error: 'itChar' was not declared in this scope
201 | for( size_t width, type; itChar < itCharEnd; itChar += width )
| ^~~~~~
c.cpp:201:39: error: 'itCharEnd' was not declared in this scope
201 | for( size_t width, type; itChar < itCharEnd; itChar += width )
| ^~~~~~~~~
c.cpp:205:23: error: 'ranges' was not declared in this scope; did you
mean 'rawRanges'?
205 | char32_t c = (ranges[type])( mt );
| ^~~~~~
| rawRanges
c.cpp:206:20: error: expected ';' before 'itTail'
206 | for( u8s_it itTail = itChar + width; --itTail > itChar; c >>= 6 )
| ^~~~~~~
| ;
c.cpp:206:48: error: 'itTail' was not declared in this scope
206 | for( u8s_it itTail = itChar + width; --itTail > itChar; c >>= 6 )
| ^~~~~~
c.cpp:208:19: error: 'typeHeads' was not declared in this scope
208 | *itChar = typeHeads[type] | (char8_t)c;
| ^~~~~~~~~
c.cpp:210:5: error: 'u8Str' was not declared in this scope
210 | u8Str.resize( itChar - u8Str.begin() );
| ^~~~~
c.cpp:210:19: error: 'itChar' was not declared in this scope
210 | u8Str.resize( itChar - u8Str.begin() );
| ^~~~~~
c.cpp:228:25: error: 'u8string' is not a type
228 | bench( "my: ", [&]( u8string const &str ) { total += utf8Width256( (char *)str.c_str() ); } );
| ^~~~~~~~
c.cpp: In lambda function:
c.cpp:228:84: error: request for member 'c_str' in 'str', which is of non-class type 'const int'
228 | bench( "my: ", [&]( u8string const &str ) { total += utf8Width256( (char *)str.c_str() ); } );
| ^~~~~
c.cpp: In function 'int main()':
c.cpp:229:27: error: 'u8string' is not a type
229 | bench( "nerd: ", [&]( u8string const &str ) { total += utf8widthC( (char *)str.c_str() ); } );
| ^~~~~~~~
c.cpp: In lambda function:
c.cpp:229:84: error: request for member 'c_str' in 'str', which is of non-class type 'const int'
229 | bench( "nerd: ", [&]( u8string const &str ) { total += utf8widthC( (char *)str.c_str() ); } );
| ^~~~~
A lot of the errors look like you haven't enabled C++23 properly.
Can you install a current g++ ? Maybe the newest from the repository
is sufficient.
On 22/11/2025 17:44, Bonita Montero wrote:
A lot of the errors look like you haven't enabled C++23 properly.
Can you install a current g++ ? Maybe the newest from the repository
is sufficient.
I said in a followup that I'd typed =std instead of -std, which didn't generate any error from the compiler.
But I managed to compile it. However the long program with a
complicated main() just crashed trying to run it, sometime before it
got to the actual UTF8 bit.
So I applied those headers and options to the first mm512
single-function version you posted. There I only had to add std:: to
those countr_one calls.
I used this test driver
int main() {
size_t n = 0;
n = utf8Width("Hello, 世界!" );
printf("%zu\n", n);
}
And it crashes inside that function.
It's all just too damn complicated, sorry. It might well be fast, but
that's no good if it is troublesome to build and run for anyone else.
Another factor is this: each build, even at -O0, takes 3 whole seconds
on my machine. That must be a huge pile of junk it is including.
Building my C version takes some 1/20th of a second (even gcc takes
only 0.3 seconds).
On 22/11/2025 17:35, bart wrote:
On 22/11/2025 17:13, Bonita Montero wrote:
You can compile the code with -mavx512bw.
This is "inline.h":
But I now get, from:
g++ =std=c++23 -mavx512bw -O2 c.cpp
the errors shown below. I tried -fconcepts too.
So, what also do I need? (So far you're not selling C++ very well!)
[...]
Wait, there's a "=std" in that command line instead of
"-std". Apparently it is not an error (?).
bart <bc@freeuk.com> writes:
[...]
Wait, there's a "=std" in that command line instead of
"-std". Apparently it is not an error (?).
It seems that gcc and g++ interpret any unrecognized command line
argument as the name of a "linker input file".
BTW, comp.lang.c++ is down the hall, just past the water cooler.
static int utf8_width(const char *s) {
int w = 0;
const unsigned char *p = (const unsigned char *)s;
while (*p) {
if (*p < 0x80) { w++; p++; } // ASCII 1-byte
else if ((*p & 0xE0) == 0xC0) { w++; p += 2; } // 2-byte UTF-8
else if ((*p & 0xF0) == 0xE0) { w++; p += 3; } // 3-byte UTF-8
else if ((*p & 0xF8) == 0xF0) { w++; p += 4; } // 4-byte UTF-8
else { w++; p++; } // fallback
}
return w;
}
On 22/11/2025 23:24, Keith Thompson wrote:
bart <bc@freeuk.com> writes:
[...]
Wait, there's a "=std" in that command line instead of
"-std". Apparently it is not an error (?).
It seems that gcc and g++ interpret any unrecognized command line
argument as the name of a "linker input file".
It looks like it compiles any source code first, so won't get around to reporting an error if that compilation fails.
BTW, comp.lang.c++ is down the hall, just past the water cooler.
This was supposed to be about comparing a C approach to C++. Except there
were problems in getting the 'fast' C++ code to compile and then to run.
I think I'll stick with the simple C version which can also be trivially ported to any language as there are no heavy dependencies.
Do you need this to work under non-UTF-8 locales? If you only need that length when the locale is UTF-8, why not just use mblen from stdlib.h?
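A minimal sketch of the mblen approach, assuming the program has already called setlocale(LC_CTYPE, "") and that the resulting locale really is a UTF-8 one (in the default "C" locale only plain ASCII is reliably counted). The function name mb_char_count is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts multibyte characters in s according to the
   current LC_CTYPE locale; returns (size_t)-1 on an invalid sequence. */
size_t mb_char_count(const char *s)
{
    assert(s != NULL);
    size_t remaining = strlen(s), count = 0;

    (void)mblen(NULL, 0);            /* reset any shift state */
    while (remaining > 0) {
        int n = mblen(s, remaining);
        if (n <= 0)                  /* -1: invalid or incomplete sequence */
            return (size_t)-1;
        s += n;
        remaining -= (size_t)n;
        count++;
    }
    return count;
}
```

This counts characters, not display columns, and it depends entirely on the C library's idea of the current locale, which is the portability concern raised elsewhere in the thread.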
Hi James, umm 'guarantees'? No no... It does NOT verify:
- whether the environment actually supports UTF8 fully
- whether multibyte functions are enabled
- whether the terminal supports UTF8
- whether the C library supports UTF8 normalization
(combining characters, etc. but it seems to work well here)
To be sure: It's not a UTF-8 capability test. It's only a
locale-string check. So it likely misses many valid UTF8
locale variants...
Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
Windows has the ...W() APIs along with codepage-based APIs with
the ...A() Suffix. The W()-APIs support UTF-16, so no need for
We want portability across diverse OSs. In my case, the program
does NOT care what the character is, it simply needs to be able
to find it when searching data & displaying it in an ordered way.
The code below works perfectly:
#include <stdio.h>
#include <string.h>
int utf8_display_width(const char *s) {
int w = 0;
while (*s) {
unsigned char b = *s;
unsigned cp;
int n;
// UTF-8 decoder
if (b <= 0x7F) { // 1-byte ASCII
cp = b;
n = 1;
} else if (b >= 0xC0 && b <= 0xDF) { // 2-byte
cp = ((b & 0x1F) << 6) |
(s[1] & 0x3F);
n = 2;
} else if (b >= 0xE0 && b <= 0xEF) { // 3-byte
cp = ((b & 0x0F) << 12) |
((s[1] & 0x3F) << 6) |
(s[2] & 0x3F);
n = 3;
} else if (b >= 0xF0 && b <= 0xF7) { // 4-byte
cp = ((b & 0x07) << 18) |
((s[1] & 0x3F) << 12) |
((s[2] & 0x3F) << 6) |
(s[3] & 0x3F);
n = 4;
} else { // invalid, treat as 1-byte
cp = b;
n = 1;
}
// display width
if (cp >= 0x0300 && cp <= 0x036F) {} // combining marks like é (zero width)
else if ( // double-width characters...
(cp >= 0x1100 && cp <= 0x115F) || // hangul jamo
(cp >= 0x2E80 && cp <= 0xA4CF) || // cjk radicals & unified ideographs
(cp >= 0xAC00 && cp <= 0xD7A3) || // hangul syllables
(cp >= 0xF900 && cp <= 0xFAFF) || // cjk compatibility ideographs
(cp >= 0x1F300 && cp <= 0x1FAFF) // emoji + symbols
) { w += 2; }
// exceptional wide characters (unicode requirement I've read elsewhere)
else if (cp == 0x2329 || cp == 0x232A) { w += 2; }
else { w += 1; } // normal width for everything else
s += n;
}
return w;
}
int main(void) {
const char *tests[] = {
"hello",
"Café",
"漢字",
"✓",
"🙂",
NULL
};
// find maximum display width in 1st column
int maxw = 0;
for (int i = 0; tests[i]; i++) {
int w = utf8_display_width(tests[i]);
if (w > maxw) maxw = w;
}
// total padding after each 1st column + 3 spaces
int total_pad = maxw + 3;
for (int i = 0; tests[i]; i++) {
int w = utf8_display_width(tests[i]);
int sl = strlen(tests[i]);
printf("%s", tests[i]);
int pad = total_pad - w;
while (pad-- > 0) putchar(' ');
printf("strlen: %d utf8 display width: %d\n", sl, w);
}
return 0;
}
// eof
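One thing the decoder above does not guard against is a multi-byte sequence cut short at the end of the string, where s[1], s[2] or s[3] may lie past the terminating NUL. A hedged sketch of a single decode step that stops at the first byte that is not a continuation byte follows; the name utf8_decode_step is made up, and falling back to consuming one byte mirrors the "invalid, treat as 1-byte" branch rather than any standard behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes one code point from the NUL-terminated
   string s into *cp and returns the number of bytes consumed (1..4).
   A bad lead byte or a missing/invalid continuation byte is consumed as
   a single byte, with *cp set to that byte's value. */
int utf8_decode_step(const char *s, unsigned long *cp)
{
    assert(s != NULL && cp != NULL && *s != '\0');
    unsigned char b = (unsigned char)s[0];
    unsigned long v;
    int n;

    if (b <= 0x7F)                   { *cp = b; return 1; }
    else if (b >= 0xC0 && b <= 0xDF) { n = 2; v = b & 0x1F; }
    else if (b >= 0xE0 && b <= 0xEF) { n = 3; v = b & 0x0F; }
    else if (b >= 0xF0 && b <= 0xF7) { n = 4; v = b & 0x07; }
    else                             { *cp = b; return 1; }

    for (int k = 1; k < n; k++) {
        unsigned char c = (unsigned char)s[k];
        if ((c & 0xC0) != 0x80) {    /* also stops at the terminating NUL */
            *cp = b;
            return 1;
        }
        v = (v << 6) | (c & 0x3F);
    }
    *cp = v;
    return n;
}
```

A width routine could call this in its loop and advance by the return value, keeping the range tests on cp unchanged.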
On 2025-12-03 13:33, Michael Sanders wrote:
...
We want portability across diverse OSs. In my case, the program
does NOT care what the character is, it simply needs to be able
to find it when searching data & displaying it in an ordered way.
The code below works perfectly:
[...]
I find it confusing that this is supposed to "work perfectly" "across
diverse OSs". The amount of space that a character takes up varies
depending upon the installed fonts, especially on whether the font is
monospaced or proportional. Those fonts can be different for display on
screen or on a printer. I don't see any query to determine even what the
current font is, much less what its characteristics are. I don't know
of any OS-independent way of collecting such information. Does this
solution "work perfectly" only for your own particular favorite font?
It sounds as a luck. é in your text just happened to be encoded as [...]
This looks like a solution for a fixed-pitch font. I get this output
for a Windows console display (with - used for space):
hello---strlen: 5 utf8 display width: 5
Café----strlen: 5 utf8 display width: 4
漢字----strlen: 6 utf8 display width: 4
✓-------strlen: 3 utf8 display width: 1
🙂------strlen: 4 utf8 display width: 2
I was hoping this would be lined up, but already, in a Thunderbird
edit window, the last lines aren't lined up properly.
Same problem with Notepad (fixed pitch) and LibreOffice (fixed pitch).
It only looks alright in Windows and WSL consoles/terminals. But
maybe that's all that's needed.
bart <bc@freeuk.com> writes:
On 03/12/2025 19:01, James Kuyper wrote:
[...]
I find it confusing that this is supposed to "work perfectly" "across
diverse OSs". The amount of space that a character takes up varies
depending upon the installed fonts, especially on whether the font is
monospaced or proportional. Those fonts can be different for display on
screen or on a printer. I don't see any query to determine even what the
current font is, much less what its characteristics are. I don't know
of any OS-independent way of collecting such information. Does this
solution "work perfectly" only for your own particular favorite font?
This looks like a solution for a fixed-pitch font. I get this output
for a Windows console display (with - used for space):
I think bart is right that this is specific to fixed-width fonts.
For a variable width font, 'W' is going to be wider than '|'.
See also the POSIX `int wcwidth(wchar_t wc)` function, which returns
the "number of column positions of a wide-character code". It does
depend on the current locale.
The assumption seems to be that fixed-width fonts are expected to be consistent about the widths of characters.
On Wed, 3 Dec 2025 06:24:23 +0100, Bonita Montero wrote:
Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
Windows has the ...W() APIs along with codepage-based APIs with
the ...A() suffix. The W()-APIs support UTF-16, so no need for
[...]
VC++ supports C- and C++ locale if you like to have it portable.
Hi Bonita.
Yes that's correct, but...
- that assumes we know in advance what the character is
- it would only work under Windows
We want portability across diverse OSs. In my case, the program
does NOT care what the character is, it simply needs to be able
to find it when searching data & displaying it in an ordered way.
The code below works perfectly:
[...]
Can C handle that with those means given by the standard itself.