• Re: Crisis? What Crisis?

    From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:08:58 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC), Thomas Koenig wrote:

    [LAPACK] is certainly in use by very many people, if indirectly, for
    example by Python or R.

    Certainly used by NumPy:

    ldo@theon:~> apt-cache depends python3-numpy
    python3-numpy
    ...
    |Depends: libblas3
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:11:37 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 23:11:38 +0300, Michael S wrote:

    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave
    uses relatively new implementations, and pretty sure that the same
    goes for Matlab.

    On my system, Octave uses exactly the same version of LAPACK as NumPy
    does:

    ldo@theon:~> apt-cache depends octave
    octave
    ...
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:17:19 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:20:16 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 22:22:32 -0000 (UTC), Waldek Hebisch wrote:

    In many cases one can enlarge data structures to multiple of SIMD vector
    size (and align them properly). There requires some extra code, but mot
    too much and all of it is outside inner loop. So, there is some waste,
    but rather small due to unused elements.

    Of course, there is still trouble due to different SIMD vector size
    and/or different SIMD instructions sets.

    Just so long as you keep such optimized data structures *internal* to the program, and don’t make them part of any public interchange format!

    Interchange formats tend to outlive the original technological milieu they were created in, and decisions made for the sake of technical limitations
    of the time can end up looking rather ... anachronistic ... just a few
    years down the track.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Oct 19 01:56:03 2025
    From Newsgroup: comp.arch

    On 10/18/2025 10:16 AM, David Brown wrote:
    On 18/10/2025 03:05, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at
    tandem
    OLEDs.


    Like many things, human perception of brightness is not linear - it is somewhat logarithmic.  So even though we might not be able to
    distinguish anywhere close to 2000 different nuances of one primary
    colour, we /can/ perceive a very wide dynamic range.  Having a large
    number of bits on a linear scale can be more convenient in practice than trying to get accurate non-linear scaling.


    Possible, but it is a question whether high bit depth would make much
    difference. We are still in a case where HDMI usually sends 8 or
    sometimes 10 bits per channel, but displays are generally limited to 5
    or 6 bits (and may then dither stuff on the display side).


    Then we have:
    Traditional LCD: Uses a fluorescent backlight;
    LED: Typically LCD + LED backlights;
    OLED: Panel itself uses LEDs
    Typically much more expensive;
    Notoriously short lifespan.

    I have a display, LED+LCD tech, it has an HDR mode, but it isn't great.
    As noted, it seems like it mostly turns up the brightness and uses image processing wonk (which adds a bunch of artifacts).

    And, if I wanted 25% brighter, I could turn the brightness setting from
    40 to 50 or similar (checks, current settings being 40% brightness, 60% contrast).



    Then, we have HDR in 3D rendering which is, as noted, not usually about
    the monitor, but about using floating-point for rendering (typically
    with LDR for the final output).

    Often it still makes sense to use LDR for textures, but then HDR for the framebuffer (since the HDR is usually more a product of the lighting
    than the materials).

    Binary16 is plenty of precision for framebuffer.
    Though, often FP8U (E4.M4) is likely to still be acceptable.

    Where:
    E3.M5: Not really enough dynamic range.
    E4.M4: OK (Comparable to RGB555)
    E5.M3: Image quality is poor (worse than RGB555).

    We usually give up sign with smaller formats, assuming that any values
    which would go negative are clamped to 0, as it is harder in this case
    to justify spending a bit on being able to represent negative colors.

    For native Binary16, may as well allow negatives.
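
    As a rough illustration of the E4.M4 idea (unsigned, 4 exponent bits, 4
    mantissa bits, negatives clamped to zero), a minimal pack/unpack sketch in
    C; the bias and rounding choices here are just placeholders for
    illustration, not a description of any particular codec:

    #include <stdint.h>
    #include <math.h>

    static uint8_t pack_fp8u_e4m4(float v) {
        if (v <= 0.0f) return 0;                 /* clamp negatives to zero */
        int e;
        float m = frexpf(v, &e);                 /* v = m * 2^e, m in [0.5,1) */
        int exp = e + 6;                         /* assumed bias */
        if (exp <= 0) return 0;                  /* underflow -> zero */
        if (exp > 15) return 0xFF;               /* overflow -> max */
        int mant = (int)((m * 2.0f - 1.0f) * 16.0f + 0.5f);  /* drop implicit 1 */
        if (mant > 15) { mant = 0; if (++exp > 15) return 0xFF; }
        return (uint8_t)((exp << 4) | mant);
    }

    static float unpack_fp8u_e4m4(uint8_t b) {
        int exp = b >> 4, mant = b & 15;
        if (exp == 0) return 0.0f;               /* treat exponent 0 as zero */
        return ldexpf(1.0f + mant / 16.0f, exp - 7);
    }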



    There is a question of the best way to store HDR images:
    4x FP16: High quality, but expensive
    4x FP8U: More affordable, can do RGBA
    RGB8_E8: good for opaque images, works OK.
    RGB8_EA4: OK, non-standard.
    RGB9_E5: Good for opaque images
    RG11_B10: E5.M6 | E5.M5
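
    Of the options above, RGB9_E5 is the one with a fixed public layout (three
    9-bit mantissas sharing one 5-bit exponent, bias 15). A simplified packing
    sketch in C, not a drop-in for the exact OpenGL-specified conversion:

    #include <stdint.h>
    #include <math.h>

    static uint32_t pack_rgb9e5(float r, float g, float b) {
        const float max_val = 65408.0f;          /* (511/512) * 2^16 */
        if (r < 0.0f) r = 0.0f;
        if (g < 0.0f) g = 0.0f;
        if (b < 0.0f) b = 0.0f;
        float m = r > g ? r : g; if (b > m) m = b;
        if (m <= 0.0f) return 0;
        if (m > max_val) m = max_val;

        int e;
        frexpf(m, &e);                           /* only the exponent is needed */
        int shared_exp = e + 15;                 /* stored exponent, bias 15 */
        if (shared_exp < 0) shared_exp = 0;
        if (shared_exp > 31) shared_exp = 31;

        /* mantissa = value * 2^(9 + 15 - shared_exp) */
        float scale = ldexpf(1.0f, 24 - shared_exp);
        uint32_t rm = (uint32_t)(r * scale + 0.5f); if (rm > 511) rm = 511;
        uint32_t gm = (uint32_t)(g * scale + 0.5f); if (gm > 511) gm = 511;
        uint32_t bm = (uint32_t)(b * scale + 0.5f); if (bm > 511) bm = 511;
        return rm | (gm << 9) | (bm << 18) | ((uint32_t)shared_exp << 27);
    }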

    For files, currently ignoring EXR, but this is typically similar tech to
    the TGA format in most cases (raw floats, or maybe with RLE, very
    bulky). There are other options, but when I encountered EXR images in
    the past, they were being used basically like the TGA format.


    For a format like my UPIC design, could likely (in theory) handle
    components of up to around 14 bits. Problem becomes the range of
    quantizer values, where at high bit-depths an 8-bit quantization table
    value may be no longer sufficient.

    In this case, the limiting factor is that A-B needs to stay within int16
    range (both the internal buffers and coefficient encoding maxes out at
    int16 range).

    For T.81 JPEG, there are rarely used variants that have 10- and 12-bit components (where JPEG has a lot of the same basic issues).
    Though, a lot of what people assume are the limits of T.81 JPEG, are
    actually the limits of JFIF.


    With either format, using 12 bits makes sense, as this isn't too far
    outside the range of the 8-bit quantization values (it mostly sets a limit
    to how low a quality 0% can achieve; though it likely does mean scaling
    the quantizer values by 8x vs whatever they would be for that quality
    level with LDR, and clamping them between 1 and 255).


    So, one possibility could be, say:
    Image can represent values as 12 bits: E5.M7

    Or, maybe allow negative components as well, likely in ones' complement
    form. Though, this would be unusual if using JPEG as a base as they tend
    not to use negative components even if nothing in the design of the
    format necessarily prevents the use of negative components.

    Depending on needs, could be decoded as Binary16 or as one of the other formats.

    Though, another option is to just store the images with 8-bit E4.M4
    components (so, from the codec's POV, it is the same as with an LDR image).



    Then again, someone might want lossless Binary16, but my UPIC format
    couldn't do this as-is, since doing so would exceed current value ranges.

    I would likely need to hack the VLC scheme to allow for larger coefficients.

    As-is, table looks like (V prefix, extra bits, unsigned range)
    0/ 1, 0, 0.. 1 2/ 3, 0, 2.. 3
    4/ 5, 1, 4.. 7 6/ 7, 2, 8.. 15
    8/ 9, 3, 16.. 31 10/11, 4, 32.. 63
    12/13, 5, 64.. 127 14/15, 6, 128.. 255
    16/17, 7, 256.. 511 18/19, 8, 512.. 1023
    20/21, 9, 1024.. 2047 22/23, 10, 2048.. 4095
    24/25, 11, 4096.. 8191 26/27, 12, 8192..16383
    28/29, 13, 16384..32767 30/31, 14, 32768..65535

    So, with the zigzag folding, this expresses a 16-bit range.
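
    Reading the table back, a decode sketch in C might look like the
    following. It assumes the V prefix has already been decoded to a 5-bit
    symbol value (how it is entropy-coded is a separate matter), read_bits()
    is a stand-in for whatever bit reader is actually used, and the final
    zigzag unfolding (low bit as sign) is only an assumed mapping:

    #include <stdint.h>

    extern unsigned read_bits(int n);   /* hypothetical bitstream reader */

    static int32_t vlc_to_coef(unsigned sym /* 0..31, already decoded */) {
        uint32_t uval;
        if (sym < 4) {
            uval = sym;                          /* 0..3 encoded directly */
        } else {
            int extra = (int)(sym >> 1) - 1;     /* extra-bits column */
            uint32_t base = (2u + (sym & 1u)) << extra;
            uval = base + read_bits(extra);      /* refine within the range */
        }
        /* zigzag fold back to signed (assumed mapping) */
        return (int32_t)(uval >> 1) ^ -(int32_t)(uval & 1);
    }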

    Both the Block-Haar and RCT effectively cost 1 bit of dynamic range,
    meaning that, as-is, the widest allowed component is 14 bits (signed
    range).

    Though, one possibility would be hacking the upper end of the table (not
    otherwise used for LDR images) to use a steeper step with a 16-bit
    component range, say:
    24, 12, 4096.. 8191
    25, 13, 8192.. 16383
    26, 14, 16384.. 32767
    27, 15, 32768.. 65535
    28, 16, 65536.. 131071
    29, 17, 131072.. 262143
    30, 18, 262144.. 524287
    31, 19, 524288..1048575

    Which (if using 32-bits for transform coefficients) would exceed the
    dynamic range needed for 16-bit coefficients (roughly +/- 262144 if unbalanced).

    Might need to define a special case for 16-bit quantization tables to
    allow for effective lossy compression though. Most naive option is that,
    if the quantization table has 128 bytes of payload (vs 64) it is assumed
    to use 16-bit components.


    Well, and then one can debate whether RCT, Haar, etc, are still the best options. Well, and (if 12 bit components were used), how the VLC scheme
    would be understood (or if Binary16 would effectively preclude such a
    12-bit encoding scheme as redundant).


    May or may not have a use-case for such a thing, TBD.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Oct 19 07:55:57 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    It is possible that the LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to choose
    64-bit integers at build time.

    although I'd
    expect that even at the API level there were at least small additions,
    if not changes. But if you are right that the LAPACK implementation was
    not updated in decades then you could be sure that it is either not
    used by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R.

    Are Python (numpy and scipy, I suppose) or R linked against an
    implementation of LAPACK from 40 or 30 years ago, as suggested by Mitch?

    No, they don't (as I learned). They would cut themselves off
    from all the improvements and bug fixes since then.

    Somehow, I don't believe it.
    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave uses relatively new implementations, and pretty sure that the same goes
    for Matlab.

    I would be surprised otherwise.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    Higher-level algos I am interested in are mostly our own inventions.
    I can look, of course, but the chances that they are present in LAPACK
    are very low.
    In fact, even BLAS L3 I don't use all that often (and the lower levels
    of BLAS never).
    Not because the APIs do not match my needs. They typically do. But
    because standard implementations are optimized for big or huge matrices.
    My needs are medium matrices. A lot of medium matrices.
    My own implementations of standard algorithms for medium-sized
    matrices, most importantly of Cholesky decomposition, tend to be much
    faster than those in OTS BLAS libraries. And preparation of my own
    didn't take a lot of time. After all, those are simple algorithms.

    For the same reason, I implemented unrolling of MATMUL for small
    matrices in gfortran a few years ago. If all you are doing are
    small matrices (especially of constant size), the compiler can
    do a better job from a straight loop. By the time the optimized
    matmul routines have started up their machinery, the calculation
    is already done.

    I had to be careful about benchmarking, though. I had to hide from the
    compiler the fact that I was not actually using the results, otherwise
    I got extremely fast execution times for what was essentially a no-op.
    My standard method now is to select a pair of array indices where the
    compiler cannot see them (read from a string) and then write out a
    single element at that position, also to a string.
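
    (A rough C analogue of that trick, for illustration only -- the original
    is Fortran, and the names here are made up: the indices come in through a
    string the optimizer cannot see through, and one element of the result is
    formatted back out so the computation cannot be discarded as dead code.)

    #include <stdio.h>

    /* c is the n x n result of the matmul being timed, row-major */
    void keep_result_alive(const double *c, int n, const char *idx_text) {
        int i = 0, j = 0;
        if (sscanf(idx_text, "%d %d", &i, &j) == 2 &&
            i >= 0 && i < n && j >= 0 && j < n) {
            char buf[64];
            snprintf(buf, sizeof buf, "%g", c[(size_t)i * n + j]);
            fputs(buf, stderr);
        }
    }
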
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Oct 19 16:52:12 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    Speaking of Cray, the US Mint are issuing some new $1 coins featuring
    various famous persons/things, and one of them has a depiction of the
    Cray-1 on it.

    From the photo I’ve seen, it’s an overhead view, looking like a
    stylized letter C. So I wonder, even with the accompanying legend “CRAY-1 SUPERCOMPUTER”, how many people will realize that’s actually a
    picture of the computer?

    <https://www.tomshardware.com/tech-industry/new-us-usd1-coins-to-feature-steve-jobs-and-cray-1-supercomputer-us-mints-2026-american-innovation-program-to-memorialize-computing-history>
    My guess: Well below 0.1% unless they get told what it is.
    It was not obvious to me, and I have sat on the Cray bench several
    times, both in Trondheim (in active use at the time) and in the Computer
    History Museum in Silicon Valley many years later. (Maybe the latter is a
    faulty recollection, and I only got to look at it at that time? It was
    during a private showing of the collection.)
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 19 19:31:50 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    It is unlikely that monitors will ever get much beyond 11-bits of pixel depth per color.

    I do not understand why a monitor would go beyond 9 bits. Most people
    can't see beyond 7 or 8 bits of color component depth. Keeping the
    component depth at 10 bits or less allows colors to fit into 32 bits.

    My point was that there is a physical limit on how finely one can
    control the illumination of a colored pixel--and that limit is around
    11 bits. Just like there is a limit on how good one can make an A/D
    converter, which is around 22 bits.

    I did not imply that a person could SEE that fine a granularity, just
    that one could build a screen that had that fine a granularity.

    Bits beyond 8 would be for some sea creatures or viewable with special glasses?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 19 19:37:03 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Lapack's basics have not changed, but it is still actively maintained,
    with errors being fixed and new features added.

    If you look at the most recent major release, you will see that a lot
    is going on: https://www.netlib.org/lapack/lapack-3.12.0.html
    One important thing seems to be changes to 64-bit integers.

    And I love changes like

    - B = BB*CS + DD*SN
    - C = -AA*SN + CC*CS
    + B = ( BB*CS ) + ( DD*SN )
    + C = -( AA*SN ) + ( CC*CS )

    which makes sure that compilers don't emit FMA instructions and
    change rounding (which, apparently, reduced accuracy enormously
    for one routine).

    FFT is sensitive to NOT using FMAC--that is, the error across
    butterflies is lower with FMUL, FMUL and FADD than with FMUL, FMAC.
    This has to do with distributing the error evenly, whereas FMAC
    makes one of the calculations better.

    (According to the Fortran standard, the compiler has to honor
    parentheses).
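
    (Side note, hedged: in C, unlike Fortran, parentheses alone do not forbid
    contraction into FMAs; the standard knob is the FP_CONTRACT pragma, or a
    compiler flag such as -ffp-contract=off. A tiny sketch, not taken from
    the LAPACK sources:)

    #include <math.h>

    double rot_b(double bb, double cs, double dd, double sn) {
    #pragma STDC FP_CONTRACT OFF   /* ask the compiler not to fuse a*b+c into an FMA */
        return (bb * cs) + (dd * sn);
    }
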
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 19 19:42:35 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Sun Oct 19 18:07:10 2025
    From Newsgroup: comp.arch

    On Sun, 19 Oct 2025 19:42:35 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    Michael S <already5chosen@yahoo.com> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???

    With repeated use hammers become brittle. A 30yo hammer is more likely
    to crack and/or chip than is a new one.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 20 08:57:42 2025
    From Newsgroup: comp.arch

    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed. But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 20 11:06:08 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.
    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.
    It worked on MMX/SSE/SSE2/Altivec.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 20 14:21:14 2025
    From Newsgroup: comp.arch

    On 10/20/2025 4:06 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which
    was a nice bonus - especially if you wanted more than CD quality audio.

    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.

    It worked on MMX/SSE/SSE2/Altivec.


    Yeah. Audio is fun...


    But MP3 and Vorbis have the odd property of either sounding really good
    (at high bitrates) or terrible (at lower bitrates, particularly if used
    for something with variable playback speed).

    Seems to be a general issue with audio codecs built from a similar sort
    of block-transform approach (such as MDCT or WHT).


    In some of my own experiments in a similar area, I had used WHT, but
    didn't get quite as good results. One problem seems to be that there
    is a sort of big issue with frequencies near the block size, which
    result in nasty artifacts. The overlapping blocks and windowing of MDCT
    reduce this issue, but as noted, MDCT has a high computational cost (vs
    Haar or WHT).

    I have yet to come up with something in this category that gives
    satisfactory results (cheap, simple, effective, and passable quality).


    Can also note: ADPCM works OK.

    Can get better results IMO at bitrates lower than where MP3 or Vorbis
    are effective.

    Near the lower end:
    16kHz 2-bit ADPCM: OK, 32kbps
    11kHz 2-bit ADPCM: meh, 22kbps
    8kHz 4-bit ADPCM: Weak, 32kbps
    8kHz 2-bit ADPCM: poor, 16kbps


    Getting OK results at 2 bits/sample requires a different approach from
    what works well at 4 bits: namely, rather than encoding one sample at a
    time, it is usually necessary to encode a block of samples at a time and
    then search the entire possibility space. Trying to encode samples one
    at a time gives poor results. This makes 2-bit encoding slower and more
    complicated than 4-bit encoding (but the decoder can still be fast).
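
    To make the "search the entire possibility space" part concrete, a toy
    sketch in C: an exhaustive search over all 2-bit code vectors for one
    block, against a deliberately simplified IMA-style predictor. The step
    table, adaptation rule, and block size here are just placeholders, not
    what was actually used.

    #include <stdint.h>
    #include <limits.h>

    #define BLK 8   /* 8 samples/block -> 4^8 = 65536 candidate code vectors */

    typedef struct { int pred; int step; } AdpcmState;

    /* decode one 2-bit code: scaled delta plus crude step adaptation */
    static int decode2(AdpcmState *st, int code) {
        static const int scale[4] = { -8, -2, 2, 8 };   /* illustrative */
        static const int adapt[4] = { -1, -1, 1, 1 };   /* illustrative */
        int s = st->pred + (st->step * scale[code]) / 8;
        if (s > 32767) s = 32767; else if (s < -32768) s = -32768;
        st->pred = s;
        st->step += adapt[code] * (st->step / 4);
        if (st->step < 1) st->step = 1;
        return s;
    }

    /* try every code vector, keep the one with least squared error */
    static uint16_t encode_block(const int16_t *in, AdpcmState *st) {
        uint16_t best = 0;
        long long best_err = LLONG_MAX;
        for (uint32_t cand = 0; cand < (1u << (2 * BLK)); cand++) {
            AdpcmState trial = *st;
            long long err = 0;
            for (int i = 0; i < BLK; i++) {
                long long d = decode2(&trial, (cand >> (2 * i)) & 3) - in[i];
                err += d * d;
            }
            if (err < best_err) { best_err = err; best = (uint16_t)cand; }
        }
        for (int i = 0; i < BLK; i++)        /* advance the real state */
            decode2(st, (best >> (2 * i)) & 3);
        return best;
    }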

    As noted, ADPCM proper does not work below 2 bits/sample.

    The added accuracy of 4-bit samples is not an advantage in this case
    since the reduction in sample rate has a more obvious negative impact here.


    After trying a few experiments, the current front-runner for going lower is: Encode a group of 8 or 16 samples as an 8-bit index into a table of
    patterns (such as groups of 2-bit ADPCM samples);
    This can achieve 1.0 or 0.5 bits/sample.

    Have yet to get anything with particularly acceptable audio quality though.

    Did end up resorting to using genetic algorithms for building the
    pattern tables for these experiments. I did previously experiment with
    an interpolation pattern table, but this gave worse results.


    One other line of experimentation was trying to fudge the ADPCM encoding algorithm to preferentially try to generate repeating patterns over
    novel ones with the aim of making it more compressible with LZ77.

    However, it was difficult to significantly improve LZ compressibility
    while still maintaining some semblance of audio quality. Neither
    byte-oriented LZ (e.g., LZ4) nor Deflate was particularly effective.


    Did note however that both LZMA and an LZMA style bitwise range encoder
    were much more effective (particularly with 12 or 16 bits of context).

    However, a range encoder is near the upper end of computational
    feasibility (and using a range encoder to squeeze bits out of ADPCM
    seems kinda absurd).


    One intermediate option seems to be a permutation transform. This can
    make the data more amenable to STF+AdRice or Huffman.

    Say, a 2-bit permutation transform is possible (though, in this case one
    can represent every permutation as a 5-bit finite state machine, stored
    as bytes in RAM for convenience). This does have the nice property that
    one can use an 8-bit table lookup for each context, which then produces 2
    bits of output at a time.

    Say:
    hist: 8 bits of history
    ival: input, 4x 2-bits
    oval: output, 4x 2-bits, permuted

    px1=permstate[hist];              /* current state byte for this context */
    ix=((ival>>0)&0x03);              /* next 2-bit input symbol */
    px2=permupdtab[(px1&0xFC)|ix];    /* lookup: new state, low 2 bits = output */
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=px2&3;                       /* permuted 2-bit output */

    px1=permstate[hist];
    ix=((ival>>2)&0x03);
    px2=permupdtab[(px1&0xFC)|ix];
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=oval|((px2&3)<<2);
    ...

    Decoding process is similar

    One downside of this is that they are still about as slow as using the
    bitwise range-coder would have been.


    Also, still doesn't really allow breaking into sub 10 kbps territory
    without a loss of quality. The use of pattern tables allows breaking
    into this territory with a similar loss of quality, and at a lower computational cost.

    Though, it seems possible that the permutation transform could be
    directly integrated with the ADPCM decoder (in effect turning it into
    more of a predictive transform); still wouldn't do much for speed, but
    alas. Would also still need an entropy coder to make use of this.



    One other route seems to be sinewave synthesis, say:
    Pick the top 4 sine waves via some strategy;
    Encode the frequency and amplitude (needs ~ 16 bits IME);
    Do this ~ 100-128 times per second.
    100Hz seems to be a lower limit for intelligibility.

    This needs ~ 6.4 to 8.2 kbps, or 7.2 to 9.2 kbps if one also includes a
    byte to encode a white noise intensity.

    I had best results by taking the space from 2 to 8 kHz, dividing them
    into ~ 1/3 octaves, picking the strongest wave from each group, and then picking the top 4 strongest waves. Worked better for me to ignore lower frequencies (low frequencies seem to contain a lot of louder wave-forms,
    but which contribute little to intelligibility). In this case, waves
    between 2 and 4 kHz tend to dominate.
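
    The decode side of such a scheme is cheap; a minimal sketch in C, where
    the frame size, sample rate, and field layout are just placeholders for
    illustration rather than what the actual experiment used:

    #include <math.h>
    #include <stdint.h>

    #define RATE   16000            /* output sample rate (assumed) */
    #define WAVES  4                /* partials per frame */
    #define FRAME  (RATE / 100)     /* 100 frames per second */

    typedef struct { float freq_hz; float amp; } Partial;

    /* synthesize one frame; phase[] persists across frames so the
       partials stay continuous at frame boundaries */
    static void synth_frame(const Partial p[WAVES], float phase[WAVES],
                            int16_t out[FRAME]) {
        const float two_pi = 6.2831853f;
        for (int i = 0; i < FRAME; i++) {
            float s = 0.0f;
            for (int w = 0; w < WAVES; w++) {
                s += p[w].amp * sinf(phase[w]);
                phase[w] += two_pi * p[w].freq_hz / RATE;
                if (phase[w] > two_pi) phase[w] -= two_pi;
            }
            if (s > 1.0f) s = 1.0f; else if (s < -1.0f) s = -1.0f;
            out[i] = (int16_t)(s * 32767.0f);
        }
    }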

    Works OK for speech, but is poor for non-speech audio.
    Quality can be improved by more waves, but this quickly eats any bitrate advantage.
    Can note that while called sinewave synthesis, I also got good results
    with 3-state waves (-1, 0, 1), which are computationally preferable (wave-shape is: 1,0,-1,0).

    Can note that when used for non-speech, sinewave synthesis can have
    similar artifacts to low bitrate MP3.

    Could be pushed to lower update rates and maybe could make sense for
    basic songs (say, as a possible alternative to MIDI; which is arguably a somewhat more complex technology).

    Though, can note that for some older systems, sound effects were stored
    as variable-frequency square waves (say, for example, updating the
    square-wave frequency at 18 Hz or similar, with each frequency stored as
    a 16-bit clock-divider value or similar); along with some use of
    Delta-Sigma audio (where low-frequency delta-sigma sounds terrible).
    Neither are particularly good though.


    Though, for general audio storage (such as sound effects), some sort of
    ADPCM variant still seems preferable here.

    Though, still not yet found anything that is clearly beating 2-bit ADPCM
    for this (seemingly still a good option for sound effects).

    And, as noted, could still get good results with ADPCM + LZMA (or
    similar), main issue being the high computational cost of the latter.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Oct 24 04:10:03 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> wrote:
    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    George Neuner <gneuner2@comcast.net> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    Lawrence D’Oliveiro <ldo@nz.invalid> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still
    software projects that fail, miss their deadlines and have
    overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out
    how to make the (17 kinds of) hammers one needs, there is little
    need to make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have
    been happier... The mouse was more precise in W7 than in W8 ...
    With a little upgrade for new PCIe architecture along the way
    rather than redesigning whole kit and caboodle for tablets and
    phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998,
    ... and few people would have cared. Many SW projects are driven
    not by demand for the product, but pushed by companies to make
    already satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects
    rather than redesigning the same old thing 8 more times. Presto,
    there are now enough well-trained SW engineers to tackle the undone
    SW backlog.

    The problem is that decades of "New & Improved" consumer products
    have conditioned the public to expect innovation (at minimum new
    packaging and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by
    many current developers due to concern that the project has been
    abandoned. That perception likely would not change even if the
    author(s) responded to inquiries, the library was suitable "as is"
    for the intended use, and the lack of recent updates can be
    explained entirely by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.


    It is possible that the LAPACK API was not updated in decades, although I'd
    expect that even at the API level there were at least small additions, if
    not changes. But if you are right that the LAPACK implementation was not
    updated in decades then you could be sure that it is either not used by
    anybody or used by very few people.

    AFAICS at the logical level the interface stays the same. There is a
    significant change: in old times you were on your own trying to interface
    to Lapack from C. Now you can get a C interface.

    Concerning the implementation, AFAICS there are changes. Some
    improvements to accuracy, some to speed. But the bulk of the code
    stays the same. There is a lot of work on the lower layer, that
    is, BLAS. But the idea of Lapack was that the higher-level algorithms
    are portable (also in time), while the lower-level building blocks
    must be adapted to a changing computing environment.

    There were attempts to replace Lapack with C++ templates; I do not
    see this gaining traction. There were attempts to extend Lapack
    to a larger class of matrices (mostly sparse matrices); apparently
    this is less popular than Lapack.

    There are attempts to automatically convert a simple high-level
    description of operations into high-performance code. IIUC
    this has had some success with FFT and a few similar things, but
    currently it is unable to replace Lapack.

    I would say the following: if you have a good algorithm, this
    algorithm may live long. Sometimes better things are invented
    later, but if not, then the old algorithm may be used for quite a
    long time. The goal of algorithmic languages was to make portable
    implementations of algorithms. That works reasonably well, but if
    one aims at the highest possible speed, then the needed tweaks
    frequently are machine specific, so good performance may be nonportable.
    In the case of Lapack, it seems that there are no better algorithms
    now compared to the time when Lapack was created. Performance of
    Lapack on larger matrices depends mostly on the performance of
    BLAS, so there is a lot of current work on BLAS. IIUC sometimes
    Lapack routines are replaced by better-performing versions,
    but most of the time the gain is too small to justify the effort.

    Concerning "being used by few people": there are codes which
    are sold to a lot of users were performance or features
    matter a lot, such codes tend to evolve quickly. More
    typical is growth by adding new parts: old parts are kept
    with small changes, but new things are build on it (and
    new things independent of old thing are added). There is
    also popular "copy and mutate" approach: some parts are
    copied and them modified to provide different function
    (examples of this are drivers in an OS or new frontends
    in a compiler). However, this is partially weakness of
    programming language (it would be nicer to have clearly
    specified common part and concise specification of
    differences needed for various cases). Partly this is
    messy nature of real world. Lapack is a happly case
    when problem was quite well specified and language
    was reasonable fit for the problem. They use textual
    substitution to produce real and complex variants
    for single and double precision, so in principle
    language could do more. And certainly one could wish
    nicer and more compact description of the algorithms.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Oct 24 05:56:08 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    AFAICS at the logical level the interface stays the same. There is a
    significant change: in old times you were on your own trying to interface
    to Lapack from C. Now you can get a C interface.

    And they got that wrong (by which I was personally bitten).
    See https://lwn.net/Articles/791393/ for a good write-up.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2