• Re: Crisis? What Crisis?

    From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:08:58 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC), Thomas Koenig wrote:

    [LAPACK] is certainly in use by very many people, if indirectly, for
    example by Python or R.

    Certainly used by NumPy:

    ldo@theon:~> apt-cache depends python3-numpy
    python3-numpy
    ...
    |Depends: libblas3
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:11:37 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 23:11:38 +0300, Michael S wrote:

    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave
    uses relatively new implementations, and pretty sure that the same
    goes for Matlab.

    On my system, Octave uses exactly the same version of LAPACK as NumPy
    does:

    ldo@theon:~> apt-cache depends octave
    octave
    ...
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:17:19 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 19 01:20:16 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 22:22:32 -0000 (UTC), Waldek Hebisch wrote:

    In many cases one can enlarge data structures to multiple of SIMD vector
    size (and align them properly). There requires some extra code, but mot
    too much and all of it is outside inner loop. So, there is some waste,
    but rather small due to unused elements.

    Of course, there is still trouble due to different SIMD vector size
    and/or different SIMD instructions sets.

    Just so long as you keep such optimized data structures *internal* to the program, and don’t make them part of any public interchange format!

    Interchange formats tend to outlive the original technological milieu they were created in, and decisions made for the sake of technical limitations
    of the time can end up looking rather ... anachronistic ... just a few
    years down the track.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Oct 19 01:56:03 2025
    From Newsgroup: comp.arch

    On 10/18/2025 10:16 AM, David Brown wrote:
    On 18/10/2025 03:05, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at
    tandem
    OLEDs.


    Like many things, human perception of brightness is not linear - it is somewhat logarithmic.  So even though we might not be able to
    distinguish anywhere close to 2000 different nuances of one primary
    colour, we /can/ perceive a very wide dynamic range.  Having a large
    number of bits on a linear scale can be more convenient in practice than trying to get accurate non-linear scaling.


    Possible, but it is a question whether high bit depth would make much
    difference. We are still in a case where HDMI usually sends 8 or
    sometimes 10 bits per channel, but displays are generally limited to 5
    or 6 bits (and may then dither stuff on the display side).


    Then we have:
    Traditional LCD: Uses a fluorescent backlight;
    LED: Typically LCD + LED backlights;
    OLED: Panel itself uses LEDs
    Typically much more expensive;
    Notoriously short lifespan.

    I have a display, LED+LCD tech, it has an HDR mode, but it isn't great.
    As noted, it seems like it mostly turns up the brightness and uses image processing wonk (which adds a bunch of artifacts).

    And, if I wanted 25% brighter, I could turn the brightness setting from
    40 to 50 or similar (checks, current settings being 40% brightness, 60% contrast).



    Then, we have HDR in 3D rendering which is, as noted, not usually about
    the monitor, but about using floating-point for rendering (typically
    with LDR for the final output).

    Often it still makes sense to use LDR for textures, but then HDR for the framebuffer (since the HDR is usually more a product of the lighting
    than the materials).

    Binary16 is plenty of precision for framebuffer.
    Though, often FP8U (E4.M4) is likely to still be acceptable.

    Where:
    E3.M5: Not really enough dynamic range.
    E4.M4: OK (Comparable to RGB555)
    E5.M3: Image quality is poor (worse than RGB555).

    We usually give up sign with smaller formats, assuming that any values
    which would go negative are clamped to 0, as it is harder in this case
    to justify spending a bit on being able to represent negative colors.

    For native Binary16, may as well allow negatives.
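
    As a rough illustration of the E4.M4 idea (unsigned, 4 exponent bits, 4
    mantissa bits, negatives clamped to zero), a minimal pack/unpack sketch in
    C; the bias and rounding choices here are just placeholders for
    illustration, not a description of any particular codec:

    #include <stdint.h>
    #include <math.h>

    static uint8_t pack_fp8u_e4m4(float v) {
        if (v <= 0.0f) return 0;                 /* clamp negatives to zero */
        int e;
        float m = frexpf(v, &e);                 /* v = m * 2^e, m in [0.5,1) */
        int exp = e + 6;                         /* assumed bias */
        if (exp <= 0) return 0;                  /* underflow -> zero */
        if (exp > 15) return 0xFF;               /* overflow -> max */
        int mant = (int)((m * 2.0f - 1.0f) * 16.0f + 0.5f);  /* drop implicit 1 */
        if (mant > 15) { mant = 0; if (++exp > 15) return 0xFF; }
        return (uint8_t)((exp << 4) | mant);
    }

    static float unpack_fp8u_e4m4(uint8_t b) {
        int exp = b >> 4, mant = b & 15;
        if (exp == 0) return 0.0f;               /* treat exponent 0 as zero */
        return ldexpf(1.0f + mant / 16.0f, exp - 7);
    }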



    There is a question of the best way to store HDR images:
    4x FP16: High quality, but expensive
    4x FP8U: More affordable, can do RGBA
    RGB8_E8: good for opaque images, works OK.
    RGB8_EA4: OK, non-standard.
    RGB9_E5: Good for opaque images
    RG11_B10: E5.M6 | E5.M5
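
    Of the options above, RGB9_E5 is the one with a fixed public layout (three
    9-bit mantissas sharing one 5-bit exponent, bias 15). A simplified packing
    sketch in C, not a drop-in for the exact OpenGL-specified conversion:

    #include <stdint.h>
    #include <math.h>

    static uint32_t pack_rgb9e5(float r, float g, float b) {
        const float max_val = 65408.0f;          /* (511/512) * 2^16 */
        if (r < 0.0f) r = 0.0f;
        if (g < 0.0f) g = 0.0f;
        if (b < 0.0f) b = 0.0f;
        float m = r > g ? r : g; if (b > m) m = b;
        if (m <= 0.0f) return 0;
        if (m > max_val) m = max_val;

        int e;
        frexpf(m, &e);                           /* only the exponent is needed */
        int shared_exp = e + 15;                 /* stored exponent, bias 15 */
        if (shared_exp < 0) shared_exp = 0;
        if (shared_exp > 31) shared_exp = 31;

        /* mantissa = value * 2^(9 + 15 - shared_exp) */
        float scale = ldexpf(1.0f, 24 - shared_exp);
        uint32_t rm = (uint32_t)(r * scale + 0.5f); if (rm > 511) rm = 511;
        uint32_t gm = (uint32_t)(g * scale + 0.5f); if (gm > 511) gm = 511;
        uint32_t bm = (uint32_t)(b * scale + 0.5f); if (bm > 511) bm = 511;
        return rm | (gm << 9) | (bm << 18) | ((uint32_t)shared_exp << 27);
    }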

    For files, currently ignoring EXR, but this is typically similar tech to
    the TGA format in most cases (raw floats, or maybe with RLE, very
    bulky). There are other options, but when I encountered EXR images in
    the past, they were being used basically like the TGA format.


    For a format like my UPIC design, could likely (in theory) handle
    components of up to around 14 bits. Problem becomes the range of
    quantizer values, where at high bit-depths an 8-bit quantization table
    value may be no longer sufficient.

    In this case, the limiting factor is that A-B needs to stay within int16
    range (both the internal buffers and coefficient encoding maxes out at
    int16 range).

    For T.81 JPEG, there are rarely used variants that have 10- and 12-bit components (where JPEG has a lot of the same basic issues).
    Though, a lot of what people assume are the limits of T.81 JPEG, are
    actually the limits of JFIF.


    With either format, using 12 bits makes sense, as this isn't too far
    outside the range of the 8-bit quantization values (it mostly sets a limit
    to how low a quality 0% can achieve; though it likely does mean scaling
    the quantizer values by 8x vs whatever they would be for that quality
    level with LDR, and clamping them between 1 and 255).


    So, one possibility could be, say:
    Image can represent values as 12 bits: E5.M7

    Or, maybe allow negative components as well, likely in ones' complement
    form. Though, this would be unusual if using JPEG as a base as they tend
    not to use negative components even if nothing in the design of the
    format necessarily prevents the use of negative components.

    Depending on needs, could be decoded as Binary16 or as one of the other formats.

    Though, another option is to just store the images with 8-bit E4.M4
    components (so, from the codec's POV, it is the same as with an LDR image).



    Then again, someone might want lossless Binary16, but my UPIC format
    couldn't do this as-is, since doing so would exceed current value ranges.

    I would likely need to hack the VLC scheme to allow for larger coefficients.

    As-is, table looks like (V prefix, extra bits, unsigned range)
    0/ 1, 0, 0.. 1 2/ 3, 0, 2.. 3
    4/ 5, 1, 4.. 7 6/ 7, 2, 8.. 15
    8/ 9, 3, 16.. 31 10/11, 4, 32.. 63
    12/13, 5, 64.. 127 14/15, 6, 128.. 255
    16/17, 7, 256.. 511 18/19, 8, 512.. 1023
    20/21, 9, 1024.. 2047 22/23, 10, 2048.. 4095
    24/25, 11, 4096.. 8191 26/27, 12, 8192..16383
    28/29, 13, 16384..32767 30/31, 14, 32768..65535

    So, with the zigzag folding, this expresses a 16-bit range.
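
    Reading the table back, a decode sketch in C might look like the
    following. It assumes the V prefix has already been decoded to a 5-bit
    symbol value (how it is entropy-coded is a separate matter), read_bits()
    is a stand-in for whatever bit reader is actually used, and the final
    zigzag unfolding (low bit as sign) is only an assumed mapping:

    #include <stdint.h>

    extern unsigned read_bits(int n);   /* hypothetical bitstream reader */

    static int32_t vlc_to_coef(unsigned sym /* 0..31, already decoded */) {
        uint32_t uval;
        if (sym < 4) {
            uval = sym;                          /* 0..3 encoded directly */
        } else {
            int extra = (int)(sym >> 1) - 1;     /* extra-bits column */
            uint32_t base = (2u + (sym & 1u)) << extra;
            uval = base + read_bits(extra);      /* refine within the range */
        }
        /* zigzag fold back to signed (assumed mapping) */
        return (int32_t)(uval >> 1) ^ -(int32_t)(uval & 1);
    }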

    Both the Block-Haar and RCT effectively cost 1 bit of dynamic range,
    meaning that, as-is, the widest allowed component is 14 bits (signed
    range).

    Though, one possibility would be hacking the upper end of the table (not
    otherwise used for LDR images) to use a steeper step with a 16-bit
    component range, say:
    24, 12, 4096.. 8191
    25, 13, 8192.. 16383
    26, 14, 16384.. 32767
    27, 15, 32768.. 65535
    28, 16, 65536.. 131071
    29, 17, 131072.. 262143
    30, 18, 262144.. 524287
    31, 19, 524288..1048575

    Which (if using 32-bits for transform coefficients) would exceed the
    dynamic range needed for 16-bit coefficients (roughly +/- 262144 if unbalanced).

    Might need to define a special case for 16-bit quantization tables to
    allow for effective lossy compression though. Most naive option is that,
    if the quantization table has 128 bytes of payload (vs 64) it is assumed
    to use 16-bit components.


    Well, and then one can debate whether RCT, Haar, etc, are still the best options. Well, and (if 12 bit components were used), how the VLC scheme
    would be understood (or if Binary16 would effectively preclude such a
    12-bit encoding scheme as redundant).


    May or may not have a use-case for such a thing, TBD.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Oct 19 07:55:57 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    It is possible that the LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to choose
    64-bit integers at build time.

    although I'd
    expect that even at the API level there were at least small additions,
    if not changes. But if you are right that the LAPACK implementation was
    not updated in decades then you could be sure that it is either not
    used by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R.

    Are Python (numpy and scipy, I suppose) or R linked against an
    implementation of LAPACK from 40 or 30 years ago, as suggested by Mitch?

    No, they don't (as I learned). They would cut themselves off
    from all the improvements and bug fixes since then.

    Somehow, I don't believe it.
    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave uses relatively new implementations, and pretty sure that the same goes
    for Matlab.

    I would be surprised otherwise.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    Higher-level algos I am interested in are mostly our own inventions.
    I can look, of course, but the chances that they are present in LAPACK
    are very low.
    In fact, even BLAS L3 I don't use all that often (and the lower levels
    of BLAS never).
    Not because the APIs do not match my needs. They typically do. But
    because standard implementations are optimized for big or huge matrices.
    My needs are medium matrices. A lot of medium matrices.
    My own implementations of standard algorithms for medium-sized
    matrices, most importantly of Cholesky decomposition, tend to be much
    faster than those in OTS BLAS libraries. And preparation of my own
    didn't take a lot of time. After all, those are simple algorithms.

    For the same reason, I implemented unrolling of MATMUL for small
    matrices in gfortran a few years ago. If all you are doing are
    small matrices (especially of constant size), the compiler can
    do a better job from a straight loop. By the time the optimized
    matmul routines have started up their machinery, the calculation
    is already done.

    I had to be careful about benchmarking, though. I had to hide from the
    compiler the fact that I was not actually using the results, otherwise
    I got extremely fast execution times for what was essentially a no-op.
    My standard method now is to select a pair of array indices where the
    compiler cannot see them (read from a string) and then write out a
    single element at that position, also to a string.
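
    (A rough C analogue of that trick, for illustration only -- the original
    is Fortran, and the names here are made up: the indices come in through a
    string the optimizer cannot see through, and one element of the result is
    formatted back out so the computation cannot be discarded as dead code.)

    #include <stdio.h>

    /* c is the n x n result of the matmul being timed, row-major */
    void keep_result_alive(const double *c, int n, const char *idx_text) {
        int i = 0, j = 0;
        if (sscanf(idx_text, "%d %d", &i, &j) == 2 &&
            i >= 0 && i < n && j >= 0 && j < n) {
            char buf[64];
            snprintf(buf, sizeof buf, "%g", c[(size_t)i * n + j]);
            fputs(buf, stderr);
        }
    }
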
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Oct 19 16:52:12 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    Speaking of Cray, the US Mint are issuing some new $1 coins featuring
    various famous persons/things, and one of them has a depiction of the
    Cray-1 on it.

    From the photo I’ve seen, it’s an overhead view, looking like a
    stylized letter C. So I wonder, even with the accompanying legend “CRAY-1 SUPERCOMPUTER”, how many people will realize that’s actually a
    picture of the computer?

    <https://www.tomshardware.com/tech-industry/new-us-usd1-coins-to-feature-steve-jobs-and-cray-1-supercomputer-us-mints-2026-american-innovation-program-to-memorialize-computing-history>
    My guess: Well below 0.1% unless they get told what it is.
    It was not obvious to me, and I have sat on the Cray bench several
    times, both in Trondheim (in active use at the time) and in the Computer
    History Museum in Silicon Valley many years later. (Maybe the latter is a
    faulty recollection, and I only got to look at it at that time? It was
    during a private showing of the collection.)
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 19 19:31:50 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    It is unlikely that monitors will ever get much beyond 11-bits of pixel depth per color.

    I do not understand why a monitor would go beyond 9 bits. Most people
    can't see beyond 7 or 8 bits of color component depth. Keeping the
    component depth at 10 bits or less allows colors to fit into 32 bits.

    My point was that there is a physical limit on how finely one can
    control the illumination of a colored pixel--and that limit is around
    11 bits. Just like there is a limit on how good one can make an A/D
    converter, which is around 22 bits.

    I did not imply that a person could SEE that fine a granularity, just
    that one could build a screen that had that fine a granularity.

    Bits beyond 8 would be for some sea creatures or viewable with special glasses?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 19 19:37:03 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Lapack's basics have not changed, but it is still actively maintained,
    with errors being fixed and new features added.

    If you look at the most recent major release, you will see that a lot
    is going on: https://www.netlib.org/lapack/lapack-3.12.0.html
    One important thing seems to be changes to 64-bit integers.

    And I love changes like

    - B = BB*CS + DD*SN
    - C = -AA*SN + CC*CS
    + B = ( BB*CS ) + ( DD*SN )
    + C = -( AA*SN ) + ( CC*CS )

    which makes sure that compilers don't emit FMA instructions and
    change rounding (which, apparently, reduced accuracy enormously
    for one routine).

    FFT is sensitive to NOT using FMAC--that is, the error across
    butterflies is lower with FMUL, FMUL and FADD than with FMUL, FMAC.
    This has to do with distributing the error evenly, whereas FMAC
    makes one of the calculations better.

    (According to the Fortran standard, the compiler has to honor
    parentheses).
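
    (Side note, hedged: in C, unlike Fortran, parentheses alone do not forbid
    contraction into FMAs; the standard knob is the FP_CONTRACT pragma, or a
    compiler flag such as -ffp-contract=off. A tiny sketch, not taken from
    the LAPACK sources:)

    #include <math.h>

    double rot_b(double bb, double cs, double dd, double sn) {
    #pragma STDC FP_CONTRACT OFF   /* ask the compiler not to fuse a*b+c into an FMA */
        return (bb * cs) + (dd * sn);
    }
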
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 19 19:42:35 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Sun Oct 19 18:07:10 2025
    From Newsgroup: comp.arch

    On Sun, 19 Oct 2025 19:42:35 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    Michael S <already5chosen@yahoo.com> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???

    With repeated use hammers become brittle. A 30yo hammer is more likely
    to crack and/or chip than is a new one.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 20 08:57:42 2025
    From Newsgroup: comp.arch

    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed. But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 20 11:06:08 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.
    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.
    It worked on MMX/SSE/SSE2/Altivec.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 20 14:21:14 2025
    From Newsgroup: comp.arch

    On 10/20/2025 4:06 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which
    was a nice bonus - especially if you wanted more than CD quality audio.

    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.

    It worked on MMX/SSE/SSE2/Altivec.


    Yeah. Audio is fun...


    But MP3 and Vorbis have the odd property of either sounding really good
    (at high bitrates) or terrible (at lower bitrates, particularly if used
    for something with variable playback speed).

    Seems to be a general issue with audio codecs built from a similar sort
    of block-transform approach (such as MDCT or WHT).


    In some of my own experiments in a similar area, I had used WHT, but
    didn't get quite as good results. One problem seems to be that there
    is a sort of big issue with frequencies near the block size, which
    result in nasty artifacts. The overlapping blocks and windowing of MDCT
    reduce this issue, but as noted, MDCT has a high computational cost (vs
    Haar or WHT).

    I have yet to come up with something in this category that gives
    satisfactory results (cheap, simple, effective, and passable quality).


    Can also note: ADPCM works OK.

    Can get better results IMO at bitrates lower than where MP3 or Vorbis
    are effective.

    Near the lower end:
    16kHz 2-bit ADPCM: OK, 32kbps
    11kHz 2-bit ADPCM: meh, 22kbps
    8kHz 4-bit ADPCM: Weak, 32kbps
    8kHz 2-bit ADPCM: poor, 16kbps


    Getting OK results at 2 bits/sample requires a different approach from
    what works well at 4 bits: namely, rather than encoding one sample at a
    time, it is usually necessary to encode a block of samples at a time and
    then search the entire possibility space. Trying to encode samples one
    at a time gives poor results. This makes 2-bit encoding slower and more
    complicated than 4-bit encoding (but the decoder can still be fast).
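
    To make the "search the entire possibility space" part concrete, a toy
    sketch in C: an exhaustive search over all 2-bit code vectors for one
    block, against a deliberately simplified IMA-style predictor. The step
    table, adaptation rule, and block size here are just placeholders, not
    what was actually used.

    #include <stdint.h>
    #include <limits.h>

    #define BLK 8   /* 8 samples/block -> 4^8 = 65536 candidate code vectors */

    typedef struct { int pred; int step; } AdpcmState;

    /* decode one 2-bit code: scaled delta plus crude step adaptation */
    static int decode2(AdpcmState *st, int code) {
        static const int scale[4] = { -8, -2, 2, 8 };   /* illustrative */
        static const int adapt[4] = { -1, -1, 1, 1 };   /* illustrative */
        int s = st->pred + (st->step * scale[code]) / 8;
        if (s > 32767) s = 32767; else if (s < -32768) s = -32768;
        st->pred = s;
        st->step += adapt[code] * (st->step / 4);
        if (st->step < 1) st->step = 1;
        return s;
    }

    /* try every code vector, keep the one with least squared error */
    static uint16_t encode_block(const int16_t *in, AdpcmState *st) {
        uint16_t best = 0;
        long long best_err = LLONG_MAX;
        for (uint32_t cand = 0; cand < (1u << (2 * BLK)); cand++) {
            AdpcmState trial = *st;
            long long err = 0;
            for (int i = 0; i < BLK; i++) {
                long long d = decode2(&trial, (cand >> (2 * i)) & 3) - in[i];
                err += d * d;
            }
            if (err < best_err) { best_err = err; best = (uint16_t)cand; }
        }
        for (int i = 0; i < BLK; i++)        /* advance the real state */
            decode2(st, (best >> (2 * i)) & 3);
        return best;
    }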

    As noted, ADPCM proper does not work below 2 bits/sample.

    The added accuracy of 4-bit samples is not an advantage in this case
    since the reduction in sample rate has a more obvious negative impact here.


    After trying a few experiments, the current front-runner for going lower is: Encode a group of 8 or 16 samples as an 8-bit index into a table of
    patterns (such as groups of 2-bit ADPCM samples);
    This can achieve 1.0 or 0.5 bits/sample.

    Have yet to get anything with particularly acceptable audio quality though.

    Did end up resorting to using genetic algorithms for building the
    pattern tables for these experiments. I did previously experiment with
    an interpolation pattern table, but this gave worse results.


    One other line of experimentation was trying to fudge the ADPCM encoding algorithm to preferentially try to generate repeating patterns over
    novel ones with the aim of making it more compressible with LZ77.

    However, it was difficult to significantly improve LZ compressibility
    while still maintaining some semblance of audio quality. Neither
    byte-oriented LZ (e.g., LZ4) nor Deflate was particularly effective.


    Did note however that both LZMA and an LZMA style bitwise range encoder
    were much more effective (particularly with 12 or 16 bits of context).

    However, a range encoder is near the upper end of computational
    feasibility (and using a range encoder to squeeze bits out of ADPCM
    seems kinda absurd).


    One intermediate option seems to be a permutation transform. This can
    make the data more amenable to STF+AdRice or Huffman.

    Say, a 2-bit permutation transform is possible (though, in this case one
    can represent every permutation as a 5-bit finite state machine, stored
    as bytes in RAM for convenience). This does have the nice property that
    one can use an 8-bit table lookup for each context, which then produces 2
    bits of output at a time.

    Say:
    hist: 8 bits of history
    ival: input, 4x 2-bits
    oval: output, 4x 2-bits, permuted

    px1=permstate[hist];              /* current state byte for this context */
    ix=((ival>>0)&0x03);              /* next 2-bit input symbol */
    px2=permupdtab[(px1&0xFC)|ix];    /* lookup: new state, low 2 bits = output */
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=px2&3;                       /* permuted 2-bit output */

    px1=permstate[hist];
    ix=((ival>>2)&0x03);
    px2=permupdtab[(px1&0xFC)|ix];
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=oval|((px2&3)<<2);
    ...

    Decoding process is similar

    One downside of this is that they are still about as slow as using the
    bitwise range-coder would have been.


    Also, still doesn't really allow breaking into sub 10 kbps territory
    without a loss of quality. The use of pattern tables allows breaking
    into this territory with a similar loss of quality, and at a lower computational cost.

    Though, it seems possible that the permutation transform could be
    directly integrated with the ADPCM decoder (in effect turning it into
    more of a predictive transform); still wouldn't do much for speed, but
    alas. Would also still need an entropy coder to make use of this.



    One other route seems to be sinewave synthesis, say:
    Pick the top 4 sine waves via some strategy;
    Encode the frequency and amplitude (needs ~ 16 bits IME);
    Do this ~ 100-128 times per second.
    100Hz seems to be a lower limit for intelligibility.

    This needs ~ 6.4 to 8.2 kbps, or 7.2 to 9.2 kbps if one also includes a
    byte to encode a white noise intensity.

    I had best results by taking the space from 2 to 8 kHz, dividing them
    into ~ 1/3 octaves, picking the strongest wave from each group, and then picking the top 4 strongest waves. Worked better for me to ignore lower frequencies (low frequencies seem to contain a lot of louder wave-forms,
    but which contribute little to intelligibility). In this case, waves
    between 2 and 4 kHz tend to dominate.
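
    The decode side of such a scheme is cheap; a minimal sketch in C, where
    the frame size, sample rate, and field layout are just placeholders for
    illustration rather than what the actual experiment used:

    #include <math.h>
    #include <stdint.h>

    #define RATE   16000            /* output sample rate (assumed) */
    #define WAVES  4                /* partials per frame */
    #define FRAME  (RATE / 100)     /* 100 frames per second */

    typedef struct { float freq_hz; float amp; } Partial;

    /* synthesize one frame; phase[] persists across frames so the
       partials stay continuous at frame boundaries */
    static void synth_frame(const Partial p[WAVES], float phase[WAVES],
                            int16_t out[FRAME]) {
        const float two_pi = 6.2831853f;
        for (int i = 0; i < FRAME; i++) {
            float s = 0.0f;
            for (int w = 0; w < WAVES; w++) {
                s += p[w].amp * sinf(phase[w]);
                phase[w] += two_pi * p[w].freq_hz / RATE;
                if (phase[w] > two_pi) phase[w] -= two_pi;
            }
            if (s > 1.0f) s = 1.0f; else if (s < -1.0f) s = -1.0f;
            out[i] = (int16_t)(s * 32767.0f);
        }
    }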

    Works OK for speech, but is poor for non-speech audio.
    Quality can be improved by more waves, but this quickly eats any bitrate advantage.
    Can note that while called sinewave synthesis, I also got good results
    with 3-state waves (-1, 0, 1), which are computationally preferable (wave-shape is: 1,0,-1,0).

    Can note that when used for non-speech, sinewave synthesis can have
    similar artifacts to low bitrate MP3.

    Could be pushed to lower update rates and maybe could make sense for
    basic songs (say, as a possible alternative to MIDI; which is arguably a somewhat more complex technology).

    Though, can note that for some older systems, sound effects were stored
    as variable-frequency square waves (say, for example, updating the
    square-wave frequency at 18 Hz or similar, with each frequency stored as
    a 16-bit clock-divider value or similar); along with some use of
    Delta-Sigma audio (where low-frequency delta-sigma sounds terrible).
    Neither are particularly good though.


    Though, for general audio storage (such as sound effects), some sort of
    ADPCM variant still seems preferable here.

    Though, still not yet found anything that is clearly beating 2-bit ADPCM
    for this (seemingly still a good option for sound effects).

    And, as noted, could still get good results with ADPCM + LZMA (or
    similar), main issue being the high computational cost of the latter.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Oct 24 04:10:03 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> wrote:
    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    George Neuner <gneuner2@comcast.net> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    Lawrence D’Oliveiro <ldo@nz.invalid> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still
    software projects that fail, miss their deadlines and have
    overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out
    how to make the (17 kinds of) hammers one needs, there is little
    need to make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have
    been happier... The mouse was more precise in W7 than in W8 ...
    With a little upgrade for new PCIe architecture along the way
    rather than redesigning whole kit and caboodle for tablets and
    phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998,
    ... and few people would have cared. Many SW projects are driven
    not by demand for the product, but pushed by companies to make
    already satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects
    rather than redesigning the same old thing 8 more times. Presto,
    there are now enough well-trained SW engineers to tackle the undone
    SW backlog.

    The problem is that decades of "New & Improved" consumer products
    have conditioned the public to expect innovation (at minimum new
    packaging and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by
    many current developers due to concern that the project has been
    abandoned. That perception likely would not change even if the
    author(s) responded to inquiries, the library was suitable "as is"
    for the intended use, and the lack of recent updates can be
    explained entirely by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.


    It is possible that the LAPACK API was not updated in decades, although I'd
    expect that even at the API level there were at least small additions, if
    not changes. But if you are right that the LAPACK implementation was not
    updated in decades then you could be sure that it is either not used by
    anybody or used by very few people.

    AFAICS at the logical level the interface stays the same. There is a
    significant change: in old times you were on your own trying to interface
    to Lapack from C. Now you can get a C interface.

    Concerning the implementation, AFAICS there are changes. Some
    improvements to accuracy, some to speed. But the bulk of the code
    stays the same. There is a lot of work on the lower layer, that
    is, BLAS. But the idea of Lapack was that the higher-level algorithms
    are portable (also in time), while the lower-level building blocks
    must be adapted to a changing computing environment.

    There were attempts to replace Lapack with C++ templates; I do not
    see this gaining traction. There were attempts to extend Lapack
    to a larger class of matrices (mostly sparse matrices); apparently
    this is less popular than Lapack.

    There are attempts to automatically convert a simple high-level
    description of operations into high-performance code. IIUC
    this has had some success with FFT and a few similar things, but
    currently it is unable to replace Lapack.

    I would say the following: if you have a good algorithm, this
    algorithm may live long. Sometimes better things are invented
    later, but if not, then the old algorithm may be used for quite a
    long time. The goal of algorithmic languages was to make portable
    implementations of algorithms. That works reasonably well, but if
    one aims at the highest possible speed, then the needed tweaks
    frequently are machine specific, so good performance may be nonportable.
    In the case of Lapack, it seems that there are no better algorithms
    now compared to the time when Lapack was created. Performance of
    Lapack on larger matrices depends mostly on the performance of
    BLAS, so there is a lot of current work on BLAS. IIUC sometimes
    Lapack routines are replaced by better-performing versions,
    but most of the time the gain is too small to justify the effort.

    Concerning "being used by few people": there are codes which
    are sold to a lot of users were performance or features
    matter a lot, such codes tend to evolve quickly. More
    typical is growth by adding new parts: old parts are kept
    with small changes, but new things are build on it (and
    new things independent of old thing are added). There is
    also popular "copy and mutate" approach: some parts are
    copied and them modified to provide different function
    (examples of this are drivers in an OS or new frontends
    in a compiler). However, this is partially weakness of
    programming language (it would be nicer to have clearly
    specified common part and concise specification of
    differences needed for various cases). Partly this is
    messy nature of real world. Lapack is a happly case
    when problem was quite well specified and language
    was reasonable fit for the problem. They use textual
    substitution to produce real and complex variants
    for single and double precision, so in principle
    language could do more. And certainly one could wish
    nicer and more compact description of the algorithms.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Oct 24 05:56:08 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    AFAICS at the logical level the interface stays the same. There is a
    significant change: in old times you were on your own trying to interface
    to Lapack from C. Now you can get a C interface.

    And they got that wrong (by which I was personally bitten).
    See https://lwn.net/Articles/791393/ for a good write-up.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2