• [gawk] Handling variants of CSV input data formats

    From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Sun Aug 25 08:00:20 2024
    From Newsgroup: comp.lang.awk

    Myself I'm usually not using CSV format(s), but recently I advertised
    GNU Awk (given that newer versions support CSV data processing) to a
    friend seeking CSV solutions.

    I was quite astonished when I stumbled across a StackOverflow article
    about CSV processing with contemporary versions of GNU Awk and read
    that you are restricted to comma as separator and double quotes to
    enclose strings. The workarounds provided at SO were extremely clumsy.

    Given that using ',', ';', '|' (or other delimiters) and also various
    types of quotes are just a lexical (no functional) difference I wonder
    whether it would be sensible to be able to define them, say, through
    setting a PROCINFO element?

    Janis

    https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Mon Aug 26 06:26:26 2024
    From Newsgroup: comp.lang.awk

    On 8/25/2024 1:00 AM, Janis Papanagnou wrote:
    Myself I'm usually not using CSV format(s), but recently I advertised
    GNU Awk (given that newer versions support CSV data processing) to a
    friend seeking CSV solutions.

    I was quite astonished when I stumbled across a StackOverflow article
    about CSV processing with contemporary versions of GNU Awk and read
    that you are restricted to comma as separator and double quotes to
    enclose strings. The workarounds provided at SO were extremely clumsy.

    Given that using ',', ';', '|' (or other delimiters) and also various
    types of quotes are just a lexical (no functional) difference I wonder whether it would be sensible to be able to define them, say, through
    setting a PROCINFO element?

    Janis

    https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk


    FYI gawk just inherited those behaviors (plus mandatory stripping of the quotes from quoted fields, see https://lists.gnu.org/archive/html/bug-gawk/2023-11/msg00018.html) from Kernighans awk.

    Ed.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Aug 26 14:54:04 2024
    From Newsgroup: comp.lang.awk

    On 26.08.2024 13:26, Ed Morton wrote:
    On 8/25/2024 1:00 AM, Janis Papanagnou wrote:
    Myself I'm usually not using CSV format(s), but recently I advertised
    GNU Awk (given that newer versions support CSV data processing) to a
    friend seeking CSV solutions.

    I was quite astonished when I stumbled across a StackOverflow article
    about CSV processing with contemporary versions of GNU Awk and read
    that you are restricted to comma as separator and double quotes to
    enclose strings. The workarounds provided at SO were extremely clumsy.

    Given that using ',', ';', '|' (or other delimiters) and also various
    types of quotes are just a lexical (no functional) difference I wonder
    whether it would be sensible to be able to define them, say, through
    setting a PROCINFO element?

    Janis

    https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk



    FYI gawk just inherited those behaviors (plus mandatory stripping of the quotes from quoted fields, see https://lists.gnu.org/archive/html/bug-gawk/2023-11/msg00018.html) from Kernighans awk.

    Thanks.

    My opinion on this is that I wouldn't expect GNU Awk to become a (yet
    another) CSV-processor. It's very convenient to have an easy input of
    CSV data to be processed like other tabular data with Awk. So removal
    of the (outer) quotes, transforming "inner" quotes of fields according
    to the CSV-standard(s), and handling the escape symbol, would serve my expectations. (I don't need CSV-output formatting, but I understand if
    there is such a demand.)

    The (flexible) support for (at least typical) field separators is IMO
    more pressing. Whether it can be supported by PROCINFO[] (as I've [ad
    hoc] written above) or using FS is only a detail.

    But given the current implementation, this error message

    $ awk -F';' --csv '...' data.csv
    awk: warning: assignment to FS/FIELDWIDTHS/FPAT has no effect when using
    --csv

    indicates that there's already some FS consistency logic existing, so
    instead of introducing another PROCINFO attribute it would probably be
    more obvious and appear more consistent to the user to use ("re-use")
    FS for the purpose of defining the CSV field delimiter.

    And we also don't need to explicitly PROCINFO-define the used quotes
    since the quotes could anyway be identified (and handled) implicitly.
    Or are there any issues with data like

    "Hi there!",42,'Hello "world"?',"Ed's post",3.14

    to be provided (in $1..$5) as

    Hi there!,42,Hello "world"?,Ed's post,3.14


    Where I have a general unsettling feeling is when locales influence
    the processing. Personally I have defined "LC_NUMERIC=C.UTF-8" so my
    real numbers use a decimal point for the fraction anyway, but if the
    processed data will have a locale numbers-representation with commas
    then there might be issues (and not only with the CSV commas).

    I'd have liked to provide more concrete information here, but I'm at
    the moment even unable to reproduce Awk's behavior as documented in
    its manual; I've tried the following command with various locales

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    -| 5,321

    but always got just 5 as result.

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Manuel Collado@mcollado2011@gmail.com to comp.lang.awk on Mon Aug 26 19:01:02 2024
    From Newsgroup: comp.lang.awk

    El 26/8/24 a las 14:54, Janis Papanagnou escribió:

    My opinion on this is that I wouldn't expect GNU Awk to become a (yet another) CSV-processor. It's very convenient to have an easy input of
    CSV data to be processed like other tabular data with Awk. So removal
    of the (outer) quotes, transforming "inner" quotes of fields according
    to the CSV-standard(s), and handling the escape symbol, would serve my expectations. (I don't need CSV-output formatting, but I understand if
    there is such a demand.)

    Perhaps you could try my gawk-csvio pure gawk library. Just include a
    first unconditional rule calling csvimport(...) and the CSV input data
    will be automatically converted to a regular OFS-delimited record ready
    to be processed.

    Please find the library at http://mcollado.z15.es/gawk-extras/. The documentation is available also to be read before downloading.

    HTH

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Tue Aug 27 02:31:18 2024
    From Newsgroup: comp.lang.awk

    On 26.08.2024 19:01, Manuel Collado wrote:
    El 26/8/24 a las 14:54, Janis Papanagnou escribió:

    My opinion on this is that I wouldn't expect GNU Awk to become a (yet
    another) CSV-processor. It's very convenient to have an easy input of
    CSV data to be processed like other tabular data with Awk. So removal
    of the (outer) quotes, transforming "inner" quotes of fields according
    to the CSV-standard(s), and handling the escape symbol, would serve my
    expectations. (I don't need CSV-output formatting, but I understand if
    there is such a demand.)

    Perhaps you could try my gawk-csvio pure gawk library. Just include a
    first unconditional rule calling csvimport(...) and the CSV input data
    will be automatically converted to a regular OFS-delimited record ready
    to be processed.

    Please find the library at http://mcollado.z15.es/gawk-extras/. The documentation is available also to be read before downloading.

    As said upthread: "Myself I'm usually not using CSV format(s), [...]".
    But thanks, I'll forward that information to interested people I know.

    Just to clarify; my above stated opinion was meant to provide a view
    concerning the newly incorporated '--csv' feature to obtain a greater acceptance and not disappointment from people choosing and using GNU
    Awk for that purpose. Given how it got implemented it's (IMO) of very
    limited use (if you cannot control the delimiters and quotes).

    If I understand correctly that the library you mention would address
    the two topics (field separator and quoting) then there's even less a
    point (I suppose) to use the new '--csv' option in GNU Awk; just use
    your library instead?

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Mon Aug 26 19:49:00 2024
    From Newsgroup: comp.lang.awk

    On 8/26/2024 7:54 AM, Janis Papanagnou wrote:
    snip>
    I'd have liked to provide more concrete information here, but I'm at
    the moment even unable to reproduce Awk's behavior as documented in
    its manual; I've tried the following command with various locales

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    -| 5,321

    but always got just 5 as result.

    You need to specifically TELL gawk to use your locale to read input numbers:

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    5

    $ echo 4,321 | POSIXLY_CORRECT=1 LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }' 5,321

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk -N '{ print $1 + 1 }'
    5,321

    See https://www.gnu.org/software/gawk/manual/gawk.html#Locale-influences-conversions
    for more info on that.

    Regards,

    Ed
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Tue Aug 27 03:39:07 2024
    From Newsgroup: comp.lang.awk

    On 27.08.2024 02:49, Ed Morton wrote:
    On 8/26/2024 7:54 AM, Janis Papanagnou wrote:
    snip>
    I'd have liked to provide more concrete information here, but I'm at
    the moment even unable to reproduce Awk's behavior as documented in
    its manual; I've tried the following command with various locales

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    -| 5,321

    but always got just 5 as result.

    You need to specifically TELL gawk to use your locale to read input
    numbers:

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    5

    $ echo 4,321 | POSIXLY_CORRECT=1 LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }' 5,321

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk -N '{ print $1 + 1 }' 5,321

    See https://www.gnu.org/software/gawk/manual/gawk.html#Locale-influences-conversions
    for more info on that.

    Thanks. That's actually where I got above example from.

    I've missed that there was an explicit
    $ export POSIXLY_CORRECT=1
    set on the very top of these examples. Gee!

    Feels anyway strange that an explicit LC_* setting is ineffective
    without the additional POSIXLY_CORRECT variable. And the page also
    says: "The POSIX standard says that awk always uses the period as
    the decimal point when reading the awk program source code".
    So despite POSIX saying that, you have to use a variable named
    POSIXLY_CORRECT. - Do I need some more coffee to understand that?

    And I see there's an additional GNU Awk option '--use-lc-numeric'.
    What a mess!

    (I suppose current status can only be explained by the mentioned
    forth-and-back during history of various GNU Awk versions.)

    What's worth the LC_* variables if they are ignored (or maybe not).

    Janis


    Regards,

    Ed

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Manuel Collado@mcollado2011@gmail.com to comp.lang.awk on Tue Aug 27 12:20:28 2024
    From Newsgroup: comp.lang.awk

    El 27/8/24 a las 2:31, Janis Papanagnou escribió:

    If I understand correctly that the library you mention would address
    the two topics (field separator and quoting) then there's even less a
    point (I suppose) to use the new '--csv' option in GNU Awk; just use
    your library instead?

    A user decision.

    HTH


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Tue Aug 27 06:45:32 2024
    From Newsgroup: comp.lang.awk

    On 8/26/2024 8:39 PM, Janis Papanagnou wrote:
    On 27.08.2024 02:49, Ed Morton wrote:
    On 8/26/2024 7:54 AM, Janis Papanagnou wrote:
    snip>
    I'd have liked to provide more concrete information here, but I'm at
    the moment even unable to reproduce Awk's behavior as documented in
    its manual; I've tried the following command with various locales

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    -| 5,321

    but always got just 5 as result.

    You need to specifically TELL gawk to use your locale to read input
    numbers:

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    5

    $ echo 4,321 | POSIXLY_CORRECT=1 LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }' >> 5,321

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk -N '{ print $1 + 1 }'
    5,321

    See
    https://www.gnu.org/software/gawk/manual/gawk.html#Locale-influences-conversions
    for more info on that.

    Thanks. That's actually where I got above example from.

    I've missed that there was an explicit
    $ export POSIXLY_CORRECT=1
    set on the very top of these examples. Gee!

    Feels anyway strange that an explicit LC_* setting is ineffective
    without the additional POSIXLY_CORRECT variable. And the page also
    says: "The POSIX standard says that awk always uses the period as
    the decimal point when reading the awk program source code".
    So despite POSIX saying that, you have to use a variable named POSIXLY_CORRECT. - Do I need some more coffee to understand that?

    POSIXLY_CORRECT=1 (or equivalently `--posix` aka `-P`) affects numbers
    in the input your script reads (as shown in the previous post) and
    strings being converted to numbers in your code, it doesn't affect
    literal numbers in the source code for your script that awk reads.

    In the source code the decimal separator for a literal number (as
    opposed to a string being converted to a number) is always `.`.

    You can't use, say, a comma as the decimal separator in a literal number because a comma already means something in the awk syntax, e.g. `print
    4,321` means the same as "print 4 OFS 321`.

    For example, this code compares a literal number 4.321 to another
    literal number 4.1 and prints "bigger" because 4.321 is a bigger number
    than 4.1:

    $ awk 'BEGIN{ if (4.321 > 4.1) print "bigger"; else print "not" }'
    bigger

    while this attempt to use `,` as the decimal separator is a syntax error because `,` has a meaning in awk syntax:

    $ awk 'BEGIN{ if (4,321 > 4.1) print "bigger"; else print "not" }'
    awk: cmd. line:1: BEGIN{ if (4,321 > 4.1) print "bigger"; else print "not" } awk: cmd. line:1: ^ syntax error

    You can't just write the number as a string because then you're doing a
    string comparison and comparing the collation order of `.` vs `,`, not
    the values of the numbers:

    $ awk 'BEGIN{ if ("4,321" > 4.1) print "bigger"; else print "not" }'
    not

    You might think you can just add `0` to convert the string to a number:

    $ awk 'BEGIN{ if (("4,321"+0) > 4.1) print "bigger"; else print "not" }'
    not

    but by default in gawk that conversion truncates everything from the `,`
    on and so you end up comparing the number 4 to the number 4.1.

    To convert `"4,321"` to the number `4.321` in gawk you need to once
    again set your locale and tell gawk to use it:

    $ LC_ALL=en_DK.utf-8 awk 'BEGIN{ if (("4,321"+0) > 4.1) print "bigger";
    else print "not" }'
    not

    $ LC_ALL=en_DK.utf-8 awk -N 'BEGIN{ if (("4,321"+0) > 4.1) print
    "bigger"; else print "not" }'
    bigger

    $ LC_ALL=en_DK.utf-8 awk -P 'BEGIN{ if (("4,321"+0) > 4.1) print
    "bigger"; else print "not" }'
    bigger

    or in gawk instead of adding 0 you could use `strtonum()` on a string containing that value to convert it to a number. For example:

    $ LC_ALL=en_DK.utf-8 awk 'BEGIN{ if (strtonum("4,321") > 4.1) print
    "bigger"; else print "not" }'
    not

    $ LC_ALL=en_DK.utf-8 awk -N 'BEGIN{ if (strtonum("4,321") > 4.1) print "bigger"; else print "not" }'
    bigger

    $ LC_ALL=en_DK.utf-8 awk -P 'BEGIN{ if (strtonum("4,321") > 4.1) print "bigger"; else print "not" }'
    awk: cmd. line:1: fatal: function `strtonum' not defined

    In that last example above we see why `-N` is better than `-P` since
    they both do what you want with the number but `-P` also disables gawk extensions while `-N` doesn't.


    And I see there's an additional GNU Awk option '--use-lc-numeric'.

    That's just the long form of `-N`, identical in meaning.

    What a mess!

    (I suppose current status can only be explained by the mentioned forth-and-back during history of various GNU Awk versions.)

    Right.


    What's worth the LC_* variables if they are ignored (or maybe not).

    They have their uses but using them for everything by default apparently
    isn't the way people most frequently want to use awk so you need options
    to tell gawk when to use them in specific situations.

    Ed.


    Janis


    Regards,

    Ed


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Tue Aug 27 18:19:58 2024
    From Newsgroup: comp.lang.awk

    On 27.08.2024 12:20, Manuel Collado wrote:
    El 27/8/24 a las 2:31, Janis Papanagnou escribió:

    If I understand correctly that the library you mention would address
    the two topics (field separator and quoting) then there's even less a
    point (I suppose) to use the new '--csv' option in GNU Awk; just use
    your library instead?

    A user decision.

    No doubt.

    HTH

    Not the least for someone trying to understand any advantage of '--csv'
    given the existence of your (obviously more generalized) CSV library.

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Tue Aug 27 18:23:25 2024
    From Newsgroup: comp.lang.awk

    On 27.08.2024 13:45, Ed Morton wrote:
    On 8/26/2024 8:39 PM, Janis Papanagnou wrote:

    I've missed that there was an explicit
    $ export POSIXLY_CORRECT=1
    set on the very top of these examples. Gee!

    POSIXLY_CORRECT=1 (or equivalently `--posix` aka `-P`) affects numbers
    in the input your script reads (as shown in the previous post) and
    strings being converted to numbers in your code, it doesn't affect
    literal numbers in the source code for your script that awk reads.

    In the source code the decimal separator for a literal number (as
    opposed to a string being converted to a number) is always `.`.

    You can't use, say, a comma as the decimal separator in a literal number because a comma already means something in the awk syntax, e.g. `print
    4,321` means the same as "print 4 OFS 321`.

    [examples and explanations]

    Thanks.

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114