GNU bug report logs - #61300
wc -c doesn't advance stdin position when it's a regular file


Package: coreutils

Reported by: Stephane Chazelas <stephane <at> chazelas.org>

Date: Sun, 5 Feb 2023 18:28:02 UTC

Severity: normal

To reply to this bug, email your comments to 61300 AT debbugs.gnu.org.



Report forwarded to bug-coreutils <at> gnu.org:
bug#61300; Package coreutils. (Sun, 05 Feb 2023 18:28:02 GMT)

Acknowledgement sent to Stephane Chazelas <stephane <at> chazelas.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 05 Feb 2023 18:28:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane <at> chazelas.org>
To: bug-coreutils <at> gnu.org
Subject: wc -c doesn't advance stdin position when it's a regular file
Date: Sun, 5 Feb 2023 18:27:28 +0000
"wc -c" without filename arguments is meant to read stdin til
EOF and report the number of bytes it has read.

When stdin is on a regular file, GNU wc has that optimisation
whereby it skips the reading, does a pos = lseek(0,0,SEEK_CUR)
to find out its current position within the file, fstat(0) and
reports st_size - pos (assuming st_size > pos).

However, it does not move the position to the end of the file.
That means for instance that:

$ echo test > file
$ { wc -c; wc -c; } < file
5
5

instead of 5, then 0. Likewise, cat still sees the whole file:

$ { wc -c; cat; } < file
5
test

So the optimisation is incomplete.
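
To make the reported behaviour concrete, here is a minimal C sketch of
the shortcut as described above. It is not the coreutils source:
count_without_seeking and demo_count_without_seeking are hypothetical
names used only for illustration, and the temporary file path is an
assumption. Because the offset is never moved, two successive calls on
the same descriptor report the same count.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report st_size - pos WITHOUT moving the file
   offset, which is the behaviour the report describes.
   Returns -1 on error.  */
off_t
count_without_seeking (int fd)
{
  assert (fd >= 0);
  struct stat st;
  off_t pos = lseek (fd, 0, SEEK_CUR);
  if (pos < 0 || fstat (fd, &st) != 0)
    return -1;
  return st.st_size > pos ? st.st_size - pos : 0;
}

/* Returns 0 if two successive counts on a 5-byte file both yield 5,
   i.e. the second caller does not see the input as consumed.  */
int
demo_count_without_seeking (void)
{
  char path[] = "/tmp/wc-demo-XXXXXX";
  int fd = mkstemp (path);
  if (fd < 0)
    return -1;
  unlink (path);
  int ok = write (fd, "test\n", 5) == 5
           && lseek (fd, 0, SEEK_SET) == 0
           && count_without_seeking (fd) == 5
           && count_without_seeking (fd) == 5;
  close (fd);
  return ok ? 0 : 1;
}
```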

It also reports the size of the file even if it could not possibly read it
because it's not open in read mode:

$ { wc -c; } 0>> file
5

IMO, it should only do the optimisation if:
- fcntl(F_GETFL) shows that the file is opened in O_RDONLY or O_RDWR
- the current checks for /proc- and /sys-like filesystems pass
- pos < st_size
- lseek(0,st_size,SEEK_SET) is successful.

(that leaves a race window above where it could move the cursor
backward, but I would think that can be ignored as if something
else reads at the same time, there's not much we can expect
anyway).
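
The conditions listed above can be sketched in C roughly as follows.
This is an illustration under stated assumptions, not the coreutils
patch: count_by_seeking and demo_count_by_seeking are hypothetical
names, the /proc- and /sys-style filesystem checks are left out, and a
return value of -1 simply means "the shortcut does not apply, fall back
to read()".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper.  Returns the number of bytes between the current
   offset and EOF and moves the offset to EOF, or -1 if the shortcut
   does not apply and the caller must read() instead.  */
off_t
count_by_seeking (int fd)
{
  assert (fd >= 0);

  /* Only take the shortcut if the descriptor is open for reading.  */
  int flags = fcntl (fd, F_GETFL);
  if (flags < 0)
    return -1;
  int acc = flags & O_ACCMODE;
  if (acc != O_RDONLY && acc != O_RDWR)
    return -1;

  struct stat st;
  if (fstat (fd, &st) != 0 || !S_ISREG (st.st_mode))
    return -1;

  /* Require pos < st_size.  */
  off_t pos = lseek (fd, 0, SEEK_CUR);
  if (pos < 0 || pos >= st.st_size)
    return -1;

  /* Consume the input by moving the offset to the end.  */
  if (lseek (fd, st.st_size, SEEK_SET) < 0)
    return -1;

  return st.st_size - pos;
}

/* Returns 0 if the helper counts 5 bytes once, leaves the offset at
   EOF, and declines the shortcut for a write-only descriptor.  */
int
demo_count_by_seeking (void)
{
  char path[] = "/tmp/wc-demo-XXXXXX";
  int rw = mkstemp (path);            /* mkstemp opens read-write */
  if (rw < 0)
    return -1;
  int wo = open (path, O_WRONLY);     /* separate write-only descriptor */
  unlink (path);
  int ok = wo >= 0
           && write (rw, "test\n", 5) == 5
           && lseek (rw, 0, SEEK_SET) == 0
           && count_by_seeking (rw) == 5    /* counts and seeks to EOF */
           && lseek (rw, 0, SEEK_CUR) == 5
           && count_by_seeking (rw) == -1   /* at EOF: caller reads, gets 0 */
           && count_by_seeking (wo) == -1;  /* not readable: no shortcut */
  close (rw);
  if (wo >= 0)
    close (wo);
  return ok ? 0 : 1;
}
```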

-- 
Stephane




Information forwarded to bug-coreutils <at> gnu.org:
bug#61300; Package coreutils. (Sun, 05 Feb 2023 20:01:02 GMT)

Message #8 received at 61300 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Stephane Chazelas <stephane <at> chazelas.org>, 61300 <at> debbugs.gnu.org
Subject: Re: bug#61300: wc -c doesn't advance stdin position when it's a
 regular file
Date: Sun, 5 Feb 2023 19:59:58 +0000
On 05/02/2023 18:27, Stephane Chazelas wrote:
> "wc -c" without filename arguments is meant to read stdin til
> EOF and report the number of bytes it has read.
> 
> When stdin is on a regular file, GNU wc has that optimisation
> whereby it skips the reading, does a pos = lseek(0,0,SEEK_CUR)
> to find out its current position within the file, fstat(0) and
> reports st_size - pos (assuming st_size > pos).
> 
> However, it does not move the position to the end of the file.
> That means for instance that:
> 
> $ echo test > file
> $ { wc -c; wc -c; } < file
> 5
> 5
> 
> instead of 5, then 0. Likewise, cat still sees the whole file:
> 
> $ { wc -c; cat; } < file
> 5
> test
> 
> So the optimisation is incomplete.
> 
> It also reports the size of the file even if it could not possibly read it
> because it's not open in read mode:
> 
> $ { wc -c; } 0>> file
> 5
> 
> IMO, it should only do the optimisation if:
> - fcntl(F_GETFL) shows that the file is opened in O_RDONLY or O_RDWR
> - the current checks for /proc- and /sys-like filesystems pass
> - pos < st_size
> - lseek(0,st_size,SEEK_SET) is successful.
> 
> (that leaves a race window above where it could move the cursor
> backward, but I would think that can be ignored as if something
> else reads at the same time, there's not much we can expect
> anyway).

Yes I agree.

Adjusting would also avoid the following inconsistencies:

$ { wc -c; wc -c; } < file
5
5

$ { wc -l; wc -l; } < file
1
0

$ truncate -s $(getconf PAGESIZE) file
$ { wc -c; wc -c; } < file
4096
0

Hopefully the attached addresses this.
Note it doesn't add the constraint on the input being readable,
which I'll think a bit more about.

cheers,
Pádraig
[wc-update-offset.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#61300; Package coreutils. (Sun, 05 Feb 2023 21:00:02 GMT)

Message #11 received at 61300 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>,
 Stephane Chazelas <stephane <at> chazelas.org>, 61300 <at> debbugs.gnu.org
Subject: Re: bug#61300: wc -c doesn't advance stdin position when it's a
 regular file
Date: Sun, 5 Feb 2023 12:59:49 -0800
On 2023-02-05 11:59, Pádraig Brady wrote:

> Hopefully the attached addresses this. 

Thanks for fixing that.

> Note it doesn't add the constraint on the input being readable,
> which I'll think a bit more about.

Let's leave that as-is, please. If 'wc' can output the correct value 
without reading its input, POSIX does not require 'wc' to do the read, 
and it seems perverse to modify 'wc' to go to the effort to refuse to 
tell the user useful information that the user requested and that 'wc' 
knows.





Information forwarded to bug-coreutils <at> gnu.org:
bug#61300; Package coreutils. (Mon, 06 Feb 2023 06:28:01 GMT)

Message #14 received at 61300 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane <at> chazelas.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Pádraig Brady <P <at> draigbrady.com>, 61300 <at> debbugs.gnu.org
Subject: Re: bug#61300: wc -c doesn't advance stdin position when it's a
 regular file
Date: Mon, 06 Feb 2023 06:27:02 +0000
On 2023-02-05 20:59, Paul Eggert wrote:
> On 2023-02-05 11:59, Pádraig Brady wrote:
[...]
> Let's leave that as-is, please. If 'wc' can output the correct value
> without reading its input, POSIX does not require 'wc' to do the read,
> and it seems perverse to modify 'wc' to go to the effort to refuse to
> tell the user useful information that the user requested and that 'wc'
> knows.
[...]

But while I would agree it's very unlikely to ever be hit in practice,
as I can't think of any reason why one would call wc with its input not
open for reading, wc is meant to report how many bytes it has read, not
the size of its input (though POSIX seems ambiguous on that).

See also (with Pádraig's patch applied):

$ { echo test > file; wc -c; echo test2 >&0; cat file; } 0> file
5
test
test2

wc has lseek()ed to the end of the file even though it was opened in 
write-only mode. Compare with:

$ { echo test > file; wc -lc; echo test2 >&0; cat file; } 0> file
wc: 'standard input': Bad file descriptor
0 0
test2

-- 
Stephane




Information forwarded to bug-coreutils <at> gnu.org:
bug#61300; Package coreutils. (Mon, 06 Feb 2023 19:39:01 GMT)

Message #17 received at 61300 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Stephane Chazelas <stephane <at> chazelas.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 61300 <at> debbugs.gnu.org
Subject: Re: bug#61300: wc -c doesn't advance stdin position when it's a
 regular file
Date: Mon, 6 Feb 2023 19:38:24 +0000
On 06/02/2023 06:27, Stephane Chazelas wrote:
> On 2023-02-05 20:59, Paul Eggert wrote:
>> On 2023-02-05 11:59, Pádraig Brady wrote:
> [...]
>> Let's leave that as-is, please. If 'wc' can output the correct value
>> without reading its input, POSIX does not require 'wc' to do the read,
>> and it seems perverse to modify 'wc' to go to the effort to refuse to
>> tell the user useful information that the user requested and that 'wc'
>> knows.
> [...]
> 
> But while I would agree it's very unlikely to ever be hit in practice,
> as I can't think of any reason why one would call wc with its input not
> open for reading, wc is meant to report how many bytes it has read, not
> the size of its input (though POSIX seems ambiguous on that).
> 
> See also (with Pádraig's patch applied):
> 
> $ { echo test > file; wc -c; echo test2 >&0; cat file; } 0> file
> 5
> test
> test2
> 
> wc has lseek()ed to the end of the file even though it was opened in
> write-only mode. Compare with:
> 
> $ { echo test > file; wc -lc; echo test2 >&0; cat file; } 0> file
> wc: 'standard input': Bad file descriptor
> 0 0
> test2

Some more thoughts on this.

Note the orig thread with motivation for the st_size optimization is at:
https://lists.gnu.org/archive/html/coreutils/2016-03/msg00020.html
Note also wc -c has had an st_size optimization for all sizes
since the very first coreutils implementation.

A similar edge case to Stephane's above is also seen when doing
the lseek(near_end)+read() method, as shown by:

  $ { truncate -s 32768 file; wc -c; wc -c; } 0> file
  wc: 'standard input': Bad file descriptor
  28679
  wc: 'standard input': Bad file descriptor
  0

One possible solution to avoid the above issue is:

  start_pos = lseek(0, 0, SEEK_CUR);
  bytes += lseek(near_end); did_lseek = true;
  while (read())
    {
      if (did_lseek && read error is EBADF or EINVAL)
        {
          lseek(start_pos); did_lseek = false; bytes = 0; continue;
        }
    }

That would also fix an issue I saw for one file in /sys, where:
  /sys/devices/pci0000:00/0000:00:02.0/rom
  st_size = 131072, available bytes = 0, wc -c = 127007 (EINVAL)
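
For illustration, the seek-then-read fallback outlined in the pseudocode
above could be written in C along these lines. This is a sketch under
assumptions, not the code in wc: count_seek_then_read and its demo are
hypothetical names, and near_end is passed in by the caller rather than
derived from st_size. If the first read() after the seek fails with
EBADF or EINVAL, the offset is restored and the count restarts from
zero, so an unreadable descriptor is left where it started.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper.  Skips ahead to NEAR_END, then reads to EOF.
   Returns the byte count from the starting offset, or -1 on a read
   error (with the offset restored if the skip had been taken).  */
off_t
count_seek_then_read (int fd, off_t near_end)
{
  assert (fd >= 0);
  char buf[4096];
  off_t bytes = 0;
  int did_lseek = 0;

  off_t start_pos = lseek (fd, 0, SEEK_CUR);
  if (start_pos >= 0 && near_end > start_pos
      && lseek (fd, near_end, SEEK_SET) >= 0)
    {
      bytes = near_end - start_pos;
      did_lseek = 1;
    }

  for (;;)
    {
      ssize_t n = read (fd, buf, sizeof buf);
      if (n > 0)
        bytes += n;
      else if (n == 0)
        return bytes;
      else if (errno == EINTR)
        continue;
      else if (did_lseek && (errno == EBADF || errno == EINVAL))
        {
          /* Undo the skip and retry as a plain read loop.  */
          if (lseek (fd, start_pos, SEEK_SET) < 0)
            return -1;
          did_lseek = 0;
          bytes = 0;
        }
      else
        return -1;
    }
}

/* Returns 0 if a 32768-byte file is counted in full through a read-only
   descriptor, and a write-only descriptor fails with its offset
   unchanged.  */
int
demo_count_seek_then_read (void)
{
  char path[] = "/tmp/wc-demo-XXXXXX";
  int rw = mkstemp (path);
  if (rw < 0)
    return -1;
  int rd = open (path, O_RDONLY);
  int wo = open (path, O_WRONLY);
  unlink (path);
  int ok = rd >= 0 && wo >= 0
           && ftruncate (rw, 32768) == 0
           && count_seek_then_read (rd, 28679) == 32768
           && count_seek_then_read (wo, 28679) == -1
           && lseek (wo, 0, SEEK_CUR) == 0;
  close (rw);
  if (rd >= 0)
    close (rd);
  if (wo >= 0)
    close (wo);
  return ok ? 0 : 1;
}
```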

Doing that method for all file sizes rather than just using st_size,
would work but also penalize perf for the common case.
Consider cached stats on a network file system for example.
So I guess, to be able to keep the st_size optimization with stdin
and stay consistent with the other cases, we could also verify that
stdin is readable and restrict the optimization to that case.

Note this is only an issue for stdin. Files specified on the command line
and explicitly opened, should get a permission error at that stage.

Note also if you really want to read, you can always `cat | wc -c`
rather than just `wc -c`, so I'm still not sure we should
add the readable restriction for stdin, though I'm not strongly
against it either, since it is such an edge case.

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#61300; Package coreutils. (Mon, 06 Feb 2023 19:51:01 GMT)

Message #20 received at 61300 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>,
 Stephane Chazelas <stephane <at> chazelas.org>
Cc: 61300 <at> debbugs.gnu.org
Subject: Re: bug#61300: wc -c doesn't advance stdin position when it's a
 regular file
Date: Mon, 6 Feb 2023 11:50:37 -0800
On 2/6/23 11:38, Pádraig Brady wrote:
> Note also if you really want to read, you can always `cat | wc -c`
> rather than just `wc -c`

Even that's not guaranteed, as 'cat' is not required to use the 'read' 
system call if it can determine that the standard input contains only 
NULs without calling 'read'. (GNU 'cat' doesn't do this, but POSIX 
allows it.)

We shouldn't complicate 'wc' (thus slowing it down and worse, possibly 
introducing a bug) if the only goal is to make 'wc' fail more often in 
implausible scenarios.




This bug report was last modified 2 years and 134 days ago.
