Package: coreutils
Reported by: "Polehn, Mike A" <mike.a.polehn <at> intel.com>
Date: Fri, 10 Oct 2014 17:30:02 UTC
Severity: normal
Done: Bob Proulx <bob <at> proulx.com>
Bug is archived. No further changes may be made.
From: Linda Walsh <coreutils <at> tlinx.org>
To: Bob Proulx <bob <at> proulx.com>
Cc: mike.a.polehn <at> intel.com, 18681 <at> debbugs.gnu.org
Subject: bug#18681: cp Specific fail example
Date: Sun, 19 Oct 2014 23:20:00 -0700
Bob Proulx wrote:
> Linda Walsh wrote:
>> Bob Proulx wrote:
>>> Also consider that if cp were to acquire all of the enhancements
>>> that have been requested for cp as time has gone by then cp would
>>> be just as featureful (bloated!) as rsync and likely just as slow
>>> as rsync too.
>> Nope... rsync is slow because it does everything over a
>> client-server model -- even when it is local. So everything is
>> written through a pipe. That's why it can't come close to cp -- and
>> why cp would never be so slow -- I can't imagine it using a pipe to
>> copy a file anywhere!
>
> The client-server structure of rsync is required for copying between
> systems. Saying that cp doesn't have it isn't fair if cp were to add
> every requested feature.
---
cp was designed for local->local copy. rsync was designed for
local->remote synchronization (thus 'r(emote) sync'). Comparing a
Java-to-native-code compiler against a compiler developed for a native
platform is entirely fair even though the two started out with
different design goals -- each ends up with pluses and minuses that
are an effect of its goal. If you claim comparing such effects isn't
fair, then it's not fair to compare any algorithm with another,
because algorithms inherently have their pluses and minuses and are
often chosen for a particular situation because of them.

So let's compare using 'cp' with rsync in copying a remote file. The
choice of tools depends on the quality of the remote connection, but
for most remote connections today, reliability isn't usually an issue:
they flow over TCP, and file-transfer protocols like NFS or CIFS also
have checks that allow users to reconnect after an interruption (like
a machine reboot). Depending on timeout settings, 'cp' already has a
degree of restart ability when used over NFS or CIFS.
CIFS doesn't tolerate a system reboot in the middle of a copy, whereas
NFS can recover from one if the client uses hard mounts. But on a
local network I regularly use 'cp' with CIFS, and it does a faster job
than rsync -- over a reliable local network.

> I am sure that if I search the archives I
> would find a request to add client-server structure to cp to support
> copying from system to system. :-)
----
We are comparing where the tools are at, _not_ where they _could_ have
been had previous algorithm choices been ignored. We are talking about
a local->local copy (in the base note), so glossing over the slowness
of rsync in doing such is entirely fair. If you want some level of
recovery after an interrupt, NFS is a better choice for a local
network -- client connections can continue even after a server reboot.
But if we are talking local->local reliability, the simple,
close-at-hand solution would be SMB/CIFS.

Using a 1GB file as an example (and throwing in a 'dd' for comparison):

> time rsync 1G ishtar:/home/law/1G
20.13sec 1.29usr 2.68sys (19.73% cpu)
> time cp 1G /h/.
6.94sec 0.01usr 1.10sys (16.16% cpu)
> time dd if=1G of=/h/1G bs=256M oflag=direct
4+0 records in
4+0 records out
1073741824 bytes (1.1 GB) copied, 3.4694 s, 309 MB/s
3.50sec 0.00usr 0.51sys (14.64% cpu)

Here again, we see rsync taking about 3x the time to do the same job
as cp. For a single file over a local net, 'dd' is a better bet.

> Now I will proactively agree that it would be nice if rsync detected
> that it was all running locally and didn't fork and instead ran
> everything in one process like cp does. But I could see that coming
> to rsync at some time in the future. It is an often requested
> feature.
---
For many years.

>>> This is something to consider every time someone asks for a
>>> creeping feature to cp. Especially if they say they want the
>>> feature in cp because it is faster than rsync. The natural
>>> progression is that cp would become rsync.
>> Not even! Note:
>> cp already has a comparison function
>> built in that it uses during "cp -u"...
>
> I am not convinced of the robustness of 'cp -u ...' interrupt,
> repeat, interrupt, repeat. It wasn't intended for that mode.
---
Neither is rsync in its default mode. It compares timestamps and size,
nothing more. I'd be suspicious of either rsync OR cp's chances in
such a situation. But USUALLY, people don't interrupt a copy many
times -- or even once -- so cp is usually faster...

> Is there any code path that could leave a new file in the target area
> that would avoid copy? Not sure. Newer meets the -u test but isn't
> an exact copy if the time stamp were older in the original. But with
> rsync I know it will correct for this during a subsequent run.
---
Not necessarily. It doesn't do checksumming by default. Certainly, if
you used rsync with '-u', rsync will not be much better in recovery,
since target files with more recent timestamps may be left in the
target dir. I don't think rsync or cp trap a control-c abort to clean
up target files.

>> built in that it uses during "cp -u"... but it doesn't go through
>> pipes. It used to use larger buffer sizes or maybe tell posix
>> to pre-alloc the destination space, dunno, but it used to be
>> faster... I can't say for certain, but it seems to be using
>
> Often the data sizes we work with grow larger over time making the
> same task feel slower because we are actually dealing with more data
> now.
---
I was comparing copy times with the same files, not from years ago to
now.

>> Another reason rsync is so slow -- it uses
>> a relatively small i/o size, 1-4k last I looked. I've asked them
>> to increase it, but going through a pipe it won't help a lot.
>
> Nod. Rsync was designed for the network use case. It could benefit
> with some tuning for the local case. A topic for the rsync list.
---
Been there, done that. Still comparing current-to-current, not
hypotheticals.
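The timestamp-only comparison both sides describe for 'cp -u' is easy
to demonstrate. A minimal sketch (file names are made up for
illustration; assumes GNU touch's -d option):

```shell
#!/bin/sh
# Show that 'cp -u' compares only timestamps, not content: a newer
# destination with different content is left untouched.
set -e
printf 'source data\n'  > src.txt
printf 'stale target\n' > dst.txt
touch -d '2020-01-01' src.txt   # backdate the source (GNU touch)
cp -u src.txt dst.txt           # -u: copy only if source is newer
cat dst.txt                     # still prints "stale target"
```

So a target that was written more recently than the source -- for
example, a partial file left by an interrupted copy -- would survive a
'cp -u' re-run, which is exactly the recovery hazard discussed above.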
>
>> Also in rsync, they've added the posix calls to reserve
>> space in the target location for a file being copied in.
>> Specifically, this is to lower disk fragmentation (does
>> cp do anything like that? been a while since I looked).
>
> I don't know. It would be worth a look.
>
>>> The advantage of rsync is that it can be interrupted and restarted
>>> and the restarted process will efficiently avoid doing work that is
>>> already done. An interrupted and restarted cp will perform the same
>>> work again from start to finish.
>> I wouldn't trust that it would. If you interrupt it at exactly
>> the wrong time, I'd be afraid some file might get set with the right
>> data but the wrong meta info (acls, primarily).
>
> The design of rsync is to copy the file to a temporary name beside
> the intended target. After the copy, the timestamps are set. After
> the timestamps are set, the file is renamed into place. An interrupt
> that happens before that rename will cause the temporary file to be
> removed. An interrupt that happens after the rename is, well, after
> that, and the copy is already done. Since rename on the local file
> system is atomic this is guaranteed to function robustly. (As long
> as you aren't using a buggy file system that changes the order of
> operations. That isn't cool. But of course it was famously seen in
> ext4 for a while. Fortunately sanity has prevailed and ext4 doesn't
> do that for this operation anymore. Okay to use now.)
>
>>> If I am doing a simple copy from A to B then I use 'cp -av A B'.
>>> If I am doing it the second time then I will use rsync to avoid
>>> repeating previously done work: 'rsync -av A B'.
>> Wouldn't cp -auv A B do the same?
>
> Do I have to go look at the source code to verify that it doesn't?
> :-(
---
My timing says cp is 20x faster for that 1G file case.
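(The write-aside-then-rename discipline Bob describes can be sketched
in a few lines of shell; file names here are hypothetical. Because
rename(2) within one filesystem is atomic, an interrupt leaves either
the old target or the complete new one, never a half-written target:)

```shell
#!/bin/sh
# Sketch of an rsync-style update: copy to a temporary name beside the
# target, set metadata, then atomically rename into place.
set -e
printf 'payload\n' > source.dat
tmp="target.dat.tmp.$$"
cp source.dat "$tmp"          # full copy beside the intended target
touch -r source.dat "$tmp"    # carry the source timestamp over first
mv -f "$tmp" target.dat       # atomic rename into place
cmp -s source.dat target.dat && echo "target complete"
```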
It also shows that rsync doesn't use a tmp file in the update case:

> time cp -au 1G /h
0.03sec 0.00usr 0.03sys (79.47% cpu)
> cp -au 1G /h
> time rsync -au 1G ishtar:/home/law/1G
0.60sec 0.06usr 0.09sys (25.12% cpu)

> I assume it doesn't without looking. I assume cp copies in place. I
> assume that cp does not make a temporary file off to the side and
> rename it into place once it is done and has set the timestamps.
---
I assume rsync doesn't either -- if it is comparing against a file
already in place, for it to transfer the whole file... nope.

> I assume that cp copies to the named destination directly and updates
> the timestamps afterward. That creates a window of time when the
> file is in place but has not had the timestamp placed on it yet.
>
> Which means that if the cp is interrupted on a large file that it
> will have started the copy but will not have finished it at the
> moment that it is interrupted. The new file will be in place with a
> new timestamp. The second run with cp -u will avoid overwriting the
> file because the timestamp is newer. However the contents of the
> file will be incomplete, or at least not matching the source copy at
> the time of the second copy.
>
> If my assumptions in the above are wrong please correct me. I will
> learn something. But the operating model would need to be the same
> portably across all portable systems covered by posix before I would
> consider it actually safe to use.
---
The same happens in rsync -- no tmp file is involved. It compares
time stamps and doesn't copy.

>>> If I want progress indication... If I want placement of backup
>>> files in a particular directory... If I want other fancy features
>>> that are provided by rsync then it is worth it to use rsync.
>>> ...trimmed simple benchmark...
>>> $ time cp -a coreutils junk/
>> By default cp -a transfers acls and ext-attrs and preserves
>> hard links. Rsync doesn't do any of that by default.
>> You need to use "-aHAX" to compare them ...
>
> Good catch. :-)
>
>> you have to call them
>> out as 'extra' with rsync, so the above test may not be what it
>> seems. Though if you don't use ACLs (which I do), then maybe the
>> above is almost reasonable. Still... should use -aHAX.
>
> I didn't have any hard links, ACLs, or extended attributes in the
> test case, so it shouldn't matter for the above.
>
>> Is your rsync newer? i.e. does it have the posix pre-alloc
>> hints?... Mine has a pre-alloc patch, but I think that was
>> suse-added and not the one in the mainline code. Not sure.
>>
>> rsync --version
>> rsync version 3.1.0 protocol version 31
>> 64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
>> socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
>> append, ACLs, xattrs, iconv, symtimes, prealloc, SLP
>
> I happened to run that test on Debian Sid and it is 3.1.1. However
> Debian Stable, which I have most widely deployed, has 3.0.9. So you
> are both ahead of and behind me at the same time. :-)
>
>> Throw a few TB copies at rsync -- where all the data
>> won't fit in memory... it also, I'm told, has problems with
>> hardlinks, acls and xattrs slowing it down, so it may be a
>> matter of usage...
>
> I have had problems running rsync with -H for large data sets. Bad
> enough that I recommend against it. Don't do it! I don't know
> anything about -A and -X. But rsync -a is fine for very large data
> sets.
----
But then you can't compare to 'cp', which does handle that case.

>> (don't ya just love performance talk?)
>
> Except that we should have moved all of this to the discussion list.
---
:-( "discussion list"? -- bugs-coreutils? (don't know about
others)... 'Sides, I didn't bring up rsync; all I added was "If rsync
wasn't so slow at local I/O... *sigh*..." It's good for when you need
"diffs", but not as a general replacement for 'cp'.
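The hard-link half of the "cp -a vs. plain rsync -a" point above can
be checked directly. A minimal sketch (directory names made up;
assumes GNU stat's -c option, as on a coreutils system):

```shell
#!/bin/sh
# Show that plain 'cp -a' preserves hard links inside a copied tree,
# behavior rsync matches only when also given -H.
set -e
rm -rf tree tree-copy
mkdir tree
printf 'shared\n' > tree/a
ln tree/a tree/b                  # a and b now share one inode
cp -a tree tree-copy              # archive copy of the whole tree
ino_a=$(stat -c %i tree-copy/a)   # inode numbers in the copy
ino_b=$(stat -c %i tree-copy/b)
[ "$ino_a" = "$ino_b" ] && echo "hard link preserved"
```

With 'rsync -a' (no -H), the copied a and b would instead be two
independent files, which is why a fair benchmark needs -aHAX.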
GNU bug tracking system