GNU bug report logs - #13032
24.3.50; Request: Provide a `delete-duplicate-lines' command

Package: emacs;

Reported by: Dani Moncayo <dmoncayo <at> gmail.com>

Date: Thu, 29 Nov 2012 19:26:01 UTC

Severity: wishlist

Found in version 24.3.50

Done: Juri Linkov <juri <at> jurta.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 13032 in the body.
You can then email your comments to 13032 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Thu, 29 Nov 2012 19:26:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Dani Moncayo <dmoncayo <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 29 Nov 2012 19:26:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.3.50; Request: Provide a `delete-duplicate-lines' command
Date: Thu, 29 Nov 2012 20:23:16 +0100
Severity: wishlist

Recent versions of MS-Excel and also LibreOffice's Calc have a feature
that I find very useful: the ability to remove duplicate lines from a
given list (range).  I think it would be worth adding such a feature
to Emacs.

That is: provide a function `delete-duplicate-lines' (or some such)
that removes all duplicate lines in the active region and prints in
the echo area a message like "Duplicate lines removed: <n>".

TIA.

PS: There has been some discussion about this in this thread:
http://lists.gnu.org/archive/html/help-gnu-emacs/2012-11/msg00417.html.
 Jambunathan K provided a possible implementation, but it lacks the
message in the echo area (which I think is important).


In GNU Emacs 24.3.50.1 (i386-mingw-nt6.1.7601)
 of 2012-11-28 on MS-W7-DANI
Bzr revision: 111021 jay.p.belanger <at> gmail.com-20121128045113-o6xvwncuryx8al3u
Windowing system distributor `Microsoft Corp.', version 6.1.7601
Configured using:
 `configure --with-gcc (4.7) --no-opt --enable-checking --cflags
 -Ic:/emacs/libs/libXpm-3.5.10/include -Ic:/emacs/libs/libXpm-3.5.10/src
 -Ic:/emacs/libs/libpng-1.2.37-lib/include -Ic:/emacs/libs/zlib-1.2.5
 -Ic:/emacs/libs/giflib-4.1.4-1-lib/include
 -Ic:/emacs/libs/jpeg-6b-4-lib/include
 -Ic:/emacs/libs/tiff-3.8.2-1-lib/include
 -Ic:/emacs/libs/libxml2-2.7.8-w32-bin/include/libxml2
 -Ic:/emacs/libs/gnutls-3.0.9-w32-bin/include
 -Ic:/emacs/libs/libiconv-1.9.2-1-lib/include'

Important settings:
  value of $LANG: ENU
  locale-coding-system: cp1252
  default enable-multibyte-characters: t

-- 
Dani Moncayo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Thu, 29 Nov 2012 20:53:02 GMT) Full text and rfc822 format available.

Message #8 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: Dani Moncayo <dmoncayo <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Thu, 29 Nov 2012 21:49:47 +0100
On Thu, Nov 29, 2012 at 8:23 PM, Dani Moncayo <dmoncayo <at> gmail.com> wrote:
> Severity: wishlist

> That is: provide a function `delete-duplicate-lines' (or some such)
> that removes all duplicate lines in the active region and prints in
> the echo area a message like "Duplicate lines removed: <n>".

Perhaps you can work from this (not very well tested):

(defun delete-duplicate-lines (beg end)
  "Delete consecutive duplicate lines in region BEG..END."
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region beg end)
      (goto-char beg)
      (let ((kill-whole-line t)
            (last (buffer-substring (line-beginning-position)
                                    (line-end-position)))
            (removed 0)
            current)
        (forward-line 1)
        (while (and (< (point) (or end 1))
                    (not (eobp)))
          (setq current (buffer-substring (line-beginning-position)
                                          (line-end-position)))
          (if (string= last current)
              (progn
                (kill-line)
                (setq removed (1+ removed)))
            (setq last current)
            (forward-line 1)))
        (message "Duplicate lines removed: %d" removed)))))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Thu, 29 Nov 2012 21:46:02 GMT) Full text and rfc822 format available.

Message #11 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juanma Barranquero <lekktu <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Thu, 29 Nov 2012 22:43:53 +0100
> Perhaps you can work from this (not very well tested):

Thank you Juanma.  I've given it a quick try and it seems to work.

I've only seen a minor detail that I don't like: when the command does
nothing (because there are no consecutive duplicate lines), the region
remains active.  But this is a general problem in Emacs which I've
already complained about (bug #10056).  IMO, the mark should be
deactivated after every command that operates on the active region,
without regard to whether the buffer was changed or not.  There could
be some exception, but this should be the general principle.

I'll put your version in my init file for now, while the maintainers
decide whether it is appropriate to add this command to Emacs or not.

Thanks.

-- 
Dani Moncayo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Thu, 29 Nov 2012 22:48:02 GMT) Full text and rfc822 format available.

Message #14 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: Dani Moncayo <dmoncayo <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Thu, 29 Nov 2012 23:45:02 +0100
> I've only seen a minor detail that I don't like: when the command does
> nothing (because there are no consecutive duplicate lines), the region
> remains active.

Add a call to deactivate-mark at the end.

    Juanma




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 00:42:01 GMT) Full text and rfc822 format available.

Message #17 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Dani Moncayo <dmoncayo <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 02:31:21 +0200
> That is: provide a function `delete-duplicate-lines' (or some such)
> that removes all duplicate lines in the active region and prints in
> the echo area a message like "Duplicate lines removed: <n>".

This is what I currently use to delete duplicate lines:

  C-u M-| awk -- '!a[$0]++' RET

Do you intend to create a Lisp function with the same result?
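
For readers who, like the reporter, do not know awk: the one-liner above relies on an associative array keyed by the whole line ($0). A minimal sketch on made-up sample data (the variable names are illustrative only) shows the effect outside Emacs:

```shell
# Sample input with non-adjacent duplicates ("apple" and "banana").
input=$(printf 'apple\nbanana\napple\ncherry\nbanana\n')

# a[$0]++ yields the number of times the line was seen before (0 on first
# sight) and then increments it; "!" turns that 0 into true, so awk prints
# each distinct line only the first time it appears.
deduped=$(printf '%s\n' "$input" | awk -- '!a[$0]++')

printf '%s\n' "$deduped"
```

Only the first occurrence of each line survives, and the original order of the remaining lines is preserved.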




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 00:50:02 GMT) Full text and rfc822 format available.

Message #20 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: 13032 <at> debbugs.gnu.org, Dani Moncayo <dmoncayo <at> gmail.com>
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 01:46:15 +0100
On Fri, Nov 30, 2012 at 1:31 AM, Juri Linkov <juri <at> jurta.org> wrote:

>   C-u M-| awk -- '!a[$0]++' RET

Isn't

  C-u M-| uniq RET

shorter and easier to type?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 00:54:01 GMT) Full text and rfc822 format available.

Message #23 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: 13032 <at> debbugs.gnu.org, Dani Moncayo <dmoncayo <at> gmail.com>
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 01:50:25 +0100
(FWIW, yes, I'm aware that your awk script and uniq don't do the same
thing, but I think what Dani requested was in fact removing
consecutive duplicates...)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 01:02:01 GMT) Full text and rfc822 format available.

Message #26 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Juanma Barranquero <lekktu <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org, Dani Moncayo <dmoncayo <at> gmail.com>
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 02:57:53 +0200
> (FWIW, yes, I'm aware that your awk script and uniq don't do the same
> thing, but I think what Dani requested was in fact removing
> consecutive duplicates...)

I wonder why only consecutive duplicates?  The existing functions
`delete-duplicates' and `delete-dups' that operate on lists
don't delete just consecutive duplicates.  They delete all duplicates.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 01:06:02 GMT) Full text and rfc822 format available.

Message #29 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: 13032 <at> debbugs.gnu.org, Dani Moncayo <dmoncayo <at> gmail.com>
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 02:02:36 +0100
On Fri, Nov 30, 2012 at 1:57 AM, Juri Linkov <juri <at> jurta.org> wrote:

> I wonder why only consecutive duplicates?  The existing functions
> `delete-duplicates' and `delete-dups' that operate on lists
> don't delete just consecutive duplicates.  They delete all duplicates.

Yes. Dani has not said what his use case is.

    Juanma




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 01:41:02 GMT) Full text and rfc822 format available.

Message #32 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Juanma Barranquero <lekktu <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org, Dani Moncayo <dmoncayo <at> gmail.com>
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 03:12:05 +0200
>>   C-u M-| awk -- '!a[$0]++' RET
>
> Isn't
>
>   C-u M-| uniq RET
>
> shorter and easier to type?

I use `uniq' only on files where lines are sorted.  OTOH, something like
'!a[$0]++' that is not limited to consecutive duplicates is better for
files where lines are not sorted such as log files, etc.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 07:54:02 GMT) Full text and rfc822 format available.

Message #35 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: Juanma Barranquero <lekktu <at> gmail.com>, 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 08:51:34 +0100
>>>   C-u M-| awk -- '!a[$0]++' RET
>>
>> Isn't
>>
>>   C-u M-| uniq RET
>>
>> shorter and easier to type?
>
> I use `uniq' only on files where lines are sorted.  OTOH, something like
> '!a[$0]++' that is not limited to consecutive duplicates is better for
> files where lines are not sorted such as log files, etc.

My use cases usually involve compacting a collection of lines
gathered from several places.  So the compacting operation is normally
coupled with a sort operation.

Thus, the command provided by Juanma is good enough for these use
cases (I first do a `sort-lines' and then a `delete-duplicate-lines').

But I agree that it would be even better if `delete-duplicate-lines'
did TRT even when the lines are not sorted.  (I've just tested this
feature in MS-Excel, and it is so: it doesn't require that the lines
be sorted beforehand.)

Thank you.

-- 
Dani Moncayo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Fri, 30 Nov 2012 07:54:02 GMT) Full text and rfc822 format available.

Message #38 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Fri, 30 Nov 2012 08:51:40 +0100
> This is what I currently use to delete duplicate lines:
>
>   C-u M-| awk -- '!a[$0]++' RET
>
> Do you intend to create a Lisp function with the same result?

I don't know awk, but I've tried that command and it seems to do what I
want: remove all duplicate lines in the region.  However, it doesn't
report the number of lines deleted, which is important to me.
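
On the missing count: a variation of the same awk idiom can report how many lines would be dropped. The following is a standalone sketch on made-up data, with illustrative variable names; it is not wired into Emacs or its echo area, and it only borrows the message wording from the original request:

```shell
# Sample region contents: "x" appears three times, "y" twice.
region=$(printf 'x\ny\nx\nz\ny\nx\n')

# Without the "!", the pattern a[$0]++ is true exactly for lines that were
# already seen, so n counts the lines the dedup pass would delete.
# "n + 0" forces a numeric 0 when there are no duplicates at all.
removed=$(printf '%s\n' "$region" | awk 'a[$0]++ { n++ } END { print n + 0 }')

printf 'Duplicate lines removed: %d\n' "$removed"
```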

-- 
Dani Moncayo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Sat, 01 Dec 2012 00:42:02 GMT) Full text and rfc822 format available.

Message #41 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Dani Moncayo <dmoncayo <at> gmail.com>
Cc: Juanma Barranquero <lekktu <at> gmail.com>, 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Sat, 01 Dec 2012 02:34:41 +0200
>>>>   C-u M-| awk -- '!a[$0]++' RET
>
> But I agree that it would be even better if `delete-duplicate-lines'
> did TRT even when the lines are not sorted.  (I've just tested this
> feature in MS-Excel, and it is so: it doesn't require that the lines
> be sorted beforehand.)

Actually I use a slightly different command:

   C-u M-| tac | awk -- '!a[$0]++' | tac RET

because I need to keep the last duplicate line instead of the first.
`tac' reverses the lines, removes the duplicates keeping the first duplicate,
and another `tac' reverses lines back thus keeping the last duplicate.
So for `delete-duplicate-lines' to be useful in this case it could support
also the reverse search that keeps the last duplicate.

You can see this limitation described in docstrings of various functions at
http://emacswiki.org/emacs/DuplicateLines
as "keeping first occurrence", so these functions are of no help.
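
The keep-first versus keep-last distinction can be checked on a tiny input. This is a sketch with made-up data and variable names; it assumes `tac' is installed (it is part of GNU coreutils and may be missing on other systems):

```shell
# "a" occurs on lines 1 and 3.
lines=$(printf 'a\nb\na\nc\n')

# Plain idiom: the first "a" is kept, the later one is dropped.
keep_first=$(printf '%s\n' "$lines" | awk -- '!a[$0]++')

# Reverse, dedup, reverse back: the "first" occurrence seen by awk is the
# last one in the original order, so the later "a" is the one that survives.
keep_last=$(printf '%s\n' "$lines" | tac | awk -- '!a[$0]++' | tac)

printf 'keep first:\n%s\nkeep last:\n%s\n' "$keep_first" "$keep_last"
```

Both results contain the same three lines, but in a different order: a, b, c when the first occurrence is kept, and b, a, c when the last one is kept.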

Adding an argument to keep either the first/last duplicate and an argument
to delete only adjacent lines, and using the algorithm like in awk,
and using the calling interface like in `flush-lines', necessitates
the following small function that can be called with the arg `C-u'
to keep the last duplicate line, and `C-u C-u' to delete only adjacent lines:

(defun delete-duplicate-lines (rstart rend &optional reverse adjacent interactive)
  "Delete duplicate lines in the region between RSTART and REND.
If REVERSE is nil, search and delete duplicates forward keeping the first
occurrence of duplicate lines.  If REVERSE is non-nil, search and delete
duplicates backward keeping the last occurrence of duplicate lines.
If ADJACENT is non-nil, delete repeated lines only if they are adjacent."
  (interactive
   (progn
     (barf-if-buffer-read-only)
     (list (region-beginning) (region-end)
           (equal current-prefix-arg '(4))
           (equal current-prefix-arg '(16))
           t)))
  (let ((lines (unless adjacent (make-hash-table :weakness 'key :test 'equal)))
        line prev-line
        (count 0)
        (rstart (copy-marker rstart))
        (rend (copy-marker rend)))
    (save-excursion
      (goto-char (if reverse rend rstart))
      (if (and reverse (bolp)) (forward-char -1))
      (while (if reverse
                 (and (> (point) rstart) (not (bobp)))
               (and (< (point) rend) (not (eobp))))
        (setq line (buffer-substring-no-properties
                    (line-beginning-position) (line-end-position)))
        (if (if adjacent (equal line prev-line) (gethash line lines))
            (progn
              (delete-region (progn (forward-line 0) (point))
                             (progn (forward-line 1) (point)))
              (if reverse (forward-line -1))
              (setq count (1+ count)))
          (if adjacent (setq prev-line line) (puthash line t lines))
          (forward-line (if reverse -1 1)))))
    (set-marker rstart nil)
    (set-marker rend nil)
    (when interactive
      (message "Deleted %d %sduplicate line%s%s"
               count
               (if adjacent "adjacent " "")
               (if (= count 1) "" "s")
               (if reverse " backward " "")))
    count))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Sat, 01 Dec 2012 09:12:02 GMT) Full text and rfc822 format available.

Message #44 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: Juanma Barranquero <lekktu <at> gmail.com>, 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Sat, 1 Dec 2012 10:08:49 +0100
> (defun delete-duplicate-lines (rstart rend &optional reverse adjacent interactive)
>   "Delete duplicate lines in the region between RSTART and REND.
> If REVERSE is nil, search and delete duplicates forward keeping the first
> occurrence of duplicate lines.  If REVERSE is non-nil, search and delete
> duplicates backward keeping the last occurrence of duplicate lines.
> If ADJACENT is non-nil, delete repeated lines only if they are adjacent."

Looks pretty fine to me.  Your version is more general and versatile.

Some comments:
* Why is the INTERACTIVE command needed?  I mean, can't that info
(whether the function has been called interactively) be retrieved
using some Lisp primitive?
* In case the INTERACTIVE command is indeed necessary, it should be
explained in the docstring, no?
* I think that the docstring should explain also the return value
(number of duplicate lines deleted).

Thank you Juri.  I hope Stefan or Chong add this feature to Emacs.

-- 
Dani Moncayo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Sat, 01 Dec 2012 09:25:01 GMT) Full text and rfc822 format available.

Message #47 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: Juanma Barranquero <lekktu <at> gmail.com>, 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Sat, 1 Dec 2012 10:22:00 +0100
>> (defun delete-duplicate-lines (rstart rend &optional reverse adjacent interactive)
>>   "Delete duplicate lines in the region between RSTART and REND.
>> If REVERSE is nil, search and delete duplicates forward keeping the first
>> occurrence of duplicate lines.  If REVERSE is non-nil, search and delete
>> duplicates backward keeping the last occurrence of duplicate lines.
>> If ADJACENT is non-nil, delete repeated lines only if they are adjacent."
>
> Looks pretty fine to me.  Your version is more general and versatile.
>
> Some comments:
> * Why is the INTERACTIVE command needed?  I mean, can't that info
> (whether the function has been called interactively) be retrieved
> using some Lisp primitive?
> * In case the INTERACTIVE command is indeed necessary, it should be
> explained in the docstring, no?
> * I think that the docstring should explain also the return value
> (number of duplicate lines deleted).

Sorry, replace "command" by "argument" in the above paragraph.

Another comment:
* I'm thinking that the ADJACENT argument is kinda unnecessary.  I
can't think of a use-case where someone wants to remove only the
_adjacent_ duplicate lines but not the ones which aren't adjacent.
So, I think that both the interface and the implementation could be
simplified by removing that argument.

-- 
Dani Moncayo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Sun, 02 Dec 2012 00:50:02 GMT) Full text and rfc822 format available.

Message #50 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Dani Moncayo <dmoncayo <at> gmail.com>
Cc: Juanma Barranquero <lekktu <at> gmail.com>, 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Sun, 02 Dec 2012 02:45:44 +0200
> * I'm thinking that the ADJACENT argument is kinda unnecessary.  I
> can't think of a use-case where someone wants to remove only the
> _adjacent_ duplicate lines but not the ones which aren't adjacent.
> So, I think that both the interface and the implementation could be
> simplified by removing that argument.

The ADJACENT argument is an optimization that doesn't require
additional memory (to store previous lines in the cache).
This is necessary when the user needs to delete duplicate lines
in a large sorted file.

> * Why is the INTERACTIVE argument needed?  I mean, can't that info
> (whether the function has been called interactively) be retrieved
> using some Lisp primitive?

There is called-interactively-p but as I understood, it is unreliable.
This is why other similar commands like `flush-lines', `keep-lines',
`how-many' use the INTERACTIVE argument.  They use it for two purposes:
to decide whether the active region should be used, and to decide whether
the message should be displayed when called interactively.

> * In case the INTERACTIVE argument is indeed necessary, it should be
> explained in the docstring, no?

Yes, below I copied this part from the docstring of `how-many'.

> * I think that the docstring should explain also the return value
> (number of duplicate lines deleted).

Coincidentally, the return value will be explained in the same part
of the docstring.

The remaining problem is deciding where to put this command.
The file replace.el is unsuitable because unlike `flush-lines' and
unlike `how-many', `delete-duplicate-lines' doesn't use regexps.

It seems the right place is sort.el because it also contains a related
command `reverse-region'.  This patch puts `delete-duplicate-lines'
after `reverse-region' at the end of sort.el:

=== modified file 'lisp/sort.el'
--- lisp/sort.el	2012-08-03 08:15:24 +0000
+++ lisp/sort.el	2012-12-02 00:44:42 +0000
@@ -562,6 +562,59 @@ (defun reverse-region (beg end)
 	(setq ll (cdr ll)))
       (insert (car ll)))))
 
+;;;###autoload
+(defun delete-duplicate-lines (rstart rend &optional reverse adjacent interactive)
+  "Delete duplicate lines in the region between RSTART and REND.
+
+If REVERSE is nil, search and delete duplicates forward keeping the first
+occurrence of duplicate lines.  If REVERSE is non-nil (when called
+interactively with C-u prefix), search and delete duplicates backward
+keeping the last occurrence of duplicate lines.
+
+If ADJACENT is non-nil (when called interactively with two C-u prefixes),
+delete repeated lines only if they are adjacent.
+
+When called from Lisp and INTERACTIVE is omitted or nil, return the number
+of deleted duplicate lines, do not print it; if INTERACTIVE is t, the
+function behaves in all respects as if it had been called interactively."
+  (interactive
+   (progn
+     (barf-if-buffer-read-only)
+     (list (region-beginning) (region-end)
+	   (equal current-prefix-arg '(4))
+	   (equal current-prefix-arg '(16))
+	   t)))
+  (let ((lines (unless adjacent (make-hash-table :weakness 'key :test 'equal)))
+	line prev-line
+	(count 0)
+	(rstart (copy-marker rstart))
+	(rend (copy-marker rend)))
+    (save-excursion
+      (goto-char (if reverse rend rstart))
+      (if (and reverse (bolp)) (forward-char -1))
+      (while (if reverse
+		 (and (> (point) rstart) (not (bobp)))
+	       (and (< (point) rend) (not (eobp))))
+	(setq line (buffer-substring-no-properties
+		    (line-beginning-position) (line-end-position)))
+	(if (if adjacent (equal line prev-line) (gethash line lines))
+	    (progn
+	      (delete-region (progn (forward-line 0) (point))
+			     (progn (forward-line 1) (point)))
+	      (if reverse (forward-line -1))
+	      (setq count (1+ count)))
+	  (if adjacent (setq prev-line line) (puthash line t lines))
+	  (forward-line (if reverse -1 1)))))
+    (set-marker rstart nil)
+    (set-marker rend nil)
+    (when interactive
+      (message "Deleted %d %sduplicate line%s%s"
+	       count
+	       (if adjacent "adjacent " "")
+	       (if (= count 1) "" "s")
+	       (if reverse " backward " "")))
+    count))
+
 (provide 'sort)
 
 ;;; sort.el ends here





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Sun, 02 Dec 2012 09:17:01 GMT) Full text and rfc822 format available.

Message #53 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: Juanma Barranquero <lekktu <at> gmail.com>, 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Sun, 2 Dec 2012 10:13:55 +0100
>> * I'm thinking that the ADJACENT argument is kinda unnecessary.  I
>> can't think of a use-case where someone wants to remove only the
>> _adjacent_ duplicate lines but not the ones which aren't adjacent.
>> So, I think that both the interface and the implementation could be
>> simplified by removing that argument.
>
> The ADJACENT argument is an optimization that doesn't require
> additional memory (to store previous lines in the cache).
> This is necessary when the user needs to delete duplicate lines
> in a large sorted file.

Ah, good point.  I guess that the optimization is twofold: in memory
and also in performance.  Then, IMO this should be explained in the
docstring, so that users know that they should use this feature when
running this command over a large chunk of lines.

Thank you.

-- 
Dani Moncayo




Reply sent to Juri Linkov <juri <at> jurta.org>:
You have taken responsibility. (Mon, 03 Dec 2012 23:53:01 GMT) Full text and rfc822 format available.

Notification sent to Dani Moncayo <dmoncayo <at> gmail.com>:
bug acknowledged by developer. (Mon, 03 Dec 2012 23:53:02 GMT) Full text and rfc822 format available.

Message #58 received at 13032-done <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Dani Moncayo <dmoncayo <at> gmail.com>
Cc: Juanma Barranquero <lekktu <at> gmail.com>, 13032-done <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Tue, 04 Dec 2012 01:49:29 +0200
>> The ADJACENT argument is an optimization that doesn't require
>> additional memory (to store previous lines in the cache).
>> This is necessary when the user needs to delete duplicate lines
>> in a large sorted file.
>
> Ah, good point.  I guess that the optimization is twofold: in memory
> and also in performance.  Then, IMO this should be explained in the
> docstring, so that users know that they should use this feature when
> running this command over a large chunk of lines.

Thanks for the suggestion, I added this as well.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Tue, 04 Dec 2012 00:12:02 GMT) Full text and rfc822 format available.

Message #61 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: 13032 <at> debbugs.gnu.org
Cc: dmoncayo <at> gmail.com
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Tue, 04 Dec 2012 02:05:03 +0200
>>> The ADJACENT argument is an optimization that doesn't require
>>> additional memory (to store previous lines in the cache).
>>> This is necessary when the user needs to delete duplicate lines
>>> in a large sorted file.
>>
>> Ah, good point.  I guess that the optimization is twofold: in memory
>> and also in performance.  Then, IMO this should be explained in the
>> docstring, so that users know that they should use this feature when
>> running this command over a large chunk of lines.
>
> Thanks for the suggestion, I added this as well.

It just occurred to me that we could also add an alias `uniq' that will
call the command `delete-duplicate-lines' with non-nil ADJACENT arg.

We already have aliases like `mkdir' for `make-directory',
so the command `uniq' would be handy too.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Tue, 04 Dec 2012 07:08:02 GMT) Full text and rfc822 format available.

Message #64 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Thierry Volpiatto <thierry.volpiatto <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Tue, 04 Dec 2012 08:04:13 +0100
Hi, just for info, here is a simple and fast version.

Dani Moncayo <dmoncayo <at> gmail.com> writes:

>> This is what I currently use to delete duplicate lines:
>>
>>   C-u M-| awk -- '!a[$0]++' RET
>>
>> Do you intend to create a Lisp function with the same result?
>
> I don't know awk, but I've tried that command and it seems to do what I
> want: remove all duplicate lines in the region.  However, it doesn't
> report the number of lines deleted, which is important to me.


--8<---------------cut here---------------start------------->8---
(defun delete-duplicate-lines (beg end)
  "Delete duplicate lines in region."
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region beg end)
      (let ((lines (helm-fast-remove-dups
                    (split-string (buffer-string) "\n" t)
                    :test 'equal)))
        (delete-region (point-min) (point-max))
        (loop for l in lines do (insert (concat l "\n")))))))
--8<---------------cut here---------------end--------------->8---

helm-fast-remove-dups is a function in helm:
https://github.com/emacs-helm/helm/blob/master/helm-utils.el
line 342

It is easy to modify the function to report the number of lines
removed.

-- 
  Thierry
Get my Gnupg key:
gpg --keyserver pgp.mit.edu --recv-keys 59F29997 





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Tue, 04 Dec 2012 09:17:02 GMT) Full text and rfc822 format available.

Message #67 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Tue, 4 Dec 2012 10:13:43 +0100
> It just occurred to me that we could also add an alias `uniq' that will
> call the command `delete-duplicate-lines' with non-nil ADJACENT arg.
>
> We already have aliases like `mkdir' for `make-directory',
> so the command `uniq' would be handy too.

Fine with me.

BTW, I've just noticed that the command doesn't deactivate the mark
when there are no duplicate lines in the region.  Could that be fixed?

Thank you.

-- 
Dani Moncayo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Tue, 04 Dec 2012 14:50:02 GMT) Full text and rfc822 format available.

Message #70 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Thierry Volpiatto <thierry.volpiatto <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Tue, 04 Dec 2012 09:46:33 -0500
>       (let ((lines (helm-fast-remove-dups
>                     (split-string (buffer-string) "\n" t)
>                     :test 'equal)))
>         (delete-region (point-min) (point-max))
>         (loop for l in lines do (insert (concat l "\n")))))))

The drawback of this version is that any overlays/markers will
be lost, and the buffer will be marked as modified even if there were no
duplicate lines.
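
For illustration only, here is a minimal sketch of an in-place approach
that avoids both problems: it deletes just the repeated lines, so markers
and overlays in the remaining text survive, and it leaves the buffer
untouched when nothing is duplicated.  The name
`my-delete-duplicate-lines-in-place' is hypothetical, and this is not
the code that was installed in Emacs.  Blank lines are treated like any
other line.

```elisp
;; Sketch only: hypothetical name, not the implementation in sort.el.
(defun my-delete-duplicate-lines-in-place (beg end)
  "Delete later copies of repeated lines between BEG and END.
Return the number of lines deleted."
  (interactive "*r")
  (let ((seen (make-hash-table :test 'equal))
        ;; Use a marker for END so it stays valid while text is deleted.
        (end (copy-marker end))
        (count 0))
    (save-excursion
      (goto-char beg)
      (while (< (point) end)
        (let ((line (buffer-substring-no-properties
                     (point) (min end (line-end-position)))))
          (if (gethash line seen)
              (progn
                ;; Remove the line and its newline; point is then at the
                ;; start of the following line, so do not move forward.
                (delete-region (point)
                               (min end (line-beginning-position 2)))
                (setq count (1+ count)))
            (puthash line t seen)
            (forward-line 1)))))
    (set-marker end nil)
    (message "Duplicate lines removed: %d" count)
    count))
```

Since `delete-region' is only called when a duplicate is found, a region
without duplicates leaves the buffer's modified flag alone.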


        Stefan




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Tue, 04 Dec 2012 15:06:01 GMT) Full text and rfc822 format available.

Message #73 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Thierry Volpiatto <thierry.volpiatto <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Tue, 04 Dec 2012 16:02:17 +0100

Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

>>       (let ((lines (helm-fast-remove-dups
>>                     (split-string (buffer-string) "\n" t)
>>                     :test 'equal)))
>>         (delete-region (point-min) (point-max))
>>         (loop for l in lines do (insert (concat l "\n")))))))
>
> The drawback of this version is that any overlays/markers will
> be lost, and the buffer will be marked as modified even if there were no
> duplicate lines.

OK, it was just for information: a fast alternative without such
enhancements.

-- 
  Thierry
Get my Gnupg key:
gpg --keyserver pgp.mit.edu --recv-keys 59F29997 




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Wed, 05 Dec 2012 00:09:02 GMT) Full text and rfc822 format available.

Message #76 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Dani Moncayo <dmoncayo <at> gmail.com>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Wed, 05 Dec 2012 01:51:47 +0200

>> It just occurred to me that we could also add an alias `uniq' that will
>> call the command `delete-duplicate-lines' with non-nil ADJACENT arg.
>>
>> We already have aliases like `mkdir' for `make-directory',
>> so the command `uniq' would be handy too.
>
> Fine with me.

But the problem is that `uniq' might be confused with a similarly named
feature `uniquify' that uniquifies buffer names.
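
For anyone who wants the shortcut locally, a small wrapper under an
unambiguous name would do.  This is only a sketch: the name
`my-delete-adjacent-duplicate-lines' is hypothetical, and it assumes
an Emacs that already provides `delete-duplicate-lines' with the
argument order BEG END REVERSE ADJACENT.

```elisp
;; Sketch only: hypothetical name, chosen to avoid confusion with `uniquify'.
(defun my-delete-adjacent-duplicate-lines (beg end)
  "Delete adjacent duplicate lines between BEG and END, like uniq(1)."
  (interactive "*r")
  ;; The two optional arguments after END are REVERSE (nil) and ADJACENT (t).
  (delete-duplicate-lines beg end nil t))
```

With ADJACENT non-nil only consecutive repeats are removed, so a line
that reappears later after a different line is kept.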

> BTW, I've just noticed that the command doesn't deactivate the mark
> when there are no duplicate lines in the region.  Could that be fixed?

This problem is not specific to `delete-duplicate-lines'.
All similar functions like e.g. `delete-matching-lines',
`delete-non-matching-lines' and `delete-blank-lines'
behave the same way.
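
Until that is addressed in general, a user-level workaround is to wrap
the command and deactivate the mark explicitly.  A minimal sketch, with
the hypothetical name `my-delete-duplicate-lines-deactivating'; it is
not a proposed fix for the general problem.

```elisp
;; Sketch only: hypothetical wrapper, not part of Emacs.
(defun my-delete-duplicate-lines-deactivating (beg end)
  "Delete duplicate lines between BEG and END, then deactivate the mark."
  (interactive "*r")
  (delete-duplicate-lines beg end)
  ;; Deactivate the mark even when no line was deleted and the buffer
  ;; was therefore not modified.
  (deactivate-mark))
```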




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13032; Package emacs. (Wed, 05 Dec 2012 08:09:02 GMT) Full text and rfc822 format available.

Message #79 received at 13032 <at> debbugs.gnu.org (full text, mbox):

From: Dani Moncayo <dmoncayo <at> gmail.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: 13032 <at> debbugs.gnu.org
Subject: Re: bug#13032: 24.3.50;
	Request: Provide a `delete-duplicate-lines' command
Date: Wed, 5 Dec 2012 09:08:07 +0100

>>> It just occurred to me that we could also add an alias `uniq' that will
>>> call the command `delete-duplicate-lines' with non-nil ADJACENT arg.
>>>
>>> We already have aliases like `mkdir' for `make-directory',
>>> so the command `uniq' would be handy too.
>>
>> Fine with me.
>
> But the problem is that `uniq' might be confused with a similarly named
> feature `uniquify' that uniquifies buffer names.

Indeed.  That is the problem with using such ambiguous names.  FWIW, I
have no particular interest in this `uniq' alias.

>> BTW, I've just noticed that the command doesn't deactivate the mark
>> when there are no duplicate lines in the region.  Could that be fixed?
>
> This problem is not specific to `delete-duplicate-lines'.
> All similar functions like e.g. `delete-matching-lines',
> `delete-non-matching-lines' and `delete-blank-lines'
> behave the same way.

Indeed.  I filed bug #10056 because of this kind of problem.  I've
included these cases in that bug report.

Thank you.

-- 
Dani Moncayo




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 02 Jan 2013 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 12 years and 172 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.