From: Simen Heggestøyl
To: 55315@debbugs.gnu.org
Subject: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
Date: Sun, 08 May 2022 16:12:00 +0200
Message-ID: <87h760jeq7.fsf@simenheg@gmail.com>

Hi.

Attached is a proposed patch to csv-mode.el in GNU ELPA which adds CSV
separator guessing functionality to CSV mode.

It adds two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.

The idea is that `csv-guess-set-separator' can be useful to add to the
mode hook to have CSV mode guess and set the separator automatically
when visiting a buffer:

  (add-hook 'csv-mode-hook 'csv-guess-set-separator)

Been using it myself for the past weeks and have been happy with it so
far.

Attachment: 0001-Add-CSV-separator-guessing-functionality.patch

From 7414f7e17ede47c392ce8d401d28ef17513c10e7 Mon Sep 17 00:00:00 2001
From: Simen Heggestøyl
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality

Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.

`csv-guess-set-separator' can be useful to add to the mode hook to have
CSV mode guess and set the separator automatically when visiting a
buffer:

  (add-hook 'csv-mode-hook 'csv-guess-set-separator)

* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New
variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.

* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
---
 csv-mode-tests.el |  80 ++++++++++++++++++++-------
 csv-mode.el       | 138 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 188 insertions(+), 30 deletions(-)

diff --git a/csv-mode-tests.el b/csv-mode-tests.el
index 316dc4bb93..0caeab7d80 100644
--- a/csv-mode-tests.el
+++ b/csv-mode-tests.el
@@ -1,8 +1,8 @@
 ;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: t; -*-
 
-;; Copyright (C) 2020 Free Software Foundation, Inc
+;; Copyright (C) 2020-2022 Free Software Foundation, Inc
 
-;; Author: Simen Heggestøyl
+;; Author: Simen Heggestøyl
 ;; Keywords:
 
 ;; This program is free software; you can redistribute it and/or modify
@@ -28,83 +28,121 @@
 (require 'csv-mode)
 (eval-when-compile (require 'subr-x))
 
-(ert-deftest csv-mode-tests-end-of-field ()
+(ert-deftest csv-tests-end-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-end-of-field-with-quotes ()
+(ert-deftest csv-tests-end-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,\"b,b\""))))
 
-(ert-deftest csv-mode-tests-beginning-of-field ()
+(ert-deftest csv-tests-beginning-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "bbb"))
+    (should (equal (buffer-substring (point) (point-max)) "bbb"))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-beginning-of-field-with-quotes ()
+(ert-deftest csv-tests-beginning-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "\"b,b\""))
+    (should (equal (buffer-substring (point) (point-max)) "\"b,b\""))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,\"b,b\""))))
 
-(defun csv-mode-tests--align-fields (before after)
+(defun csv-tests--align-fields (before after)
   (with-temp-buffer
     (insert (string-join before "\n"))
     (csv-align-fields t (point-min) (point-max))
     (should (equal (buffer-string) (string-join after "\n")))))
 
-(ert-deftest csv-mode-tests-align-fields ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields ()
+  (csv-tests--align-fields
    '("aaa,bbb,ccc"
      "1,2,3")
    '("aaa, bbb, ccc"
      "1 , 2 , 3")))
 
-(ert-deftest csv-mode-tests-align-fields-with-quotes ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-with-quotes ()
+  (csv-tests--align-fields
    '("aaa,\"b,b\",ccc"
      "1,2,3")
    '("aaa, \"b,b\", ccc"
      "1 , 2 , 3")))
 
 ;; Bug#14053
-(ert-deftest csv-mode-tests-align-fields-double-quote-comma ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-double-quote-comma ()
+  (csv-tests--align-fields
    '("1,2,3"
      "a,\"b\"\"c,\",d")
    '("1, 2 , 3"
      "a, \"b\"\"c,\", d")))
 
+(defvar csv-tests--data
+  "1,4;Sun, 2022-04-10;4,12
+8;Mon, 2022-04-11;3,19
+3,2;Tue, 2022-04-12;1,00
+2;Wed, 2022-04-13;0,37
+9;Wed, 2022-04-13;0,37")
+
+(ert-deftest csv-tests-guess-separator ()
+  (should-not (csv-guess-separator ""))
+  (should (= (csv-guess-separator csv-tests--data 3) ?,))
+  (should (= (csv-guess-separator csv-tests--data) ?\;))
+  (should (= (csv-guess-separator csv-tests--data)
+             (csv-guess-separator csv-tests--data
+                                  (length csv-tests--data)))))
+
+(ert-deftest csv-tests-separator-candidates ()
+  (should-not (csv--separator-candidates ""))
+  (should-not (csv--separator-candidates csv-tests--data 0))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data 4) #'<)
+          '(?, ?\;)))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data) #'<)
+          '(?\s ?, ?- ?\;)))
+  (should
+   (equal
+    (sort (csv--separator-candidates csv-tests--data) #'<)
+    (sort (csv--separator-candidates csv-tests--data
+                                     (length csv-tests--data))
+          #'<))))
+
+(ert-deftest csv-tests-separator-score ()
+  (should (< (csv--separator-score ?, csv-tests--data)
+             (csv--separator-score ?\s csv-tests--data)
+             (csv--separator-score ?- csv-tests--data)))
+  (should (= (csv--separator-score ?- csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data)))
+  (should (= 0 (csv--separator-score ?\; csv-tests--data 0)))
+  (should (= (csv--separator-score ?\; csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data
+                                   (length csv-tests--data)))))
+
 (provide 'csv-mode-tests)
 ;;; csv-mode-tests.el ends here
diff --git a/csv-mode.el b/csv-mode.el
index 10ce166052..f31f0da1f5 100644
--- a/csv-mode.el
+++ b/csv-mode.el
@@ -1,11 +1,11 @@
 ;;; csv-mode.el --- Major mode for editing comma/char separated values -*- lexical-binding: t -*-
 
-;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc
+;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc
 
 ;; Author: "Francis J. Wright"
 ;; Maintainer: emacs-devel@gnu.org
 ;; Version: 1.19
-;; Package-Requires: ((emacs "24.1") (cl-lib "0.5"))
+;; Package-Requires: ((emacs "27.1") (cl-lib "0.5"))
 ;; Keywords: convenience
 
 ;; This package is free software; you can redistribute it and/or modify
@@ -119,7 +119,9 @@
 
 ;;; Code:
 
-(eval-when-compile (require 'cl-lib))
+(eval-when-compile
+  (require 'cl-lib)
+  (require 'subr-x))
 
 (defgroup CSV nil
   "Major mode for editing files of comma-separated value type."
@@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if that is required."
                      (error "%S is already a quote" x)))
                value)
          (custom-set-default variable value)
-         (setq csv-separator-chars (mapcar #'string-to-char value)
-               csv--skip-chars (apply #'concat "^\n" csv-separators)
-               csv-separator-regexp (apply #'concat `("[" ,@value "]"))
-               csv-font-lock-keywords
-               ;; NB: csv-separator-face variable evaluates to itself.
-               `((,csv-separator-regexp (0 'csv-separator-face))))))
+         (setq csv-separator-chars (mapcar #'string-to-char value))
+         (let ((quoted-value (mapcar #'regexp-quote value)))
+           (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
+           (setq csv-separator-regexp
+                 (apply #'concat `("[" ,@quoted-value "]"))))
+         (setq csv-font-lock-keywords
+               ;; NB: csv-separator-face variable evaluates to itself.
+               `((,csv-separator-regexp (0 'csv-separator-face))))))
 
 (defcustom csv-field-quotes '("\"")
   "Field quotes: a list of *single-character* strings.
@@ -368,6 +372,24 @@ It must be either a string or nil."
     (modify-syntax-entry ?\n ">" csv-mode-syntax-table))
   (setq csv-comment-start string))
 
+(defvar csv--set-separator-history nil)
+
+(defun csv-set-separator (sep)
+  "Set the CSV separator in the current buffer to SEP."
+  (interactive (list (read-char-from-minibuffer
+                      "Separator: " nil 'csv--set-separator-history)))
+  (when (and (boundp 'csv-field-quotes)
+             (member (string sep) csv-field-quotes))
+    (error "%c is already a quote" sep))
+  (setq-local csv-separators (list (string sep)))
+  (setq-local csv-separator-chars (list sep))
+  (let ((quoted-sep (regexp-quote (string sep))))
+    (setq-local csv--skip-chars (format "^\n%s" quoted-sep))
+    (setq-local csv-separator-regexp (format "[%s]" quoted-sep)))
+  (setq-local csv-font-lock-keywords
+              `((,csv-separator-regexp (0 'csv-separator-face))))
+  (font-lock-refresh-defaults))
+
 ;;;###autoload
 (add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode))
 
@@ -1728,6 +1750,104 @@ setting works better)."
       (jit-lock-unregister #'csv--jit-align)
       (csv--jit-unalign (point-min) (point-max))))
   (csv--header-flush))
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; Separator guessing
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(defvar csv--preferred-separators
+  '(?\t ?\s ?, ?: ?\;)
+  "Preferred separator characters in case of a tied score.")
+
+(defun csv-guess-set-separator ()
+  "Guess and set the CSV separator of the current buffer.
+
+Add it to the mode hook to have CSV mode guess and set the
+separator automatically when visiting a buffer:
+
+  (add-hook \\='csv-mode-hook \\='csv-guess-set-separator)"
+  (interactive)
+  (let ((sep (csv-guess-separator
+              (buffer-substring-no-properties
+               (point-min)
+               ;; We're probably only going to look at the first 2048
+               ;; or so chars, but take more than we probably need to
+               ;; minimize the chance of breaking the input in the
+               ;; middle of a (long) row.
+               (min 8192 (point-max)))
+              2048)))
+    (when sep
+      (csv-set-separator sep))))
+
+(defun csv-guess-separator (text &optional cutoff)
+  "Return a guess of which character is the CSV separator in TEXT."
+  (let ((best-separator nil)
+        (best-score 0))
+    (dolist (candidate (csv--separator-candidates text cutoff))
+      (let ((candidate-score
+             (csv--separator-score candidate text cutoff)))
+        (when (or (> candidate-score best-score)
+                  (and (= candidate-score best-score)
+                       (member candidate csv--preferred-separators)))
+          (setq best-separator candidate)
+          (setq best-score candidate-score))))
+    best-separator))
+
+(defun csv--separator-candidates (text &optional cutoff)
+  "Return a list of candidate CSV separators in TEXT.
+When CUTOFF is passed, look only at the first CUTOFF number of characters."
+  (let ((chars (make-hash-table)))
+    (dolist (c (string-to-list
+                (if cutoff
+                    (substring text 0 (min cutoff (length text)))
+                  text)))
+      (when (and (not (gethash c chars))
+                 (or (= c ?\t)
+                     (and (not (member c '(?. ?/ ?\" ?')))
+                          (not (member (get-char-code-property c 'general-category)
+                                       '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc Co))))))
+        (puthash c t chars)))
+    (hash-table-keys chars)))
+
+(defun csv--separator-score (separator text &optional cutoff)
+  "Return a score on how likely SEPARATOR is a separator in TEXT.
+
+When CUTOFF is passed, stop the calculation at the next whole
+line after having read CUTOFF number of characters.
+
+The scoring is based on the idea that most CSV data is tabular,
+i.e. separators should appear equally often on each line.
+Furthermore, more commonly appearing characters are scored higher
+than those who appear less often.
+
+Adapted from the paper \"Wrangling Messy CSV Files by Detecting
+Row and Type Patterns\" by Gerrit J.J. van den Burg , Alfredo
+Nazábal, and Charles Sutton: https://arxiv.org/abs/1811.11242."
+  (let ((groups
+         (with-temp-buffer
+           (csv-set-separator separator)
+           (save-excursion
+             (insert text))
+           (let ((groups (make-hash-table))
+                 (chars-read 0))
+             (while (and (/= (point) (point-max))
+                         (or (not cutoff)
+                             (< chars-read cutoff)))
+               (let* ((lep (line-end-position))
+                      (nfields (length (csv--collect-fields lep))))
+                 (cl-incf (gethash nfields groups 0))
+                 (cl-incf chars-read (- lep (point)))
+                 (goto-char (+ lep 1))))
+             groups)))
+        (sum 0))
+    (maphash
+     (lambda (length num)
+       (cl-incf sum (* num (/ (- length 1) (float length)))))
+     groups)
+    (let ((unique-groups (hash-table-count groups)))
+      (if (= 0 unique-groups)
+          0
+        (/ sum unique-groups)))))
 
 ;;; TSV support
 
-- 
2.35.1


From: Mattias Engdegård
To: Simen Heggestøyl
Cc: 55315@debbugs.gnu.org
Subject: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
Date: Sun, 8 May 2022 19:56:00 +0200
Message-Id: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org>
In-Reply-To: <87h760jeq7.fsf@simenheg@gmail.com>
References: <87h760jeq7.fsf@simenheg@gmail.com>

> +         (setq csv-separator-chars (mapcar #'string-to-char value))
> +         (let ((quoted-value (mapcar #'regexp-quote value)))
> +           (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
> +           (setq csv-separator-regexp
> +                 (apply #'concat `("[" ,@quoted-value "]"))))

`regexp-quote` produces a regexp from a string literal, but what goes
inside the square brackets is not a regexp -- the syntax rules are
different. More specifically, other characters are special, and
backslash does not quote anything.

To produce a regexp that matches one in a set of characters, try
rx-to-string or regexp-opt. For example,

  (setq csv-separator-regexp (rx-to-string `(or ,@csv-separator-chars) t))

The same applies to csv--skip-chars: this isn't a regexp either, but
uses yet another syntax so regexp-quote is inappropriate here too.
Easiest is to precede each char with a backslash since that always
yields a correctly quoted character: "ABC" -> "\\A\\B\\C".

This is not a judgement on the rest of the patch which may be fine for
all I know.

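[Editorial illustration, not part of the thread.] The two quoting rules described in the review above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `my-separator-regexp' and `my-skip-chars-spec' are hypothetical helper names invented for the example, not functions in csv-mode, and the separator list is made up. It uses `regexp-opt' for the regexp and a backslash before each character for the skip-chars string, as the review suggests.

```elisp
;; Sketch only: both helpers below are hypothetical names, not csv-mode API.

(defun my-separator-regexp (separators)
  "Return a regexp matching any one of SEPARATORS.
SEPARATORS is a list of single-character strings."
  ;; `regexp-opt' takes care of characters such as ^ ] and - that are
  ;; special inside a bracket expression, where a backslash would not help.
  (regexp-opt separators))

(defun my-skip-chars-spec (separators)
  "Return a `skip-chars-forward' string stopping at newline or SEPARATORS."
  ;; In skip-chars syntax a backslash quotes the following character,
  ;; so prefixing every separator with one is always safe.
  (concat "^\n" (mapconcat (lambda (s) (concat "\\" s)) separators "")))

;; Usage: "^" is a separator that naive concatenation would mishandle.
(let ((seps '("," "^")))
  (with-temp-buffer
    (insert "ab^cd,ef")
    (goto-char (point-min))
    ;; Move over "ab" and stop in front of the "^" separator.
    (skip-chars-forward (my-skip-chars-spec seps))
    (list (string-match-p (my-separator-regexp seps) "ab^cd")
          (point))))
```

With a separator such as "^", plain concatenation into "[" ... "]" would yield "[^]", which is why a dedicated constructor is the safer choice here.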
From: Simen Heggestøyl
To: Mattias Engdegård
Cc: 55315@debbugs.gnu.org
Subject: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
Date: Sun, 08 May 2022 21:31:12 +0200
Message-ID: <87mtfryg73.fsf@simenheg@gmail.com>
In-Reply-To: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org>
References: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org>

Mattias Engdegård writes:

>> +         (setq csv-separator-chars (mapcar #'string-to-char value))
>> +         (let ((quoted-value (mapcar #'regexp-quote value)))
>> +           (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
>> +           (setq csv-separator-regexp
>> +                 (apply #'concat `("[" ,@quoted-value "]"))))
>
> `regexp-quote` produces a regexp from a string literal, but what goes
> inside the square brackets is not a regexp -- the syntax rules are
> different. More specifically, other characters are special, and
> backslash does not quote anything.
>
> To produce a regexp that matches one in a set of characters, try
> rx-to-string or regexp-opt. For example,
>
>   (setq csv-separator-regexp (rx-to-string `(or ,@csv-separator-chars) t))
>
> The same applies to csv--skip-chars: this isn't a regexp either, but
> uses yet another syntax so regexp-quote is inappropriate here
> too. Easiest is to precede each char with a backslash since that
> always yields a correctly quoted character: "ABC" -> "\\A\\B\\C".
>
> This is not a judgement on the rest of the patch which may be fine
> for all I know.

Thanks Mattias. Does it look better in the updated patch attached?

Note that `csv--skip-chars' and `csv-separator-regexp' are set in two
different places in the patch, the first time from a list of strings in
`csv-separators', and the second time from a single character in
`csv-set-separator'. Am I right in thinking that the use of
`regexp-quote' in the `csv-set-separator' case gives the right result?

-- Simen

Attachment: 0001-Add-CSV-separator-guessing-functionality.patch

From e498ab88ffe8468d791a10c50b692a926a2341ea Mon Sep 17 00:00:00 2001
From: Simen Heggestøyl
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality

Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.

`csv-guess-set-separator' can be useful to add to the mode hook to have
CSV mode guess and set the separator automatically when visiting a
buffer:

  (add-hook 'csv-mode-hook 'csv-guess-set-separator)

* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New
variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.

* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
---
 csv-mode-tests.el |  80 ++++++++++++++++++++-------
 csv-mode.el       | 137 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 187 insertions(+), 30 deletions(-)

diff --git a/csv-mode-tests.el b/csv-mode-tests.el
index 316dc4bb93..0caeab7d80 100644
--- a/csv-mode-tests.el
+++ b/csv-mode-tests.el
@@ -1,8 +1,8 @@
 ;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: t; -*-
 
-;; Copyright (C) 2020 Free Software Foundation, Inc
+;; Copyright (C) 2020-2022 Free Software Foundation, Inc
 
-;; Author: Simen Heggestøyl
+;; Author: Simen Heggestøyl
 ;; Keywords:
 
 ;; This program is free software; you can redistribute it and/or modify
@@ -28,83 +28,121 @@
 (require 'csv-mode)
 (eval-when-compile (require 'subr-x))
 
-(ert-deftest csv-mode-tests-end-of-field ()
+(ert-deftest csv-tests-end-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-end-of-field-with-quotes ()
+(ert-deftest csv-tests-end-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,\"b,b\""))))
 
-(ert-deftest csv-mode-tests-beginning-of-field ()
+(ert-deftest csv-tests-beginning-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "bbb"))
+    (should (equal (buffer-substring (point) (point-max)) "bbb"))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-beginning-of-field-with-quotes ()
+(ert-deftest csv-tests-beginning-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "\"b,b\""))
+    (should (equal (buffer-substring (point) (point-max)) "\"b,b\""))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,\"b,b\""))))
 
-(defun csv-mode-tests--align-fields (before after)
+(defun csv-tests--align-fields (before after)
   (with-temp-buffer
     (insert (string-join before "\n"))
     (csv-align-fields t (point-min) (point-max))
     (should (equal (buffer-string) (string-join after "\n")))))
 
-(ert-deftest csv-mode-tests-align-fields ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields ()
+  (csv-tests--align-fields
    '("aaa,bbb,ccc"
      "1,2,3")
    '("aaa, bbb, ccc"
      "1 , 2 , 3")))
 
-(ert-deftest csv-mode-tests-align-fields-with-quotes ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-with-quotes ()
+  (csv-tests--align-fields
    '("aaa,\"b,b\",ccc"
      "1,2,3")
    '("aaa, \"b,b\", ccc"
      "1 , 2 , 3")))
 
 ;; Bug#14053
-(ert-deftest csv-mode-tests-align-fields-double-quote-comma ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-double-quote-comma ()
+  (csv-tests--align-fields
    '("1,2,3"
      "a,\"b\"\"c,\",d")
    '("1, 2 , 3"
      "a, \"b\"\"c,\", d")))
 
+(defvar csv-tests--data
+  "1,4;Sun, 2022-04-10;4,12
+8;Mon, 2022-04-11;3,19
+3,2;Tue, 2022-04-12;1,00
+2;Wed, 2022-04-13;0,37
+9;Wed, 2022-04-13;0,37")
+
+(ert-deftest csv-tests-guess-separator ()
+  (should-not (csv-guess-separator ""))
+  (should (= (csv-guess-separator csv-tests--data 3) ?,))
+  (should (= (csv-guess-separator csv-tests--data) ?\;))
+  (should (= (csv-guess-separator csv-tests--data)
+             (csv-guess-separator csv-tests--data
+                                  (length csv-tests--data)))))
+
+(ert-deftest csv-tests-separator-candidates ()
+  (should-not (csv--separator-candidates ""))
+  (should-not (csv--separator-candidates csv-tests--data 0))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data 4) #'<)
+          '(?, ?\;)))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data) #'<)
+          '(?\s ?, ?- ?\;)))
+  (should
+   (equal
+    (sort (csv--separator-candidates csv-tests--data) #'<)
+    (sort (csv--separator-candidates csv-tests--data
+                                     (length csv-tests--data))
+          #'<))))
+
+(ert-deftest csv-tests-separator-score ()
+  (should (< (csv--separator-score ?, csv-tests--data)
+             (csv--separator-score ?\s csv-tests--data)
+             (csv--separator-score ?- csv-tests--data)))
+  (should (= (csv--separator-score ?- csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data)))
+  (should (= 0 (csv--separator-score ?\; csv-tests--data 0)))
+  (should (= (csv--separator-score ?\; csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data
+                                   (length csv-tests--data)))))
+
 (provide 'csv-mode-tests)
 ;;; csv-mode-tests.el ends here
diff --git a/csv-mode.el b/csv-mode.el
index 10ce166052..9fd5fc8f10 100644
--- a/csv-mode.el
+++ b/csv-mode.el
@@ -1,11 +1,11 @@
 ;;; csv-mode.el --- Major mode for editing comma/char separated values -*- lexical-binding: t -*-
 
-;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc
+;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc
 
 ;; Author: "Francis J. Wright"
 ;; Maintainer: emacs-devel@gnu.org
 ;; Version: 1.19
-;; Package-Requires: ((emacs "24.1") (cl-lib "0.5"))
+;; Package-Requires: ((emacs "27.1") (cl-lib "0.5"))
 ;; Keywords: convenience
 
 ;; This package is free software; you can redistribute it and/or modify
@@ -119,7 +119,9 @@
 
 ;;; Code:
 
-(eval-when-compile (require 'cl-lib))
+(eval-when-compile
+  (require 'cl-lib)
+  (require 'subr-x))
 
 (defgroup CSV nil
   "Major mode for editing files of comma-separated value type."
@@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if that is required."
                      (error "%S is already a quote" x)))
                value)
          (custom-set-default variable value)
-         (setq csv-separator-chars (mapcar #'string-to-char value)
-               csv--skip-chars (apply #'concat "^\n" csv-separators)
-               csv-separator-regexp (apply #'concat `("[" ,@value "]"))
-               csv-font-lock-keywords
-               ;; NB: csv-separator-face variable evaluates to itself.
-               `((,csv-separator-regexp (0 'csv-separator-face))))))
+         (setq csv-separator-chars (mapcar #'string-to-char value))
+         (setq csv--skip-chars
+               (apply #'concat "^\n"
+                      (mapcar (lambda (s) (concat "\\" s)) value)))
+         (setq csv-separator-regexp (regexp-opt value))
+         (setq csv-font-lock-keywords
+               ;; NB: csv-separator-face variable evaluates to itself.
+               `((,csv-separator-regexp (0 'csv-separator-face))))))
 
 (defcustom csv-field-quotes '("\"")
   "Field quotes: a list of *single-character* strings.
@@ -368,6 +372,23 @@ It must be either a string or nil."
     (modify-syntax-entry ?\n ">" csv-mode-syntax-table))
   (setq csv-comment-start string))
 
+(defvar csv--set-separator-history nil)
+
+(defun csv-set-separator (sep)
+  "Set the CSV separator in the current buffer to SEP."
+  (interactive (list (read-char-from-minibuffer
+                      "Separator: " nil 'csv--set-separator-history)))
+  (when (and (boundp 'csv-field-quotes)
+             (member (string sep) csv-field-quotes))
+    (error "%c is already a quote" sep))
+  (setq-local csv-separators (list (string sep)))
+  (setq-local csv-separator-chars (list sep))
+  (setq-local csv--skip-chars (format "^\n%c" sep))
+  (setq-local csv-separator-regexp (regexp-quote (string sep)))
+  (setq-local csv-font-lock-keywords
+              `((,csv-separator-regexp (0 'csv-separator-face))))
+  (font-lock-refresh-defaults))
+
 ;;;###autoload
 (add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode))
 
@@ -1728,6 +1749,104 @@ setting works better)."
(jit-lock-unregister #'csv--jit-align) (csv--jit-unalign (point-min) (point-max)))) (csv--header-flush)) + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +;;; Separator guessing +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +(defvar csv--preferred-separators + '(?\t ?\s ?, ?: ?\;) + "Preferred separator characters in case of a tied score.") + +(defun csv-guess-set-separator () + "Guess and set the CSV separator of the current buffer. + +Add it to the mode hook to have CSV mode guess and set the +separator automatically when visiting a buffer: + + (add-hook \\=3D'csv-mode-hook \\=3D'csv-guess-set-separator)" + (interactive) + (let ((sep (csv-guess-separator + (buffer-substring-no-properties + (point-min) + ;; We're probably only going to look at the first 2048 + ;; or so chars, but take more than we probably need to + ;; minimize the chance of breaking the input in the + ;; middle of a (long) row. + (min 8192 (point-max))) + 2048))) + (when sep + (csv-set-separator sep)))) + +(defun csv-guess-separator (text &optional cutoff) + "Return a guess of which character is the CSV separator in TEXT." + (let ((best-separator nil) + (best-score 0)) + (dolist (candidate (csv--separator-candidates text cutoff)) + (let ((candidate-score + (csv--separator-score candidate text cutoff))) + (when (or (> candidate-score best-score) + (and (=3D candidate-score best-score) + (member candidate csv--preferred-separators))) + (setq best-separator candidate) + (setq best-score candidate-score)))) + best-separator)) + +(defun csv--separator-candidates (text &optional cutoff) + "Return a list of candidate CSV separators in TEXT. +When CUTOFF is passed, look only at the first CUTOFF number of characters." + (let ((chars (make-hash-table))) + (dolist (c (string-to-list + (if cutoff + (substring text 0 (min cutoff (length text))) + text))) + (when (and (not (gethash c chars)) + (or (=3D c ?\t) + (and (not (member c '(?. 
?/ ?\" ?'))) + (not (member (get-char-code-property c 'general-= category) + '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc = Co)))))) + (puthash c t chars))) + (hash-table-keys chars))) + +(defun csv--separator-score (separator text &optional cutoff) + "Return a score on how likely SEPARATOR is a separator in TEXT. + +When CUTOFF is passed, stop the calculation at the next whole +line after having read CUTOFF number of characters. + +The scoring is based on the idea that most CSV data is tabular, +i.e. separators should appear equally often on each line. +Furthermore, more commonly appearing characters are scored higher +than those who appear less often. + +Adapted from the paper \"Wrangling Messy CSV Files by Detecting +Row and Type Patterns\" by Gerrit J.J. van den Burg , Alfredo +Naz=C3=A1bal, and Charles Sutton: https://arxiv.org/abs/1811.11242." + (let ((groups + (with-temp-buffer + (csv-set-separator separator) + (save-excursion + (insert text)) + (let ((groups (make-hash-table)) + (chars-read 0)) + (while (and (/=3D (point) (point-max)) + (or (not cutoff) + (< chars-read cutoff))) + (let* ((lep (line-end-position)) + (nfields (length (csv--collect-fields lep)))) + (cl-incf (gethash nfields groups 0)) + (cl-incf chars-read (- lep (point))) + (goto-char (+ lep 1)))) + groups))) + (sum 0)) + (maphash + (lambda (length num) + (cl-incf sum (* num (/ (- length 1) (float length))))) + groups) + (let ((unique-groups (hash-table-count groups))) + (if (=3D 0 unique-groups) + 0 + (/ sum unique-groups))))) =20 ;;; TSV support =20 --=20 2.35.1 --=-=-=-- From unknown Mon Jun 23 15:02:05 2025 X-Loop: help-debbugs@gnu.org Subject: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing Resent-From: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Mon, 09 May 2022 09:38:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 55315 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: patch To: 
Simen Heggestøyl
Cc: 55315@debbugs.gnu.org
From: Mattias Engdegård
In-Reply-To: <87mtfryg73.fsf@simenheg@gmail.com>
Date: Mon, 9 May 2022 11:37:41 +0200
Message-Id: <288326F0-A0EA-4CCE-B1E7-C8184255B046@acm.org>
References: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org> <87mtfryg73.fsf@simenheg@gmail.com>
Content-Type: text/plain; charset=utf-8

On 8 May 2022 at 21:31, Simen Heggestøyl wrote:

> Am I right in thinking that the use of
> `regexp-quote' in the `csv-set-separator' case gives the right result?

Yes, I think so. `csv-set-separator` should probably escape the character in `csv--skip-chars`, however:

  (setq-local csv--skip-chars (format "^\n%c" sep))

should be

  (setq-local csv--skip-chars (format "^\n\\%c" sep))

I'm not sure if a separator can be chosen that needs escaping here, but better be safe; who knows how the code will be used.
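A minimal sketch of why the escaping matters in a skip-set string. The separator list ("," "-" ";") and the helper `my-skip-distance` are hypothetical and chosen only for illustration; they are not taken from the patch. With several separators concatenated unescaped, a "-" in the middle is read by `skip-chars-forward` as a range operator, so the set silently grows to include digits and other characters:

```elisp
;; Sketch only: the separator list and helper name are made-up examples.
(defun my-skip-distance (skip-set text)
  "Return how far `skip-chars-forward' moves over TEXT using SKIP-SET."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (skip-chars-forward skip-set)))

(let* ((separators '("," "-" ";"))
       ;; Unescaped: the middle "-" turns ",-;" into a character range
       ;; that also covers the digits 0-9.
       (unescaped (apply #'concat "^\n" separators))
       ;; Escaped: every separator is quoted, so "-" stays literal.
       (escaped (apply #'concat "^\n"
                       (mapcar (lambda (s) (concat "\\" s)) separators))))
  (list (my-skip-distance unescaped "abc123,x")   ; stops early, at "1"
        (my-skip-distance escaped "abc123,x")))   ; stops at the comma
;; => (3 6)
```

For a single separator, as in `csv-set-separator`, the backslash case is the likely one to matter: a trailing unquoted backslash would have nothing left to quote.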
From unknown Mon Jun 23 15:02:05 2025
Subject: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
From: Simen Heggestøyl
To: Mattias Engdegård
Cc: 55315@debbugs.gnu.org
Date: Mon, 09 May 2022 13:03:48 +0200
Message-ID: <87a6brarxn.fsf@runbox.com>
In-Reply-To: <288326F0-A0EA-4CCE-B1E7-C8184255B046@acm.org>
References: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org> <87mtfryg73.fsf@simenheg@gmail.com> <288326F0-A0EA-4CCE-B1E7-C8184255B046@acm.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8

Mattias Engdegård writes:

> On 8 May 2022 at 21:31, Simen Heggestøyl wrote:
>
>> Am I right in thinking that the use of
>> `regexp-quote' in the `csv-set-separator' case gives the right result?
>
> Yes, I think so. `csv-set-separator` should probably escape the
> character in `csv--skip-chars`, however:
>
>   (setq-local csv--skip-chars (format "^\n%c" sep))
>
> should be
>
>   (setq-local csv--skip-chars (format "^\n\\%c" sep))
>
> I'm not sure if a separator can be chosen that needs escaping here but
> better be safe; who knows how the code will be used.

Ah, thanks, I misread the docstring of `skip-chars-forward':

  (but not at the end of a range; quoting is never needed there)

I somehow misinterpreted that as quoting not being necessary at the end
of the string fed to `skip-chars-forward'.

Updated patch with your proposed fix attached.

--=-=-=
Content-Type: text/x-diff; charset=utf-8
Content-Disposition: attachment; filename=0001-Add-CSV-separator-guessing-functionality.patch
Content-Transfer-Encoding: quoted-printable

>From 872d7f08c47fa382ae18171a0806afa110de8fbe Mon Sep 17 00:00:00 2001
From: Simen Heggestøyl
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality

Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.

`csv-guess-set-separator' can be useful to add to the mode hook to have
CSV mode guess and set the separator automatically when visiting a
buffer:

  (add-hook 'csv-mode-hook 'csv-guess-set-separator)

* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New
variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.

* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
--- csv-mode-tests.el | 80 ++++++++++++++++++++------- csv-mode.el | 137 +++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 187 insertions(+), 30 deletions(-) diff --git a/csv-mode-tests.el b/csv-mode-tests.el index 316dc4bb93..0caeab7d80 100644 --- a/csv-mode-tests.el +++ b/csv-mode-tests.el @@ -1,8 +1,8 @@ ;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: = t; -*- =20 -;; Copyright (C) 2020 Free Software Foundation, Inc +;; Copyright (C) 2020-2022 Free Software Foundation, Inc =20 -;; Author: Simen Heggest=C3=B8yl +;; Author: Simen Heggest=C3=B8yl ;; Keywords: =20 ;; This program is free software; you can redistribute it and/or modify @@ -28,83 +28,121 @@ (require 'csv-mode) (eval-when-compile (require 'subr-x)) =20 -(ert-deftest csv-mode-tests-end-of-field () +(ert-deftest csv-tests-end-of-field () (with-temp-buffer (csv-mode) (insert "aaa,bbb") (goto-char (point-min)) (csv-end-of-field) - (should (equal (buffer-substring (point-min) (point)) - "aaa")) + (should (equal (buffer-substring (point-min) (point)) "aaa")) (forward-char) (csv-end-of-field) (should (equal (buffer-substring (point-min) (point)) "aaa,bbb")))) =20 -(ert-deftest csv-mode-tests-end-of-field-with-quotes () +(ert-deftest csv-tests-end-of-field-with-quotes () (with-temp-buffer (csv-mode) (insert "aaa,\"b,b\"") (goto-char (point-min)) (csv-end-of-field) - (should (equal (buffer-substring (point-min) (point)) - "aaa")) + (should (equal (buffer-substring (point-min) (point)) "aaa")) (forward-char) (csv-end-of-field) (should (equal (buffer-substring (point-min) (point)) "aaa,\"b,b\"")))) =20 -(ert-deftest csv-mode-tests-beginning-of-field () +(ert-deftest csv-tests-beginning-of-field () (with-temp-buffer (csv-mode) (insert "aaa,bbb") (csv-beginning-of-field) - (should (equal (buffer-substring (point) (point-max)) - "bbb")) + (should (equal (buffer-substring (point) (point-max)) "bbb")) (backward-char) (csv-beginning-of-field) (should (equal (buffer-substring (point) 
(point-max)) "aaa,bbb")))) =20 -(ert-deftest csv-mode-tests-beginning-of-field-with-quotes () +(ert-deftest csv-tests-beginning-of-field-with-quotes () (with-temp-buffer (csv-mode) (insert "aaa,\"b,b\"") (csv-beginning-of-field) - (should (equal (buffer-substring (point) (point-max)) - "\"b,b\"")) + (should (equal (buffer-substring (point) (point-max)) "\"b,b\"")) (backward-char) (csv-beginning-of-field) (should (equal (buffer-substring (point) (point-max)) "aaa,\"b,b\"")))) =20 -(defun csv-mode-tests--align-fields (before after) +(defun csv-tests--align-fields (before after) (with-temp-buffer (insert (string-join before "\n")) (csv-align-fields t (point-min) (point-max)) (should (equal (buffer-string) (string-join after "\n"))))) =20 -(ert-deftest csv-mode-tests-align-fields () - (csv-mode-tests--align-fields +(ert-deftest csv-tests-align-fields () + (csv-tests--align-fields '("aaa,bbb,ccc" "1,2,3") '("aaa, bbb, ccc" "1 , 2 , 3"))) =20 -(ert-deftest csv-mode-tests-align-fields-with-quotes () - (csv-mode-tests--align-fields +(ert-deftest csv-tests-align-fields-with-quotes () + (csv-tests--align-fields '("aaa,\"b,b\",ccc" "1,2,3") '("aaa, \"b,b\", ccc" "1 , 2 , 3"))) =20 ;; Bug#14053 -(ert-deftest csv-mode-tests-align-fields-double-quote-comma () - (csv-mode-tests--align-fields +(ert-deftest csv-tests-align-fields-double-quote-comma () + (csv-tests--align-fields '("1,2,3" "a,\"b\"\"c,\",d") '("1, 2 , 3" "a, \"b\"\"c,\", d"))) =20 +(defvar csv-tests--data + "1,4;Sun, 2022-04-10;4,12 +8;Mon, 2022-04-11;3,19 +3,2;Tue, 2022-04-12;1,00 +2;Wed, 2022-04-13;0,37 +9;Wed, 2022-04-13;0,37") + +(ert-deftest csv-tests-guess-separator () + (should-not (csv-guess-separator "")) + (should (=3D (csv-guess-separator csv-tests--data 3) ?,)) + (should (=3D (csv-guess-separator csv-tests--data) ?\;)) + (should (=3D (csv-guess-separator csv-tests--data) + (csv-guess-separator csv-tests--data + (length csv-tests--data))))) + +(ert-deftest csv-tests-separator-candidates () + (should-not 
(csv--separator-candidates "")) + (should-not (csv--separator-candidates csv-tests--data 0)) + (should + (equal (sort (csv--separator-candidates csv-tests--data 4) #'<) + '(?, ?\;))) + (should + (equal (sort (csv--separator-candidates csv-tests--data) #'<) + '(?\s ?, ?- ?\;))) + (should + (equal + (sort (csv--separator-candidates csv-tests--data) #'<) + (sort (csv--separator-candidates csv-tests--data + (length csv-tests--data)) + #'<)))) + +(ert-deftest csv-tests-separator-score () + (should (< (csv--separator-score ?, csv-tests--data) + (csv--separator-score ?\s csv-tests--data) + (csv--separator-score ?- csv-tests--data))) + (should (=3D (csv--separator-score ?- csv-tests--data) + (csv--separator-score ?\; csv-tests--data))) + (should (=3D 0 (csv--separator-score ?\; csv-tests--data 0))) + (should (=3D (csv--separator-score ?\; csv-tests--data) + (csv--separator-score ?\; csv-tests--data + (length csv-tests--data))))) + (provide 'csv-mode-tests) ;;; csv-mode-tests.el ends here diff --git a/csv-mode.el b/csv-mode.el index 10ce166052..b2a881dde2 100644 --- a/csv-mode.el +++ b/csv-mode.el @@ -1,11 +1,11 @@ ;;; csv-mode.el --- Major mode for editing comma/char separated values -*= - lexical-binding: t -*- =20 -;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc +;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc =20 ;; Author: "Francis J. Wright" ;; Maintainer: emacs-devel@gnu.org ;; Version: 1.19 -;; Package-Requires: ((emacs "24.1") (cl-lib "0.5")) +;; Package-Requires: ((emacs "27.1") (cl-lib "0.5")) ;; Keywords: convenience =20 ;; This package is free software; you can redistribute it and/or modify @@ -119,7 +119,9 @@ =20 ;;; Code: =20 -(eval-when-compile (require 'cl-lib)) +(eval-when-compile + (require 'cl-lib) + (require 'subr-x)) =20 (defgroup CSV nil "Major mode for editing files of comma-separated value type." @@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if tha= t is required." 
(error "%S is already a quote" x))) value) (custom-set-default variable value) - (setq csv-separator-chars (mapcar #'string-to-char value) - csv--skip-chars (apply #'concat "^\n" csv-separators) - csv-separator-regexp (apply #'concat `("[" ,@value "]")) - csv-font-lock-keywords - ;; NB: csv-separator-face variable evaluates to itself. - `((,csv-separator-regexp (0 'csv-separator-face)))))) + (setq csv-separator-chars (mapcar #'string-to-char value)) + (setq csv--skip-chars + (apply #'concat "^\n" + (mapcar (lambda (s) (concat "\\" s)) value))) + (setq csv-separator-regexp (regexp-opt value)) + (setq csv-font-lock-keywords + ;; NB: csv-separator-face variable evaluates to itself. + `((,csv-separator-regexp (0 'csv-separator-face)))))) =20 (defcustom csv-field-quotes '("\"") "Field quotes: a list of *single-character* strings. @@ -368,6 +372,23 @@ It must be either a string or nil." (modify-syntax-entry ?\n ">" csv-mode-syntax-table)) (setq csv-comment-start string)) =20 +(defvar csv--set-separator-history nil) + +(defun csv-set-separator (sep) + "Set the CSV separator in the current buffer to SEP." + (interactive (list (read-char-from-minibuffer + "Separator: " nil 'csv--set-separator-history))) + (when (and (boundp 'csv-field-quotes) + (member (string sep) csv-field-quotes)) + (error "%c is already a quote" sep)) + (setq-local csv-separators (list (string sep))) + (setq-local csv-separator-chars (list sep)) + (setq-local csv--skip-chars (format "^\n\\%c" sep)) + (setq-local csv-separator-regexp (regexp-quote (string sep))) + (setq-local csv-font-lock-keywords + `((,csv-separator-regexp (0 'csv-separator-face)))) + (font-lock-refresh-defaults)) + ;;;###autoload (add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode)) =20 @@ -1728,6 +1749,104 @@ setting works better)." 
(jit-lock-unregister #'csv--jit-align) (csv--jit-unalign (point-min) (point-max)))) (csv--header-flush)) + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +;;; Separator guessing +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +(defvar csv--preferred-separators + '(?\t ?\s ?, ?: ?\;) + "Preferred separator characters in case of a tied score.") + +(defun csv-guess-set-separator () + "Guess and set the CSV separator of the current buffer. + +Add it to the mode hook to have CSV mode guess and set the +separator automatically when visiting a buffer: + + (add-hook \\=3D'csv-mode-hook \\=3D'csv-guess-set-separator)" + (interactive) + (let ((sep (csv-guess-separator + (buffer-substring-no-properties + (point-min) + ;; We're probably only going to look at the first 2048 + ;; or so chars, but take more than we probably need to + ;; minimize the chance of breaking the input in the + ;; middle of a (long) row. + (min 8192 (point-max))) + 2048))) + (when sep + (csv-set-separator sep)))) + +(defun csv-guess-separator (text &optional cutoff) + "Return a guess of which character is the CSV separator in TEXT." + (let ((best-separator nil) + (best-score 0)) + (dolist (candidate (csv--separator-candidates text cutoff)) + (let ((candidate-score + (csv--separator-score candidate text cutoff))) + (when (or (> candidate-score best-score) + (and (=3D candidate-score best-score) + (member candidate csv--preferred-separators))) + (setq best-separator candidate) + (setq best-score candidate-score)))) + best-separator)) + +(defun csv--separator-candidates (text &optional cutoff) + "Return a list of candidate CSV separators in TEXT. +When CUTOFF is passed, look only at the first CUTOFF number of characters." + (let ((chars (make-hash-table))) + (dolist (c (string-to-list + (if cutoff + (substring text 0 (min cutoff (length text))) + text))) + (when (and (not (gethash c chars)) + (or (=3D c ?\t) + (and (not (member c '(?. 
?/ ?\" ?'))) + (not (member (get-char-code-property c 'general-= category) + '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc = Co)))))) + (puthash c t chars))) + (hash-table-keys chars))) + +(defun csv--separator-score (separator text &optional cutoff) + "Return a score on how likely SEPARATOR is a separator in TEXT. + +When CUTOFF is passed, stop the calculation at the next whole +line after having read CUTOFF number of characters. + +The scoring is based on the idea that most CSV data is tabular, +i.e. separators should appear equally often on each line. +Furthermore, more commonly appearing characters are scored higher +than those who appear less often. + +Adapted from the paper \"Wrangling Messy CSV Files by Detecting +Row and Type Patterns\" by Gerrit J.J. van den Burg , Alfredo +Naz=C3=A1bal, and Charles Sutton: https://arxiv.org/abs/1811.11242." + (let ((groups + (with-temp-buffer + (csv-set-separator separator) + (save-excursion + (insert text)) + (let ((groups (make-hash-table)) + (chars-read 0)) + (while (and (/=3D (point) (point-max)) + (or (not cutoff) + (< chars-read cutoff))) + (let* ((lep (line-end-position)) + (nfields (length (csv--collect-fields lep)))) + (cl-incf (gethash nfields groups 0)) + (cl-incf chars-read (- lep (point))) + (goto-char (+ lep 1)))) + groups))) + (sum 0)) + (maphash + (lambda (length num) + (cl-incf sum (* num (/ (- length 1) (float length))))) + groups) + (let ((unique-groups (hash-table-count groups))) + (if (=3D 0 unique-groups) + 0 + (/ sum unique-groups))))) =20 ;;; TSV support =20 --=20 2.35.1 --=-=-=-- From unknown Mon Jun 23 15:02:05 2025 X-Loop: help-debbugs@gnu.org Subject: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing Resent-From: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Mon, 09 May 2022 11:30:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 55315 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: patch To: 
Simen Heggestøyl
Cc: 55315@debbugs.gnu.org
From: Mattias Engdegård
In-Reply-To: <87a6brarxn.fsf@runbox.com>
Date: Mon, 9 May 2022 13:28:50 +0200
Message-Id: <62CF0937-E9D8-47C5-A65F-2EFFF2CDFB65@acm.org>
References: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org> <87mtfryg73.fsf@simenheg@gmail.com> <288326F0-A0EA-4CCE-B1E7-C8184255B046@acm.org> <87a6brarxn.fsf@runbox.com>
Content-Type: text/plain; charset=utf-8

On 9 May 2022 at 13:03, Simen Heggestøyl wrote:

> Updated patch with your proposed fix attached.

Thanks, looks fine with respect to the regexp and skip-set generation.
For the remainder of the patch (the vast bulk) you are probably more qualified to judge!

By the way, thanks for the reference to the CSV wrangling paper.
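The scoring idea described in the `csv--separator-score' docstring of the patch can be sketched outside CSV mode. What follows is an illustrative stand-alone version, not the patch's code: `my-separator-score` and `my-sample-data` are hypothetical names, lines are split naively with `split-string` rather than `csv--collect-fields` (so quoted fields are not handled), and the CUTOFF argument is left out. The sample text is the test data from the patch.

```elisp
(require 'cl-lib)

;; Sketch only: naive splitting, no quote handling, no CUTOFF.
(defun my-separator-score (separator text)
  "Score the character SEPARATOR as a field separator for TEXT.
Lines are grouped by their number of fields; each group of NLINES
lines with NFIELDS fields contributes NLINES * (NFIELDS - 1) / NFIELDS,
and the total is divided by the number of distinct groups."
  (let ((groups (make-hash-table))
        (sum 0))
    (dolist (line (split-string text "\n" t))
      (let ((nfields (length (split-string
                              line (regexp-quote (string separator))))))
        (cl-incf (gethash nfields groups 0))))
    (maphash (lambda (nfields nlines)
               (cl-incf sum (* nlines (/ (- nfields 1) (float nfields)))))
             groups)
    (if (zerop (hash-table-count groups))
        0
      (/ sum (hash-table-count groups)))))

(defvar my-sample-data
  (concat "1,4;Sun, 2022-04-10;4,12\n"
          "8;Mon, 2022-04-11;3,19\n"
          "3,2;Tue, 2022-04-12;1,00\n"
          "2;Wed, 2022-04-13;0,37\n"
          "9;Wed, 2022-04-13;0,37"))

;; Scores for comma, space, hyphen and semicolon, in that order.
(mapcar (lambda (c) (my-separator-score c my-sample-data))
        '(?, ?\s ?- ?\;))
```

On this data the comma scores lowest because its field count varies between lines, while the hyphen and the semicolon tie; a tie of that kind is what the `csv--preferred-separators' list in the patch is there to break.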
From unknown Mon Jun 23 15:02:05 2025
MIME-Version: 1.0
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Simen Heggestøyl
Subject: bug#55315: closed (Re: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing)
References: <87wneqed3v.fsf@simenheg@gmail.com> <87h760jeq7.fsf@simenheg@gmail.com>
Reply-To: 55315@debbugs.gnu.org
Date: Thu, 12 May 2022 20:00:02 +0000
Content-Type: multipart/mixed; boundary="----------=_1652385602-5031-1"

This is a multi-part message in MIME format...

------------=_1652385602-5031-1
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline

Your bug report

#55315: [elpa/csv-mode] [PATCH] CSV separator guessing

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 55315@debbugs.gnu.org.

-- 
55315: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=55315
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1652385602-5031-1
Content-Type: message/rfc822
Content-Disposition: inline

From: Simen Heggestøyl
To: Mattias Engdegård
Cc: 55315-done@debbugs.gnu.org
Subject: Re: bug#55315: [elpa/csv-mode] [PATCH] CSV separator guessing
References: <07E204D4-5FE4-4122-BB82-EBB2107C09E8@acm.org> <87mtfryg73.fsf@simenheg@gmail.com> <288326F0-A0EA-4CCE-B1E7-C8184255B046@acm.org> <87a6brarxn.fsf@runbox.com> <62CF0937-E9D8-47C5-A65F-2EFFF2CDFB65@acm.org>
Date: Thu, 12 May 2022 21:59:32 +0200
In-Reply-To: <62CF0937-E9D8-47C5-A65F-2EFFF2CDFB65@acm.org>
Message-ID: <87wneqed3v.fsf@simenheg@gmail.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Mattias Engdegård writes:

> On 9 May 2022 at 13:03, Simen Heggestøyl wrote:
>
>> Updated patch with your proposed fix attached.
>
> Thanks, looks fine with respect to the regexp and skip-set generation.

Good! Thanks for taking another look. I've merged the patch.

-- 
Simen

------------=_1652385602-5031-1
Content-Type: message/rfc822
Content-Disposition: inline

From: Simen Heggestøyl
To: bug-gnu-emacs@gnu.org
Subject: [elpa/csv-mode] [PATCH] CSV separator guessing
Date: Sun, 08 May 2022 16:12:00 +0200
Message-ID: <87h760jeq7.fsf@simenheg@gmail.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain

Hi.

Attached is a proposed patch to csv-mode.el in GNU ELPA which adds CSV
separator guessing functionality to CSV mode.

It adds two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.

The idea is that `csv-guess-set-separator' can be useful to add to the
mode hook to have CSV mode guess and set the separator automatically
when visiting a buffer:

  (add-hook 'csv-mode-hook 'csv-guess-set-separator)

Been using it myself for the past weeks and have been happy with it so
far.

--=-=-=
Content-Type: text/x-diff; charset=utf-8
Content-Disposition: attachment;
 filename=0001-Add-CSV-separator-guessing-functionality.patch

From 7414f7e17ede47c392ce8d401d28ef17513c10e7 Mon Sep 17 00:00:00 2001
From: Simen Heggestøyl
Date: Sun, 8 May 2022 16:01:35 +0200
Subject: [PATCH] Add CSV separator guessing functionality

Add two new commands: `csv-guess-set-separator' that automatically
guesses and sets the CSV separator of the current buffer, and
`csv-set-separator' for setting it manually.

`csv-guess-set-separator' can be useful to add to the mode hook to
have CSV mode guess and set the separator automatically when visiting
a buffer:

  (add-hook 'csv-mode-hook 'csv-guess-set-separator)

* csv-mode.el (csv-separators): Properly quote regexp values.
(csv--set-separator-history, csv--preferred-separators): New
variables.
(csv-set-separator, csv-guess-set-separator)
(csv-guess-separator, csv--separator-candidates)
(csv--separator-score): New functions.

* csv-mode-tests.el (csv-tests--data): New test data.
(csv-tests-guess-separator, csv-tests-separator-candidates)
(csv-tests-separator-score): New tests.
---
 csv-mode-tests.el |  80 ++++++++++++++++++++-------
 csv-mode.el       | 138 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 188 insertions(+), 30 deletions(-)

diff --git a/csv-mode-tests.el b/csv-mode-tests.el
index 316dc4bb93..0caeab7d80 100644
--- a/csv-mode-tests.el
+++ b/csv-mode-tests.el
@@ -1,8 +1,8 @@
 ;;; csv-mode-tests.el --- Tests for CSV mode -*- lexical-binding: t; -*-
 
-;; Copyright (C) 2020 Free Software Foundation, Inc
+;; Copyright (C) 2020-2022 Free Software Foundation, Inc
 
-;; Author: Simen Heggestøyl
+;; Author: Simen Heggestøyl
 ;; Keywords:
 
 ;; This program is free software; you can redistribute it and/or modify
@@ -28,83 +28,121 @@
 (require 'csv-mode)
 (eval-when-compile (require 'subr-x))
 
-(ert-deftest csv-mode-tests-end-of-field ()
+(ert-deftest csv-tests-end-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-end-of-field-with-quotes ()
+(ert-deftest csv-tests-end-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (goto-char (point-min))
     (csv-end-of-field)
-    (should (equal (buffer-substring (point-min) (point))
-                   "aaa"))
+    (should (equal (buffer-substring (point-min) (point)) "aaa"))
     (forward-char)
     (csv-end-of-field)
     (should (equal (buffer-substring (point-min) (point))
                    "aaa,\"b,b\""))))
 
-(ert-deftest csv-mode-tests-beginning-of-field ()
+(ert-deftest csv-tests-beginning-of-field ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,bbb")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "bbb"))
+    (should (equal (buffer-substring (point) (point-max)) "bbb"))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,bbb"))))
 
-(ert-deftest csv-mode-tests-beginning-of-field-with-quotes ()
+(ert-deftest csv-tests-beginning-of-field-with-quotes ()
   (with-temp-buffer
     (csv-mode)
     (insert "aaa,\"b,b\"")
     (csv-beginning-of-field)
-    (should (equal (buffer-substring (point) (point-max))
-                   "\"b,b\""))
+    (should (equal (buffer-substring (point) (point-max)) "\"b,b\""))
     (backward-char)
     (csv-beginning-of-field)
     (should (equal (buffer-substring (point) (point-max))
                    "aaa,\"b,b\""))))
 
-(defun csv-mode-tests--align-fields (before after)
+(defun csv-tests--align-fields (before after)
   (with-temp-buffer
     (insert (string-join before "\n"))
     (csv-align-fields t (point-min) (point-max))
     (should (equal (buffer-string) (string-join after "\n")))))
 
-(ert-deftest csv-mode-tests-align-fields ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields ()
+  (csv-tests--align-fields
    '("aaa,bbb,ccc"
      "1,2,3")
    '("aaa, bbb, ccc"
      "1  , 2  , 3")))
 
-(ert-deftest csv-mode-tests-align-fields-with-quotes ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-with-quotes ()
+  (csv-tests--align-fields
    '("aaa,\"b,b\",ccc"
      "1,2,3")
    '("aaa, \"b,b\", ccc"
      "1  , 2    , 3")))
 
 ;; Bug#14053
-(ert-deftest csv-mode-tests-align-fields-double-quote-comma ()
-  (csv-mode-tests--align-fields
+(ert-deftest csv-tests-align-fields-double-quote-comma ()
+  (csv-tests--align-fields
    '("1,2,3"
      "a,\"b\"\"c,\",d")
    '("1, 2      , 3"
      "a, \"b\"\"c,\", d")))
 
+(defvar csv-tests--data
+  "1,4;Sun, 2022-04-10;4,12
+8;Mon, 2022-04-11;3,19
+3,2;Tue, 2022-04-12;1,00
+2;Wed, 2022-04-13;0,37
+9;Wed, 2022-04-13;0,37")
+
+(ert-deftest csv-tests-guess-separator ()
+  (should-not (csv-guess-separator ""))
+  (should (= (csv-guess-separator csv-tests--data 3) ?,))
+  (should (= (csv-guess-separator csv-tests--data) ?\;))
+  (should (= (csv-guess-separator csv-tests--data)
+             (csv-guess-separator csv-tests--data
+                                  (length csv-tests--data)))))
+
+(ert-deftest csv-tests-separator-candidates ()
+  (should-not (csv--separator-candidates ""))
+  (should-not (csv--separator-candidates csv-tests--data 0))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data 4) #'<)
+          '(?, ?\;)))
+  (should
+   (equal (sort (csv--separator-candidates csv-tests--data) #'<)
+          '(?\s ?, ?- ?\;)))
+  (should
+   (equal
+    (sort (csv--separator-candidates csv-tests--data) #'<)
+    (sort (csv--separator-candidates csv-tests--data
+                                     (length csv-tests--data))
+          #'<))))
+
+(ert-deftest csv-tests-separator-score ()
+  (should (< (csv--separator-score ?, csv-tests--data)
+             (csv--separator-score ?\s csv-tests--data)
+             (csv--separator-score ?- csv-tests--data)))
+  (should (= (csv--separator-score ?- csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data)))
+  (should (= 0 (csv--separator-score ?\; csv-tests--data 0)))
+  (should (= (csv--separator-score ?\; csv-tests--data)
+             (csv--separator-score ?\; csv-tests--data
+                                   (length csv-tests--data)))))
+
 (provide 'csv-mode-tests)
 ;;; csv-mode-tests.el ends here
diff --git a/csv-mode.el b/csv-mode.el
index 10ce166052..f31f0da1f5 100644
--- a/csv-mode.el
+++ b/csv-mode.el
@@ -1,11 +1,11 @@
 ;;; csv-mode.el --- Major mode for editing comma/char separated values -*- lexical-binding: t -*-
 
-;; Copyright (C) 2003, 2004, 2012-2020 Free Software Foundation, Inc
+;; Copyright (C) 2003, 2004, 2012-2022 Free Software Foundation, Inc
 
 ;; Author: "Francis J. Wright"
 ;; Maintainer: emacs-devel@gnu.org
 ;; Version: 1.19
-;; Package-Requires: ((emacs "24.1") (cl-lib "0.5"))
+;; Package-Requires: ((emacs "27.1") (cl-lib "0.5"))
 ;; Keywords: convenience
 
 ;; This package is free software; you can redistribute it and/or modify
@@ -119,7 +119,9 @@
 
 ;;; Code:
 
-(eval-when-compile (require 'cl-lib))
+(eval-when-compile
+  (require 'cl-lib)
+  (require 'subr-x))
 
 (defgroup CSV nil
   "Major mode for editing files of comma-separated value type."
@@ -163,12 +165,14 @@ session. Use `customize-set-variable' instead if that is required."
                      (error "%S is already a quote" x)))
                value)
          (custom-set-default variable value)
-         (setq csv-separator-chars (mapcar #'string-to-char value)
-               csv--skip-chars (apply #'concat "^\n" csv-separators)
-               csv-separator-regexp (apply #'concat `("[" ,@value "]"))
-               csv-font-lock-keywords
-               ;; NB: csv-separator-face variable evaluates to itself.
-               `((,csv-separator-regexp (0 'csv-separator-face))))))
+         (setq csv-separator-chars (mapcar #'string-to-char value))
+         (let ((quoted-value (mapcar #'regexp-quote value)))
+           (setq csv--skip-chars (apply #'concat "^\n" quoted-value))
+           (setq csv-separator-regexp
+                 (apply #'concat `("[" ,@quoted-value "]"))))
+         (setq csv-font-lock-keywords
+               ;; NB: csv-separator-face variable evaluates to itself.
+               `((,csv-separator-regexp (0 'csv-separator-face))))))
 
 (defcustom csv-field-quotes '("\"")
   "Field quotes: a list of *single-character* strings.
@@ -368,6 +372,24 @@ It must be either a string or nil."
     (modify-syntax-entry ?\n ">" csv-mode-syntax-table))
   (setq csv-comment-start string))
 
+(defvar csv--set-separator-history nil)
+
+(defun csv-set-separator (sep)
+  "Set the CSV separator in the current buffer to SEP."
+  (interactive (list (read-char-from-minibuffer
+                      "Separator: " nil 'csv--set-separator-history)))
+  (when (and (boundp 'csv-field-quotes)
+             (member (string sep) csv-field-quotes))
+    (error "%c is already a quote" sep))
+  (setq-local csv-separators (list (string sep)))
+  (setq-local csv-separator-chars (list sep))
+  (let ((quoted-sep (regexp-quote (string sep))))
+    (setq-local csv--skip-chars (format "^\n%s" quoted-sep))
+    (setq-local csv-separator-regexp (format "[%s]" quoted-sep)))
+  (setq-local csv-font-lock-keywords
+              `((,csv-separator-regexp (0 'csv-separator-face))))
+  (font-lock-refresh-defaults))
+
 ;;;###autoload
 (add-to-list 'auto-mode-alist '("\\.[Cc][Ss][Vv]\\'" . csv-mode))
 
@@ -1728,6 +1750,104 @@ setting works better)."
       (jit-lock-unregister #'csv--jit-align)
       (csv--jit-unalign (point-min) (point-max))))
   (csv--header-flush))
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;; Separator guessing
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+(defvar csv--preferred-separators
+  '(?\t ?\s ?, ?: ?\;)
+  "Preferred separator characters in case of a tied score.")
+
+(defun csv-guess-set-separator ()
+  "Guess and set the CSV separator of the current buffer.
+
+Add it to the mode hook to have CSV mode guess and set the
+separator automatically when visiting a buffer:
+
+  (add-hook \\='csv-mode-hook \\='csv-guess-set-separator)"
+  (interactive)
+  (let ((sep (csv-guess-separator
+              (buffer-substring-no-properties
+               (point-min)
+               ;; We're probably only going to look at the first 2048
+               ;; or so chars, but take more than we probably need to
+               ;; minimize the chance of breaking the input in the
+               ;; middle of a (long) row.
+               (min 8192 (point-max)))
+              2048)))
+    (when sep
+      (csv-set-separator sep))))
+
+(defun csv-guess-separator (text &optional cutoff)
+  "Return a guess of which character is the CSV separator in TEXT."
+  (let ((best-separator nil)
+        (best-score 0))
+    (dolist (candidate (csv--separator-candidates text cutoff))
+      (let ((candidate-score
+             (csv--separator-score candidate text cutoff)))
+        (when (or (> candidate-score best-score)
+                  (and (= candidate-score best-score)
+                       (member candidate csv--preferred-separators)))
+          (setq best-separator candidate)
+          (setq best-score candidate-score))))
+    best-separator))
+
+(defun csv--separator-candidates (text &optional cutoff)
+  "Return a list of candidate CSV separators in TEXT.
+When CUTOFF is passed, look only at the first CUTOFF number of characters."
+  (let ((chars (make-hash-table)))
+    (dolist (c (string-to-list
+                (if cutoff
+                    (substring text 0 (min cutoff (length text)))
+                  text)))
+      (when (and (not (gethash c chars))
+                 (or (= c ?\t)
+                     (and (not (member c '(?. ?/ ?\" ?')))
+                          (not (member (get-char-code-property c 'general-category)
+                                       '(Lu Ll Lt Lm Lo Nd Nl No Ps Pe Cc Co))))))
+        (puthash c t chars)))
+    (hash-table-keys chars)))
+
+(defun csv--separator-score (separator text &optional cutoff)
+  "Return a score on how likely SEPARATOR is a separator in TEXT.
+
+When CUTOFF is passed, stop the calculation at the next whole
+line after having read CUTOFF number of characters.
+
+The scoring is based on the idea that most CSV data is tabular,
+i.e. separators should appear equally often on each line.
+Furthermore, more commonly appearing characters are scored higher
+than those who appear less often.
+
+Adapted from the paper \"Wrangling Messy CSV Files by Detecting
+Row and Type Patterns\" by Gerrit J.J. van den Burg , Alfredo
+Nazábal, and Charles Sutton: https://arxiv.org/abs/1811.11242."
+  (let ((groups
+         (with-temp-buffer
+           (csv-set-separator separator)
+           (save-excursion
+             (insert text))
+           (let ((groups (make-hash-table))
+                 (chars-read 0))
+             (while (and (/= (point) (point-max))
+                         (or (not cutoff)
+                             (< chars-read cutoff)))
+               (let* ((lep (line-end-position))
+                      (nfields (length (csv--collect-fields lep))))
+                 (cl-incf (gethash nfields groups 0))
+                 (cl-incf chars-read (- lep (point)))
+                 (goto-char (+ lep 1))))
+             groups)))
+        (sum 0))
+    (maphash
+     (lambda (length num)
+       (cl-incf sum (* num (/ (- length 1) (float length)))))
+     groups)
+    (let ((unique-groups (hash-table-count groups)))
+      (if (= 0 unique-groups)
+          0
+        (/ sum unique-groups)))))
 
 ;;; TSV support
 
-- 
2.35.1

--=-=-=--

------------=_1652385602-5031-1--
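
Editorial note on the scoring in `csv--separator-score': the arithmetic
can be tried outside CSV mode with a simplified sketch. This is an
illustration under stated assumptions, not the merged code:
`my-separator-score' and `my-sample-data' are hypothetical names,
fields are counted with `split-string' (so quoted fields are not
treated specially), and the CUTOFF argument is left out.

```elisp
(require 'cl-lib)

;; Same sample as `csv-tests--data' in the patch's tests.
(defvar my-sample-data
  "1,4;Sun, 2022-04-10;4,12
8;Mon, 2022-04-11;3,19
3,2;Tue, 2022-04-12;1,00
2;Wed, 2022-04-13;0,37
9;Wed, 2022-04-13;0,37")

(defun my-separator-score (separator text)
  "Score SEPARATOR (a character) as a CSV separator for TEXT.
Hypothetical simplification of `csv--separator-score': no quote
handling and no CUTOFF."
  (let ((groups (make-hash-table))
        (sum 0))
    ;; Group the lines by how many fields SEPARATOR splits them into.
    (dolist (line (split-string text "\n" t))
      (let ((nfields (length (split-string
                              line (regexp-quote (string separator))))))
        (cl-incf (gethash nfields groups 0))))
    ;; A group of NUM lines with LENGTH fields each contributes
    ;; NUM * (LENGTH - 1) / LENGTH to the sum.
    (maphash (lambda (length num)
               (cl-incf sum (* num (/ (- length 1) (float length)))))
             groups)
    ;; Dividing by the number of distinct field counts penalizes
    ;; separators that split lines unevenly.
    (let ((unique-groups (hash-table-count groups)))
      (if (= 0 unique-groups)
          0
        (/ sum unique-groups)))))

(my-separator-score ?,  my-sample-data) ; 1.75
(my-separator-score ?\s my-sample-data) ; 2.5
(my-separator-score ?\; my-sample-data) ; about 3.33
```

On this sample the comma splits two lines into 4 fields and three lines
into 3, giving (2 * 3/4 + 3 * 2/3) / 2 = 1.75. The semicolon and the
hyphen each split every line into 3 fields, giving 5 * 2/3, about 3.33.
That tie is why `csv--preferred-separators' is needed for the guess to
come out as the semicolon.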