#47885 - [PATCH] org-table-import: Make it more smarter for interactive use

GNU bug report logs - #47885
[PATCH] org-table-import: Make it more smarter for interactive use

Reported by: Utkarsh Singh <utkarsh190601 <at> gmail.com>

Date: Mon, 19 Apr 2021 04:44:02 UTC

Severity: normal

Tags: patch

Message #14 received at 47885 <at> debbugs.gnu.org:

From: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
To: Utkarsh Singh <utkarsh190601 <at> gmail.com>
Cc: 47885 <at> debbugs.gnu.org, emacs-orgmode <at> gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Tue, 20 Apr 2021 15:40:12 +0200

Hello,

Utkarsh Singh <utkarsh190601 <at> gmail.com> writes:

> At first I was also reluctant in creating a new function but decided to
> do so because:
>
> + org-table-convert-region is currently doing two thing 'guessing the
> separator' and 'converting the region'. I thought it was a good idea to
> separate out function into it's atomic operations.

I understand, but there is sometimes a (difficult) line to draw between
"separating concerns" and "function proliferation". Anyway, that's fine
here.

> + Current guessing technique is quite basic as it assumes that data
> (file that has to be imported) has no error/inconsistency in it. I
> would like to show you the doc string of Python's CSV library
> implementation to guess separator (region inside """):
>
> """
> Looks for text enclosed between two identical quotes
> (the probable quotechar) which are preceded and followed
> by the same character (the probable delimiter).
> For example:
>                  ,'some text',
> The quote with the most wins, same with the delimiter.
> If there is no quotechar the delimiter can't be determined
> this way.
> """
>
> And if this functions fails then we have:
>
> """
> The delimiter /should/ occur the same number of times on
> each row. However, due to malformed data, it may not. We don't want
> an all or nothing approach, so we allow for small variations in this
> number.
>   1) build a table of the frequency of each character on every line.
>   2) build a table of frequencies of this frequency (meta-frequency?),
>      e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows,
>      7 times in 2 rows'
>   3) use the mode of the meta-frequency to determine the /expected/
>      frequency for that character
>   4) find out how often the character actually meets that goal
>   5) the character that best meets its goal is the delimiter
> For performance reasons, the data is evaluated in chunks, so it can
> try and evaluate the smallest portion of the data possible, evaluating
> additional chunks as necessary.
> """

For the problem we're trying to solve, this sounds like
over-engineering to me. Do we want so badly to guess a separator?

> I tried to do similar in Elisp but currently facing some issues due to
> my inexperience in functional programming. Also moving the 'guessing'
> part out the function may lead to development of even better algorithm
> than Python counterpart.
>
> Modified version of concerned function:
>
> (defun org-table-guess-separator (beg0 end0)
>   "Guess separator for `org-table-convert-region' for region BEG0 to END0.
>
> List of preferred separator:
> comma, TAB, semicolon, colon or SPACE.
>
> If region contains a line which doesn't contain the required
> separator then discard the separator and search again using next
> separator."
>   (let* ((beg (save-excursion
>                 (goto-char (min beg0 end0))
>                 (line-beginning-position)))
>          (end (save-excursion
>                 (goto-char (max beg0 end0))
>                 (line-end-position)))

Thinking again about it, this needs extra care, as end0 might end up on
an empty line. You tried to avoid this in your first function, but
I think this was not sufficient either.

Actually, beg0 could also start on an empty line.

This needs to be tested extensively, but as a first approximation,
I think `beg' needs to be defined as:

  (save-excursion
    (goto-char (min beg0 end0))
    (skip-chars-forward " \t\n")
    (if (eobp) (point) (line-beginning-position)))

and `end' as

  (save-excursion
    (goto-char (max beg end0))
    (skip-chars-backward " \t\n" beg)
    (if (= beg (point)) (point) (line-end-position)))

Then you need to bail out if beg = end.

>          (sep-rexp '(("," "^[^\n,]+$")

sep-rexp -> sep-regexp

>                      ("\t" "^[^\n\t]+$")
>                      (";" "^[^\n;]+$")
>                      (":" "^[^\n:]+$")
>                      (" " "^\$[^'\"][^\n\s][^'\"]\$+$")))

At this point, I suggest to use `rx' macro instead.

>          (tmp (car sep-rexp))
>          sep)
>     (save-excursion
>       (goto-char beg)
>       (while (and (not sep)
>                   (if (save-excursion
>                         (not (re-search-forward (nth 1 tmp) end t)))
>                       (setq sep (nth 0 tmp))
>                     (setq sep-rexp (cdr sep-rexp))
>                     (setq tmp (car sep-rexp)))))

I suggest this (yes, I like pattern-matching, `car' and `cdr' are so
80's) instead:

  (save-excursion
    (goto-char beg)
    (catch :found
      (pcase-dolist (`(,sep ,regexp) sep-regexp)
        (save-excursion
          (unless (re-search-forward regexp end t)
            (throw :found sep))))
      nil))

Again all this needs to be extensively tested, as there are a lot of
dangers lurking around.

Regards,
-- 
Nicolas Goaziou
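
For reference, the suggestions in this message (trimmed region bounds, bailing out when beg = end, `rx' patterns, and the `pcase-dolist' loop) can be put together roughly as follows. This is only a sketch under assumptions, not the patch that was eventually applied: the name `my/table-guess-separator' is hypothetical, the `rx' forms are a guessed translation of the quoted regexps, and the SPACE entry is a simplified stand-in because the quoted SPACE regexp is not reproduced here.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'rx)

(defun my/table-guess-separator (beg0 end0)
  "Guess a field separator for the region between BEG0 and END0.
Hypothetical sketch.  Return the first of comma, TAB, semicolon,
colon or SPACE for which no non-empty line lacking it is found,
or nil if the region is blank or no candidate qualifies."
  (let* (;; Skip leading whitespace and blank lines, then move to the
         ;; start of the first line that has content.
         (beg (save-excursion
                (goto-char (min beg0 end0))
                (skip-chars-forward " \t\n")
                (if (eobp) (point) (line-beginning-position))))
         ;; Skip trailing whitespace, but never move before BEG.
         (end (save-excursion
                (goto-char (max beg end0))
                (skip-chars-backward " \t\n" beg)
                (if (= beg (point)) (point) (line-end-position))))
         ;; Each regexp matches a non-empty line that does NOT
         ;; contain the associated separator.
         (sep-regexp
          `(("," ,(rx bol (+ (not (any "\n,"))) eol))
            ("\t" ,(rx bol (+ (not (any "\n\t"))) eol))
            (";" ,(rx bol (+ (not (any "\n;"))) eol))
            (":" ,(rx bol (+ (not (any "\n:"))) eol))
            ;; Assumption: simplified SPACE rule, ignores quoting.
            (" " ,(rx bol (+ (not (any "\n "))) eol)))))
    ;; Bail out on a blank region (beg = end).
    (unless (= beg end)
      (save-excursion
        (goto-char beg)
        (catch :found
          (pcase-dolist (`(,sep ,regexp) sep-regexp)
            (save-excursion
              ;; No line without SEP was found: accept SEP.
              (unless (re-search-forward regexp end t)
                (throw :found sep))))
          nil)))))
```

One known gap in this sketch: a whitespace-only line in the middle of the region counts as a line lacking every separator except SPACE, which is the kind of lurking danger that would need testing.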

This bug report was last modified 4 years and 103 days ago.

