GNU bug report logs - #78528
[PATCH v1] calc: Allow strings with higher character codes


Package: emacs;

Reported by: "Jacob S. Gordon" <jacob.as.gordon <at> gmail.com>

Date: Wed, 21 May 2025 07:04:03 UTC

Severity: normal

Tags: patch

Done: Eli Zaretskii <eliz <at> gnu.org>


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: "Jacob S. Gordon" <jacob.as.gordon <at> gmail.com>
Subject: bug#78528: closed (Re: bug#78528: [PATCH v1] calc: Allow strings
 with higher character codes)
Date: Sat, 14 Jun 2025 14:14:13 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#78528: [PATCH v1] calc: Allow strings with higher character codes

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 78528 <at> debbugs.gnu.org.

-- 
78528: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=78528
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Eli Zaretskii <eliz <at> gnu.org>
To: "Jacob S. Gordon" <jacob.as.gordon <at> gmail.com>
Cc: 78528-done <at> debbugs.gnu.org
Subject: Re: bug#78528: [PATCH v1] calc: Allow strings with higher character
 codes
Date: Sat, 14 Jun 2025 17:12:34 +0300
> Date: Mon, 2 Jun 2025 15:28:16 -0400
> Cc: 78528 <at> debbugs.gnu.org
> From: "Jacob S. Gordon" <jacob.as.gordon <at> gmail.com>
> 
> Hello,
> 
> On 2025-05-31 09:27 Eli Zaretskii wrote:
> > It's in my (too long, admittedly) queue.
> 
> No problem, thanks for the confirmation.
> 
> > Can you think of any possible downsides to installing the patch?
> 
> Nothing that I think can’t be ironed out.
> 
> + The custom variable defaults to the previously hard‐coded value, so
> unless users change it, `calc' will act the same as before.
> 
> + This variable only affects the display of vectors‐of‐chars, and
> touches none of the underlying types (e.g., algebraic variables are
> still restricted to a basic Latin-Greek range through an independent
> parsing step).
> 
> + Allowing a higher maximum means that users can encounter characters
> without a fixed width, or contextual forms that change the rendered
> string length. Alignment/justification, and some elements of
> “compositions” assume fixed-width characters for their calculations,
> so their results can be off. Here are some representative examples
> from all the affected compositions (the extent is font‐dependent):
> 
>   + `choriz' (horizontal composition) optionally takes a `SEP' vector:
> 
>   #+begin_src calc
>   choriz([a b/c],"✕")
>   #+end_src
> 
>   #+begin_src text
>   1:  a✕b / c
>   #+end_src
> 
>   + Only the `crule' component of vertical compositions is affected,
>   which optionally takes a character to form the horizontal rule. For
>   example, comparing the em dash, hyphen-minus, and hyphen,
>   respectively, the hyphen rule isn’t full enough:
> 
>   #+begin_src calc
>   cvert([a + 1, cbase(crule("—")), b^2])
>   cvert([a + 1, cbase(crule("-")), b^2])
>   cvert([a + 1, cbase(crule("‐")), b^2])
>   #+end_src
> 
>   #+begin_src text
>   3:  a + 1
>       —————
>        b^2
>   2:  a + 1
>       -----
>        b^2
>   1:  a + 1
>       ‐‐‐‐‐
>        b^2
>   #+end_src
> 
>   + `cspace', `cvspace', `ctspace', `cbspace' all take strings as an
>   optional second argument to repeat some number of times, and will
>   behave similarly to `string' with respect to alignment.
> 
>   + `cwidth' counts characters, and will be different from the actual
>   length with variable-width characters or contextual forms. I’m less
>   familiar with vertically‐oriented scripts, but I imagine `cheight'
>   can suffer similarly with something like `cvspace'.
> 
>   + Any user‐defined compositions involving strings may be affected if
>   they make the same assumptions about string width, increase the
>   custom variable, and include offending characters.
> 
> + With the `calc-big-language' display mode (`d B'), but none of the
> other modes, pure RTL strings are aligned opposite to the LTR strings.
> 
> > In any case, to accept such a large contribution we'd need you to
> > sign the copyright assignment agreement (which you currently don't
> > have, AFAICT). If you are willing to do that, I will send you the
> > form to fill and the instructions to go with it.
> 
> That’s right, I haven’t signed the copyright assignment agreement yet,
> but I’m willing.

Thanks.  Since the copyright assignment paperwork is now done, I've
installed this on the master branch, and I'm closing the bug.

[Message part 3 (message/rfc822, inline)]
From: "Jacob S. Gordon" <jacob.as.gordon <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH v1] calc: Allow strings with higher character codes
Date: Tue, 20 May 2025 21:34:02 -0400
[Message part 4 (text/plain, inline)]
Tags: patch

Hello all,

Please find below a feature proposal for strings in `calc', and a first
draft of a patch attached to this message.

Motivation
==========

Suppose you're working with Unicode code points in `calc', and you end
up with the following vector on the stack. You'd like to know what a
string composed of these character codes would look like, so you toggle
`calc-display-strings' (`d "') and … nothing happens.

,----
| 1:  [383, 117, 99, 99, 101, 383, 115]
`----

Later, in an `org-mode' file you have the following table with a list of
dates in the first column. Since [formulas] can be any algebraic
expression understood by `calc', and `calc' [understands dates], you try
to insert a Unicode character for rows where the first column is in the
past. When you evaluate the formula (`C-c C-c' on the `#+TBLFM:' line)
`calc' stops short of displaying the string.

,----
| | Date             | Past?           |
| |------------------+-----------------|
| | [2025-05-01 Thu] | string([10003]) |
| | [2026-05-01 Fri] |                 |
| #+TBLFM: $2 = if($1 < now(), string("✓"), string(""))
`----

Both of these problems are due to the fact that some or all of the
character codes are outside the `Latin-1' (8-bit) range. If we replace
this hard-coded limitation with a custom variable and increase its
value, both of these use-cases can be supported.

,----
| 1:  "ſucceſs"
`----

,----
| | Date             | Past? |
| |------------------+-------|
| | [2025-05-01 Thu] | ✓     |
| | [2026-05-01 Fri] |       |
| #+TBLFM: $2 = if($1 < now(), string("✓"), string(""))
`----

The alternative is that the user has to exit `calc' (or its syntax) and
dip into `Lisp':

,----
| (concat '(383 117 99 99 101 383 115))
`----

,----
| | Date             | Past? |
| |------------------+-------|
| | [2025-05-01 Thu] | ✓     |
| | [2026-05-01 Fri] |       |
| #+TBLFM: $2 = '(if (time-less-p (org-read-date t t $1) (current-time)) "✓" "")
`----

[formulas] <https://orgmode.org/manual/Formula-syntax-for-Calc.html>

[understands dates]
<https://www.gnu.org/software/emacs/manual/html_node/calc/Date-Forms.html>

Proposal & Impact
=================

The attached patch introduces a custom variable
`calc-string-maximum-character' (optimistically versioned for `31.1'),
which replaces a hard-coded maximum in the function
`math-vector-is-string'. This variable defaults to `0xFF' in order to
preserve the current behaviour, but otherwise can be any character up to
`(max-char)'. Since the vector contents are passed to
`math-vector-to-string', the Unicode-aware `concat' has no problem with
the higher characters:

,----
| (defun math-vector-to-string (a &optional quoted)
|   (setq a (concat (mapcar (lambda (x) (if (consp x) (nth 1 x) x))
|                           (cdr a))))
|   […])
`----
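
For illustration only, the idea of replacing the hard-coded limit can be
sketched as below. This is not the code from the attached patch: the
names `my-calc-string-max-char' and `my-vector-is-string-p' are
hypothetical, and the sketch ignores the non-integer elements that the
`(consp x)' branch above handles.

```elisp
;; Hypothetical stand-in for the proposed custom variable.  It defaults
;; to the old hard-coded Latin-1 limit, so behaviour is unchanged unless
;; the value is raised.
(defvar my-calc-string-max-char #xFF
  "Highest character code treated as part of a string (illustrative).")

(defun my-vector-is-string-p (vec)
  "Return non-nil if VEC is a calc-style vector of character codes.
VEC has the form (vec N1 N2 ...).  Every element must be an integer
between 0 and `my-calc-string-max-char', inclusive."
  (and (eq (car-safe vec) 'vec)
       (catch 'not-a-string
         ;; `dolist' returns t only if no element triggered the throw.
         (dolist (x (cdr vec) t)
           (unless (and (integerp x)
                        (<= 0 x my-calc-string-max-char))
             (throw 'not-a-string nil))))))

;; With the default limit, 383 (LATIN SMALL LETTER LONG S) is rejected:
(my-vector-is-string-p '(vec 383 117 99 99 101 383 115))   ; => nil

;; Raising the limit accepts the same vector:
(let ((my-calc-string-max-char (max-char)))
  (my-vector-is-string-p '(vec 383 117 99 99 101 383 115))) ; => t
```

Because the default equals the previous limit, only users who raise the
value would see any difference in which vectors qualify as strings.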

Here are the outstanding issues I've identified for discussion:

1. Since users can blow past the variable type and set
   `calc-string-maximum-character' to /anything/, I'm not sure the
   patch's error handling is enough. If a hapless user sets it to
   something invalid like a string (`"invalid"', let's say), then with
   the current patch they'll encounter at least two kinds of errors:

   a) With the following vector on the stack, executing
      `calc-display-strings' (`d "') will display `Wrong type argument:
      number-or-marker-p, "invalid"' in the minibuffer, /and/ enter a
      string display mode where the vector isn't rendered as seen in the
      second block below.

      ,----
      | 1:  [0, 1, 2]
      `----

      ,----
      | 1:  .
      `----

      Only executing `calc-display-strings' (`d "') again will toggle
      the display mode and show the original vector. This is a bad
      experience for the user, and should be mitigated by raising an
      error in `calc-display-strings' before the display mode is
      toggled.

   b) If a user tries to enter a string algebraically with
      `calc-algebraic-entry' (`''), say `string("abc")', the same
      message from the first error will appear in the minibuffer, but
      the string is not added to the stack. This is slightly cryptic,
      but not as bad an experience as the first error.

2. With a higher value of `calc-string-maximum-character', the displayed
   string could contain right-to-left or a bidirectional mixture of
   characters that could conceivably interfere with the `calc' alignment
   functions `calc-left-justify' (`d <'), `calc-center-justify' (`d ='),
   and `calc-right-justify' (`d >'). Toggling the display of the
   following vectors reveals a misalignment of the fully Arabic string
   under center justification, and misalignment of the full- and
   mixed-Arabic strings under right justification. None of these contain
   any of the funky bidirectional Unicode markers, so I'm not sure if
   there are other problems lurking.

   ,----
   | 3:  [108, 101, 102, 116, 45, 116, 111, 45, 114, 105, 103, 104, 116]
   | 2:  [1605, 1606, 32, 1575, 1604, 1610, 1605, 1610, 1606, 32, 1573, 1604, 1609, 32, 1575, 1604, 1610, 1587, 1575, 1585]
   | 1:  [108, 101, 102, 116, 45, 1610, 1605, 1610, 1606]
   `----

   ,----
   | 3:  "left-to-right"
   | 2:  "من اليمين إلى اليسار"
   | 1:  "left-يمين"
   `----

   ,----
   | 3:                       "left-to-right"
   | 2:                   "من اليمين إلى اليسار"
   | 1:                         "left-يمين"
   `----

   ,----
   | 3:                                               "left-to-right"
   | 2:                                        "من اليمين إلى اليسار"
   | 1:                                                   "left-يمين"
   `----

   Also, combining diacritical marks appear as separate characters, but
   I'm not sure if this is the expected behaviour and/or related to my
   configuration.

   ,----
   | 1:  [117, 776]
   `----

   ,----
   | 1:  "ü"
   `----

3. I haven't found any internal references to `math-vector-is-string'
   that look like they could conflict with this change
   (`math-format-flat-expr-fancy', `math-compose-expr',
   `calc-kbd-query'). Existing references are mostly related to
   displaying strings from vectors, `string' or `bstring' objects, and
   composite objects involving vectors or strings, but I could use an
   extra set of eyes to confirm. Since `org-mode' uses `calc'
   expressions in tables, I might need to get their concurrence with the
   change. I'm unaware of any third-party dependencies on this function.

4. For unit tests, are there any naming conventions I should follow? I
   just stuck all of the tests in one place for `math-vector-is-string'.
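
One possible shape for the mitigation described in issue 1 is to
validate the setting before any display state changes. The following is
a minimal sketch under assumed names (`my-display-strings',
`my-check-max-char' and `my-toggle-display-strings' are hypothetical and
do not appear in `calc' or in the patch); only `characterp' and
`user-error' are existing Emacs functions.

```elisp
;; Hypothetical names throughout; this is not the code from the patch.
(defvar my-display-strings nil
  "Non-nil when vectors of character codes are shown as strings.")

(defun my-check-max-char (value)
  "Return VALUE if it is a valid character code, else signal `user-error'."
  (unless (characterp value)
    (user-error "Not a valid maximum character: %S" value))
  value)

(defun my-toggle-display-strings (max-char)
  "Toggle `my-display-strings' after validating MAX-CHAR."
  ;; Validate first, so a bad setting leaves the display mode untouched.
  (my-check-max-char max-char)
  (setq my-display-strings (not my-display-strings)))
```

With this ordering, `(my-toggle-display-strings "invalid")' signals an
error and leaves `my-display-strings' unchanged, so the user would not
end up in a display mode where the vector fails to render.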


Thanks for your consideration!

--
Jacob S. Gordon
jacob.as.gordon <at> gmail.com

=========================

Please avoid sending me HTML emails and MS Office documents.
https://useplaintext.email/#etiquette
https://www.gnu.org/philosophy/no-word-attachments.html

In GNU Emacs 30.1 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.49,
cairo version 1.18.4)
System Description: Arch Linux

Configured using:
 'configure --with-pgtk --sysconfdir=/etc --prefix=/usr
 --libexecdir=/usr/lib --localstatedir=/var --disable-build-details
 --with-cairo --with-harfbuzz --with-libsystemd --with-modules
 --with-native-compilation=aot --with-tree-sitter 'CFLAGS=-march=x86-64
 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=3
 -Wformat -Werror=format-security -fstack-clash-protection
 -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -g
 -ffile-prefix-map=/build/emacs/src=/usr/src/debug/emacs -flto=auto'
 'LDFLAGS=-Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro
 -Wl,-z,now -Wl,-z,pack-relative-relocs -flto=auto'
 'CXXFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions
 -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security
 -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer
 -mno-omit-leaf-frame-pointer -Wp,-D_GLIBCXX_ASSERTIONS -g
 -ffile-prefix-map=/build/emacs/src=/usr/src/debug/emacs -flto=auto''

[v1-0001-calc-Allow-strings-with-higher-character-codes.patch (text/patch, attachment)]
