GNU bug report logs - #11197
problems with string ports and unicode

Package: guile;

Reported by: Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>

Date: Sat, 7 Apr 2012 20:09:01 UTC

Severity: normal

Done: ludo <at> gnu.org (Ludovic Courtès)

Bug is archived. No further changes may be made.

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: 11197 <at> debbugs.gnu.org, Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Subject: bug#11197: problems with string ports and unicode
Date: Wed, 11 Apr 2012 23:01:16 +0200
[Message part 1 (text/plain, inline)]
Hi Mark,

Mark H Weaver <mhw <at> netris.org> skribis:

> Okay, now I understand.  The problem is that internally, string ports
> are implemented by converting the string into a stream of bytes in the
> string port's encoding, and then the string port reads those bytes.

Exactly.
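
To see that conversion step in isolation, here is a rough sketch. It assumes the (ice-9 iconv) module, which is not mentioned in this thread and may be missing from older Guile releases. Encoding a string to bytes works only when the target encoding can represent every character. This illustrates the idea and is not the code in strports.c; the names lam, utf8-bytes and latin1-bytes are made up for the example.

```scheme
(use-modules (ice-9 iconv))

;; U+03BB (GREEK SMALL LETTER LAMBDA), built without a non-ASCII literal
;; so the example does not depend on the source file's encoding.
(define lam (string (integer->char #x3bb)))

;; UTF-8 can represent any Guile string, so this conversion succeeds.
(define utf8-bytes (string->bytevector lam "UTF-8"))

;; ISO-8859-1 has no lambda; with the default 'error conversion strategy
;; the conversion raises an exception, caught here and mapped to #f.
(define latin1-bytes
  (catch #t
    (lambda () (string->bytevector lam "ISO-8859-1"))
    (lambda args #f)))
```

A string port whose encoding cannot represent its contents hits the same kind of failure when its byte buffer is built.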

[...]

> Conceptually, a string port is a textual port, not a binary port.

But not in Guile, where there’s no distinction between textual and
binary ports.  One can write code like:

  scheme@(guile-user)> (define (string->utf16 s)
                         (let ((p (with-fluids ((%default-port-encoding "UTF-16BE"))
                                    (open-input-string s))))
                           (get-bytevector-all p)))
  scheme@(guile-user)> (string->utf16 "hello")
  $4 = #vu8(0 104 0 101 0 108 0 108 0 111)
  scheme@(guile-user)> (use-modules(rnrs bytevectors))
  scheme@(guile-user)> (utf16->string $4)
  $5 = "hello"
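
For comparison, the same bytes can be obtained without a port, using the R6RS procedures from (rnrs bytevectors). This is only a sketch: the explicit (endianness big) argument is there to match the UTF-16BE encoding used above, and the names bv and round-trip are illustrative.

```scheme
(use-modules (rnrs bytevectors))

;; Encode "hello" as big-endian UTF-16 directly, without a string port.
(define bv (string->utf16 "hello" (endianness big)))

;; Decode the bytevector back into a string with the same endianness.
(define round-trip (utf16->string bv (endianness big)))
```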

> You should be able to hand it an arbitrary string and read those
> characters from it, as described in SRFI-6, without setting
> Guile-specific fluid variables.  Similarly, you should be able to
> write arbitrary characters to a string-output-port.

The SRFI-6 issue could be addressed with:

[Message part 2 (text/x-patch, inline)]
diff --git a/module/srfi/srfi-6.scm b/module/srfi/srfi-6.scm
index 098b586..ba946ec 100644
--- a/module/srfi/srfi-6.scm
+++ b/module/srfi/srfi-6.scm
@@ -1,6 +1,6 @@
 ;;; srfi-6.scm --- Basic String Ports
 
-;; 	Copyright (C) 2001, 2002, 2003, 2006 Free Software Foundation, Inc.
+;; 	Copyright (C) 2001, 2002, 2003, 2006, 2012 Free Software Foundation, Inc.
 ;;
 ;; This library is free software; you can redistribute it and/or
 ;; modify it under the terms of the GNU Lesser General Public
@@ -23,10 +23,16 @@
 ;;; Code:
 
 (define-module (srfi srfi-6)
-  #:re-export (open-input-string open-output-string get-output-string))
+  #:export (open-input-string open-output-string)
+  #:re-export (get-output-string))
 
-;; Currently, guile provides these functions by default, so no action
-;; is needed, and this file is just a placeholder.
+(define (open-input-string s)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-input-string) s)))
+
+(define (open-output-string)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-output-string))))
 
 (cond-expand-provide (current-module) '(srfi-6))
[Message part 3 (text/plain, inline)]
It wouldn’t completely solve the problem.

> IMO, string ports should use UTF-8 as their initial port encoding, since
> we know that UTF-8 can represent any Guile string.  This will allow
> portable use of string ports.

The change was submitted and briefly discussed at
<http://thread.gmane.org/gmane.lisp.guile.devel/9822>.

I think the rationale was mostly backward compatibility (in 1.8 people
could mix Latin-1 textual and binary I/O), consistency with how other
ports behave, and the ability to change the default encoding of string
ports.

> I realize that this would change the existing behavior of programs that
> use binary I/O on string ports, but as things stand right now, portable
> SRFI-6 code is broken on Guile.
>
> What do you think?

In hindsight, UTF-8 does seem like a better default than the locale port
encoding (which is what %default-port-encoding is, by default), but it
does remain useful to specify a different encoding.
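
As a sketch of what specifying the encoding looks like in practice, the pattern from the SRFI-6 patch above can be parameterized. The helper name open-input-string/encoding is hypothetical, not an existing Guile procedure, and lam and first-char are likewise just illustrative names:

```scheme
;; Hypothetical helper, not part of Guile: open an input string port with
;; an explicit initial encoding instead of the locale-derived default.
(define (open-input-string/encoding s encoding)
  (with-fluids ((%default-port-encoding encoding))
    (open-input-string s)))

;; A one-character string holding U+03BB, built without a non-ASCII
;; literal so the source file's own encoding does not matter.
(define lam (string (integer->char #x3bb)))

;; Read the character back through a string port opened as UTF-8.
(define first-char
  (read-char (open-input-string/encoding lam "UTF-8")))
```

Passing the encoding as an argument keeps the choice visible at the call site, so it does not depend on whatever %default-port-encoding happens to be in the caller's dynamic context.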

>>> What _is_ needed is a file coding declaration near the top of the source
>>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>>> the manual).
>>
>> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
>> magically make string ports use that encoding.
>>
>>> I tried that and it still fails for me.
>>
>> What fails exactly?
>
> It fails ungracefully (goes into an infinite loop while trying to print the
> backtrace) without the %default-port-encoding setting.

Indeed, it’s stuck in a deadlock:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007ffff75e1204 in __lll_lock_wait () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#1  0x00007ffff75dc4d4 in _L_lock_999 () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#2  0x00007ffff75dc2ea in pthread_mutex_lock () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0) at threads.c:1962
#4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=<value optimized out>) at strports.c:287
#5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0) at backtrace.c:487
#6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50, argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
#7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=<value optimized out>, arg2=<value optimized out>, arg3=<value optimized out>) at eval.c:500
#8  0x00007ffff7b32504 in scm_internal_catch (tag=<value optimized out>, body=<value optimized out>, body_data=<value optimized out>, handler=<value optimized out>, handler_data=<value optimized out>) at throw.c:222
#9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=<value optimized out>, port=<value optimized out>, first=<value optimized out>, depth=<value optimized out>, highlights=<value optimized out>)
    at backtrace.c:558
#10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
#11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:534
#12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0, argv=0x6fa300, nargs=-1) at vm-i-system.c:895
#13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=<value optimized out>) at vm.c:878
#14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8, encoding=<value optimized out>, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
#15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608, caller=<value optimized out>) at strports.c:312
--8<---------------cut here---------------end--------------->8---

This could be fixed by calling ‘scm_new_port_table_entry’ after having
prepared the backing buffer, but the problem is that ‘pt->encoding’ is
needed to prepare that buffer in the first place.

Thoughts?

Ludo’.

This bug report was last modified 12 years and 334 days ago.
