From unknown Mon Jun 23 07:53:11 2025
Subject: bug#40540: Faster sort with locale
From: Ole Tange
To: 40540@debbugs.gnu.org
Date: Fri, 10 Apr 2020 15:19:19 +0200
Content-Type: text/plain; charset="UTF-8"

I have noticed that if locale is set, then sort becomes much slower.

I imagine that it is because instead of doing

    simple_compare(string1,string2)

it does:

    localized_compare(string1,string2)

But would it be possible to convert the input string1 into a string in
a generalized format, which would sort the same way as the localized
sort, but using a simple compare? Like this:

    string1_general = localize(string1)
    string2_general = localize(string2)
    simple_compare(string1_general,string2_general)

If that is possible, then localize() can be done by other cores in
advance and thereby offload the "primary" core.

/Ole

From unknown Mon Jun 23 07:53:11 2025
Subject: bug#40540: Faster sort with locale
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Ole Tange
Cc: 40540@debbugs.gnu.org
Message-ID: <83d22efe-4e2b-4823-e1a3-08bb594654e3@cs.ucla.edu>
Date: Fri, 10 Apr 2020 11:56:47 -0700
Content-Type: text/plain; charset=utf-8; format=flowed

On 4/10/20 6:19 AM, Ole Tange wrote:
> But would it be possible to convert the input string1 into a string in
> a generalized format, which would sort the same way as the localized
> sort, but using a simple compare?

I tried doing that a long time ago by using strxfrm, but it made 'sort'
significantly slower. You're welcome to try again; perhaps things have
changed.
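
For readers following the thread, here is a minimal sketch of the two approaches being compared. It is written in Python rather than C, and it is not how GNU sort is implemented. It assumes Python's standard `locale.strcoll` and `locale.strxfrm`, which wrap the C library functions of the same names. The function names `sort_with_strcoll` and `sort_with_precomputed_keys` are hypothetical and chosen only for this illustration. The idea of computing the keys on other cores is not shown.

```python
import locale
from functools import cmp_to_key


def sort_with_strcoll(lines):
    """Sort by calling the locale-aware comparison for every pair compared."""
    return sorted(lines, key=cmp_to_key(locale.strcoll))


def sort_with_precomputed_keys(lines):
    """Transform each line once with strxfrm, then sort on the plain keys."""
    # Each key is built so that ordinary comparison of two keys is meant
    # to give the same ordering as strcoll on the original strings.
    keyed = [(locale.strxfrm(line), line) for line in lines]
    keyed.sort(key=lambda pair: pair[0])
    return [line for _, line in keyed]


if __name__ == "__main__":
    # Use the collation rules from the environment (LANG, LC_ALL, ...).
    # If that locale is not installed, stay in the default "C" locale.
    try:
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        pass

    sample = ["pear", "Apple", "banana", "apple"]
    print(sort_with_strcoll(sample))
    print(sort_with_precomputed_keys(sample))
```

Whether the second variant is actually faster depends on the C library, the locale, and the input: the transformed keys can be considerably longer than the original strings, so the extra memory traffic may outweigh the cheaper comparisons, which would be consistent with the slowdown reported above.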