#73484 - 31.0.50; Abolishing etags-regen-file-extensions

GNU bug report logs - #73484
31.0.50; Abolishing etags-regen-file-extensions

Package: emacs;

Reported by: Sean Whitton <spwhitton <at> spwhitton.name>

Date: Wed, 25 Sep 2024 19:41:01 UTC

Severity: wishlist

Found in version 31.0.50


From: Eli Zaretskii <eliz <at> gnu.org>
To: Francesco Potortì <pot <at> gnu.org>
Cc: dmitry <at> gutov.dev, 73484 <at> debbugs.gnu.org, spwhitton <at> spwhitton.name
Subject: bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
Date: Thu, 10 Oct 2024 08:41:52 +0300

> From: Francesco Potortì <pot <at> gnu.org>
> Date: Thu, 10 Oct 2024 03:07:31 +0200
> Cc: 73484 <at> debbugs.gnu.org,
>  spwhitton <at> spwhitton.name,
>  Eli Zaretskii <eliz <at> gnu.org>
>
> >+  /* /\* If the canonicalized uncompressed name */
> >+  /*    has already been dealt with, skip it silently. *\/ */
> >+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
> >+  /*   { */
> >+  /*     assert (fdp->infname != NULL); */
> >+  /*     if (streq (uncompressed_name, fdp->infname)) */
> >+  /*       goto cleanup; */
> >+  /*   } */
> >
> >   inf = fopen (file, "r" FOPEN_BINARY);
> >   if (inf)
> >
> >This is basically a "uniqueness" operation using linear search, O(N^2).
>
> This is only for dealing with the case when the same file exists in
> both compressed and uncompressed form, and we are currently hitting
> the second one.  In that case, we should skip it.  Yes, this is a
> uniqueness test and yes, it is O^2 in the number of file names, but I
> doubt that this can explain a serious slowdown.

Are you sure this is executed only for compressed files?  Maybe I'm
missing something, but that's not my reading of the code:

  compr = get_compressor_from_suffix (file, &ext);
  if (compr)
    {
      compressed_name = file;
      uncompressed_name = savenstr (file, ext - file);
    }
  else
    {
      compressed_name = NULL;
      uncompressed_name = file;
    }

  /* If the canonicalized uncompressed name
     has already been dealt with, skip it silently. */
  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
    {
      assert (fdp->infname != NULL);
      if (streq (uncompressed_name, fdp->infname))
        goto cleanup;
    }

As you see, if the file is not compressed by any known method, the
code sets compressed_name to NULL and uncompressed_name to the
canonicalized file.  But the loop doesn't test compressed_name, so it
is executed for all the files, compressed and uncompressed.  Thus, I
believe the intent is to avoid duplicate tags if the same file was
encountered twice in some way.
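
To make the cost of that loop concrete, here is a minimal standalone
sketch that models it and counts the name comparisons.  It is not the
etags.c code: struct fdesc_model, already_seen and count_comparisons
are hypothetical names introduced only for this illustration, and
strcmp stands in for streq.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the list of file descriptions in etags.c;
   only the field used by the duplicate check is modeled.  */
struct fdesc_model
{
  struct fdesc_model *next;
  const char *infname;
};

/* Number of name comparisons performed so far (illustration only).  */
static unsigned long comparisons;

/* Return 1 if NAME is already on the list starting at HEAD, else 0.
   Like the loop quoted above, this runs for every file name,
   whether or not the file is compressed.  */
static int
already_seen (const struct fdesc_model *head, const char *name)
{
  for (const struct fdesc_model *fdp = head; fdp != NULL; fdp = fdp->next)
    {
      assert (fdp->infname != NULL);
      comparisons++;
      if (strcmp (name, fdp->infname) == 0)
        return 1;
    }
  return 0;
}

/* Feed N distinct names through the check, pushing each new name on
   the list, and return the total number of comparisons performed.  */
static unsigned long
count_comparisons (size_t n)
{
  if (n == 0)
    return 0;

  char (*names)[32] = malloc (n * sizeof *names);
  struct fdesc_model *nodes = malloc (n * sizeof *nodes);
  assert (names != NULL && nodes != NULL);

  struct fdesc_model *head = NULL;
  comparisons = 0;
  for (size_t i = 0; i < n; i++)
    {
      snprintf (names[i], sizeof names[i], "file%zu.c", i);
      if (!already_seen (head, names[i]))
        {
          nodes[i].infname = names[i];
          nodes[i].next = head;
          head = &nodes[i];
        }
    }

  unsigned long total = comparisons;
  free (nodes);
  free (names);
  return total;
}
```

With N distinct names the check performs N(N-1)/2 comparisons, so
count_comparisons (1000) yields 499500, and a tree with 100,000 files
would need roughly 5 billion string comparisons in this model.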

Note that canonicalize_filename in this case doesn't really do what
its name seems to imply, e.g., relative file names will generally stay
relative.  So specifying the same file once as relative and the other
time as absolute will still process the file more than once.  We need
to use an inode test or equivalent, and probably use realpath or
equivalent, to make the duplicate test reliable.  Or maybe having the
same file processed under different names is okay, since TAGS is for
helping Emacs find the file, and so using relative names and symlinks
is okay?

> >Is there a hash table we could use?
>
> No, we have a hash table for C tags, and that's all.  It is useful
> because there are 34 keywords against which most strings in a C/C++
> file are compared.  It makes sense to build hash tables for other
> languages where a similar situation happens.

The hash table we have was built by gperf, and that method can only be
used for fixed sets of strings known in advance.  We need a different
hash table for storing file names.

> I do not think that it makes sense to build a hash table for file
> names given on the command line, because the number of comparisons
> made on those names is generally vastly inferior to the number of
> comparisons used to search for tags.

That's not what I see in the code.  But it should be easy to count the
number of loop iterations in the use case we are talking about
(running etags on the gecko-dev tree), so we don't need to argue about
facts.

> >>  . Some files have their language identified by means other than their
> >>    names or extensions: those are the languages that have
> >>    "interpreters" defined in etags.c
>
> The interpreter is the token that comes after #!, with the possible
> exception for "env", in which case the interpreter is the second
> token after #!
>
> There are two O^2 tests in the number of tags in C/C++ files which
> depend on the two options "no-line-directive" and "no-duplicates".
> Both options are usable to disable those checks and both are off by
> default because they help produce a more sane tags file and have no
> practical impact in most cases.  Both are there because, in
> principle, they cause significant slowdown in huge tags files.

AFAIU, --no-duplicates is only for ctags, not for etags.  I don't see
how --no-duplicates could be relevant to the loop described above.  Am
I missing something?
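
As a rough illustration of the "inode test" and the "hash table for
file names" mentioned above, the sketch below records each file's
device and inode numbers in a small chained hash set, so that a
relative name, an absolute name and a symlink to the same file all
count as one file.  It assumes a POSIX system where stat reports
meaningful st_dev and st_ino values; other platforms would need an
equivalent.  The names file_id, file_id_set, hash_id, file_seen and
ID_BUCKETS are hypothetical and do not exist in etags.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

#define ID_BUCKETS 1024   /* arbitrary table size chosen for the sketch */

/* One recorded file, identified by device and inode number.  */
struct file_id
{
  struct file_id *next;
  dev_t dev;
  ino_t ino;
};

/* A chained hash set of file identities; initialize to all NULL.  */
struct file_id_set
{
  struct file_id *bucket[ID_BUCKETS];
};

/* Map a (device, inode) pair to a bucket index.  */
static size_t
hash_id (dev_t dev, ino_t ino)
{
  unsigned long long h
    = (unsigned long long) ino * 31u + (unsigned long long) dev;
  return (size_t) (h % ID_BUCKETS);
}

/* Return 1 if PATH names a file already recorded in SET, 0 if it is
   new (and record it), or -1 if stat fails or memory runs out.
   stat follows symlinks, so a link and its target compare equal.  */
static int
file_seen (struct file_id_set *set, const char *path)
{
  struct stat st;

  assert (set != NULL && path != NULL);
  if (stat (path, &st) != 0)
    return -1;

  size_t h = hash_id (st.st_dev, st.st_ino);
  for (struct file_id *p = set->bucket[h]; p != NULL; p = p->next)
    if (p->dev == st.st_dev && p->ino == st.st_ino)
      return 1;

  struct file_id *n = malloc (sizeof *n);
  if (n == NULL)
    return -1;
  n->dev = st.st_dev;
  n->ino = st.st_ino;
  n->next = set->bucket[h];
  set->bucket[h] = n;
  return 0;
}
```

Each lookup touches only one bucket, so the per-file cost stays close
to constant as long as the table is not badly overloaded.  Whether
collapsing different names for the same file is actually desirable is
the open question raised above, since TAGS records names, not inodes.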

This bug report was last modified 320 days ago.

