GNU bug report logs - #73484
31.0.50; Abolishing etags-regen-file-extensions


Package: emacs;

Reported by: Sean Whitton <spwhitton <at> spwhitton.name>

Date: Wed, 25 Sep 2024 19:41:01 UTC

Severity: wishlist

Found in version 31.0.50


From: Eli Zaretskii <eliz <at> gnu.org>
To: Francesco Potortì <pot <at> gnu.org>
Cc: dmitry <at> gutov.dev, 73484 <at> debbugs.gnu.org, spwhitton <at> spwhitton.name
Subject: bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
Date: Thu, 10 Oct 2024 08:41:52 +0300
> From: Francesco Potortì <pot <at> gnu.org>
> Date: Thu, 10 Oct 2024 03:07:31 +0200
> Cc: 73484 <at> debbugs.gnu.org,
> 	spwhitton <at> spwhitton.name,
> 	Eli Zaretskii <eliz <at> gnu.org>
> 
> >+  /* /\* If the canonicalized uncompressed name */
> >+  /*    has already been dealt with, skip it silently. *\/ */
> >+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
> >+  /*   { */
> >+  /*     assert (fdp->infname != NULL); */
> >+  /*     if (streq (uncompressed_name, fdp->infname)) */
> >+  /* 	goto cleanup; */
> >+  /*   } */
> >
> >    inf = fopen (file, "r" FOPEN_BINARY);
> >    if (inf)
> >
> >This is basically a "uniqueness" operation using linear search, O(N^2).
> 
> This is only for dealing with the case when the same file exists in both compressed and uncompressed form, and we are currently hitting the second one.  In that case, we should skip it.  Yes, this is a uniqueness test and yes, it is O(N^2) in the number of file names, but I doubt that this can explain a serious slowdown.

Are you sure this is executed only for compressed files?  Maybe I'm
missing something, but that's not my reading of the code:

  compr = get_compressor_from_suffix (file, &ext);
  if (compr)
    {
      compressed_name = file;
      uncompressed_name = savenstr (file, ext - file);
    }
  else
    {
      compressed_name = NULL;
      uncompressed_name = file;
    }

  /* If the canonicalized uncompressed name
     has already been dealt with, skip it silently. */
  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
    {
      assert (fdp->infname != NULL);
      if (streq (uncompressed_name, fdp->infname))
	goto cleanup;
    }

As you see, if the file is not compressed by any known method, the
code sets compressed_name to NULL and uncompressed_name to the
canonicalized file.  But the loop doesn't test compressed_name, so it
is executed for all the files, compressed and uncompressed.  Thus, I
believe the intent is to avoid duplicate tags if the same file was
encountered twice in some way.

Note that canonicalize_filename in this case doesn't really do what
its name seems to imply, e.g., relative file names will generally stay
relative.  So specifying the same file once as relative and the other
time as absolute will still process the file more than once.  We need
to use an inode test or equivalent, and probably use realpath or
equivalent, to make the duplicate test reliable.  Or maybe having the
same file processed under different names is okay, since TAGS is for
helping Emacs find the file, and so using relative names and symlinks
is okay?

> >Is there a hash table we could use?
> 
> No, we have a hash table for C tags, and that's all.  It is useful because there are 34 keywords against which most strings in a C/C++ file are compared.  It makes sense to build hash tables for other languages where a similar situation happens.

The hash table we have was built by gperf, and that method can only be
used for fixed sets of strings known in advance.  We need a different
hash table for storing file names.
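
As a rough sketch of what such a run-time table could look like,
assuming a fixed bucket count and the FNV-1a string hash (both
arbitrary choices for illustration), something along these lines would
do.  All names below are hypothetical; nothing like this exists in
etags.c today:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is not code from etags.c.  */
#define NAME_SET_BUCKETS 4096u	/* fixed size, for the sketch only */

struct name_node
{
  struct name_node *next;
  char *name;
};

/* Must be zero-initialized before first use.  */
struct name_set
{
  struct name_node *buckets[NAME_SET_BUCKETS];
};

/* 32-bit FNV-1a hash of the NUL-terminated string S.  */
static uint32_t
name_hash (const char *s)
{
  uint32_t h = 2166136261u;
  for (; *s != '\0'; s++)
    {
      h ^= (unsigned char) *s;
      h *= 16777619u;
    }
  return h;
}

/* Record NAME in SET.  Return true if NAME was newly added,
   false if it was already present.  Abort on allocation failure.  */
bool
name_set_add (struct name_set *set, const char *name)
{
  assert (set != NULL && name != NULL);
  struct name_node **bucket
    = &set->buckets[name_hash (name) % NAME_SET_BUCKETS];

  /* Only names that hash to the same bucket are compared.  */
  for (struct name_node *p = *bucket; p != NULL; p = p->next)
    if (strcmp (name, p->name) == 0)
      return false;

  size_t len = strlen (name) + 1;
  struct name_node *node = malloc (sizeof *node);
  char *copy = malloc (len);
  if (node == NULL || copy == NULL)
    abort ();
  memcpy (copy, name, len);
  node->name = copy;
  node->next = *bucket;
  *bucket = node;
  return true;
}
```

With something like this, the duplicate test would become a single
call that jumps to cleanup when name_set_add returns false, and the
number of string comparisons per file would depend on the bucket
length instead of on the number of files already processed.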

> I do not think that it makes sense to build a hash table for file names given on the command line, because the number of comparisons made on those names is generally vastly smaller than the number of comparisons used to search for tags.

That's not what I see in the code.  But it should be easy to count the
number of loop iterations in the use case we are talking about
(running etags on the gecko-dev tree), so we don't need to argue about
facts.
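
For a feel of the numbers, here is a small stand-alone simulation of
the linear-search duplicate test.  It is only a sketch: the struct and
function names are hypothetical, the file names are synthetic, and it
does not instrument etags itself.  With N distinct names the loop body
runs 0 + 1 + ... + (N-1) = N(N-1)/2 times:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the list of already-seen files.  */
struct seen_file
{
  struct seen_file *next;
  char name[32];
};

/* Feed N distinct synthetic names through a linear-search duplicate
   test and return how many string comparisons were made.  */
unsigned long long
count_dedup_comparisons (unsigned int n)
{
  struct seen_file *head = NULL;
  unsigned long long comparisons = 0;

  for (unsigned int i = 0; i < n; i++)
    {
      char name[32];
      bool seen = false;

      snprintf (name, sizeof name, "file%u.c", i);
      for (struct seen_file *p = head; p != NULL; p = p->next)
	{
	  comparisons++;
	  if (strcmp (name, p->name) == 0)
	    {
	      seen = true;
	      break;
	    }
	}
      if (!seen)
	{
	  struct seen_file *node = malloc (sizeof *node);
	  assert (node != NULL);
	  strcpy (node->name, name);
	  node->next = head;
	  head = node;
	}
    }

  /* Release the list before returning.  */
  while (head != NULL)
    {
      struct seen_file *next = head->next;
      free (head);
      head = next;
    }
  return comparisons;
}
```

For example, 1000 distinct names cost 499500 comparisons, and the
count grows quadratically from there.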

> >>   . Some files have their language identified by means other than their
> >>     names or extensions: those are the languages that have
> >>     "interpreters" defined in etags.c
> 
> The interpreter is the token that comes after #!, with the possible exception of "env", in which case the interpreter is the second token after #!
> 
> There are two O(N^2) tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates".  Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases.  Both are there because, in principle, the checks cause significant slowdown in huge tags files.

AFAIU, --no-duplicates is only for ctags, not for etags.  I don't see
how --no-duplicates could be relevant to the loop described above.  Am
I missing something?




This bug report was last modified 225 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.