G'day,

Here's an updated "untangle" Lua script (http://www.lua.org/), resynchronised with the changes up until Grep 2.20 (specifically commit ...78f07b8c8e26, the post-release administrivia commit).

The original script (17 April) was fairly rough and incomplete in places, and was overshadowed by the many improvements being made in the lead-up to 2.19, especially the extremely impressive work from Norihiro, Paul and Jim at multiple levels to improve performance, documentation, and code consistency and clarity in many areas.

I still believe that a more fundamental rework of dfa.c, based on making the token language more expressive, is worthwhile, but that such an invasive change could only occur if dfa.c was made easier to modify; hence, this untangle script.

There are two major improvements over the 17 April version:

1. The modules now clean up after themselves, instead of leaving dangling memory blocks. Running "make check" with Valgrind, as per the script suggested by Jim in his 21 May message, now runs cleanly, although the "symlink" test hangs on my system for both the original and modified grep versions, for reasons that I haven't tried to track down.

2. I've created a new module, "mbcsets", which models multibyte character class sets at arm's length to users, in a similar fashion to the "charclass" module. The result is that the token output from fsaparse should now be identical to the token output from parse in dfa.c, whereas previously the comparison could only be made at the output of fsalex, as it did not share its internally-built mbcsets structures.

There are quite a number of implementation differences that are worth inspecting in the untangled versions:

1. Charclass previously guaranteed persistent class indices, but class pointers could change (due to realloc ()); it now also explicitly supports persistent pointers. This has facilitated better charclass caching in various places;

2. I'm still working towards a scenario where multiple lexers/parsers/etc could co-exist in different-locale and/or different-regex-option versions, with the locale captured when fsalex_syntax () is called by the client, and all other users rigorously obtaining locale information directly or indirectly via the lexer. This is incomplete; one example that I haven't fixed yet is the using_utf8 () test in fsaparse;

3. find_pred (), in fsalex, exploits the charclass pointer guarantee to do lazy caching of predicate searches;

4. The treatment of \s and \S in fsalex_lex () has been rewritten to not use PUSH_LEX_STATE/POP_LEX_STATE, as it calls underlying resources such as find_pred () directly. (Incidentally, the IS_WORD_CONSTITUENT implementation of \w/\W in dfa.c is quite different to the \s/\S treatment in the original dfa.c: Is this code correct in a multibyte environment?);

5. I reworked FETCH_WC/FETCH_CHAR only a few days before Paul and Norihiro made extensive improvements; I've tried to integrate their changes into my version without compromising their excellent work; my version is sufficiently different, in documentation even if nowhere else, to be worth a look; and

6. My personal coding preference is to not guarantee single-exit functions, but instead to treat exceptional and/or simple cases early in the function, with an immediate function return. My hope is that this makes the remainder of the function easier to understand, as the reader knows that certain cases have been eliminated. In this vein, I (possibly rashly) decided to rewrite atom () and closure () in fsaparse; all feedback on this effort is welcome.

As before, the code modifies dfa.c to create "dfa-prl.c", so that the original code and the new code can be run in "parallel", and the outputs compared. The comparison is logged in /tmp/parallel.log. The co-existence of new and old code means that the new code has to have explicit module name prefixes in many places, at least to avoid namespace clashes.

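For readers unfamiliar with the "persistent pointer" guarantee mentioned for charclass in the list of implementation differences above, here is a minimal C sketch of one common way to provide it. It is not taken from the untangled charclass module: every name (demo_class_t, demo_pool_t, demo_pool_alloc and so on) and both size constants are hypothetical, and the real module may well work differently. The idea is to allocate classes in fixed-size blocks that are never moved, and to let only the small table of block pointers grow via realloc (), so that both the index and the address of a class stay valid for the life of the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names and sizes throughout; an illustration of the
   technique only, not the untangled charclass API.  */
enum { POOL_BLOCK_CLASSES = 64, CLASS_WORDS = 8 };

typedef struct { unsigned int bits[CLASS_WORDS]; } demo_class_t;

typedef struct
{
  demo_class_t **blocks;   /* Table of block pointers; the table may move.  */
  size_t nblocks;          /* Number of blocks allocated so far.  */
  size_t nclasses;         /* Number of classes handed out so far.  */
} demo_pool_t;

/* Hand out a zeroed class whose address stays valid until
   demo_pool_free ().  Only the table of block pointers is ever
   realloc'd; the blocks holding the classes are never moved.
   If INDEX is non-NULL, the class's persistent index is stored there.
   Returns NULL on allocation failure.  */
demo_class_t *
demo_pool_alloc (demo_pool_t *pool, size_t *index)
{
  size_t block = pool->nclasses / POOL_BLOCK_CLASSES;
  size_t slot = pool->nclasses % POOL_BLOCK_CLASSES;

  if (block == pool->nblocks)
    {
      /* Current blocks are full: grow the table by one entry.  */
      demo_class_t **table
        = realloc (pool->blocks, (pool->nblocks + 1) * sizeof *table);
      if (table == NULL)
        return NULL;
      pool->blocks = table;
      table[block] = calloc (POOL_BLOCK_CLASSES, sizeof **table);
      if (table[block] == NULL)
        return NULL;
      pool->nblocks++;
    }

  if (index != NULL)
    *index = pool->nclasses;
  pool->nclasses++;
  return &pool->blocks[block][slot];
}

/* Map a persistent index back to its (equally persistent) pointer.  */
demo_class_t *
demo_pool_get (const demo_pool_t *pool, size_t index)
{
  assert (index < pool->nclasses);
  return &pool->blocks[index / POOL_BLOCK_CLASSES][index % POOL_BLOCK_CLASSES];
}

/* Release every block and the table, leaving the pool empty.  */
void
demo_pool_free (demo_pool_t *pool)
{
  for (size_t i = 0; i < pool->nblocks; i++)
    free (pool->blocks[i]);
  free (pool->blocks);
  memset (pool, 0, sizeof *pool);
}
```

A caller would start from a zero-initialised pool (demo_pool_t pool = {0};), keep the pointers returned by demo_pool_alloc () for as long as it likes, and call demo_pool_free () once at the end. Because a pointer obtained early remains valid after any number of later allocations, a client can cache it (for example, the result of a predicate search) without re-resolving an index each time, which is the property that lazy caching relies on.
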
This message contains the updated "untangle" Lua script, along with the "strictness" module, from LuaRocks, that I use to stop global variables (usually the result of typos) from being created freely.

I'll post a follow-up message shortly, with the full set of files created and/or modified by the script, for those who (probably quite wisely) may distrust Lua scripts from strangers.

cheers,

behoffski (Brenton Hoff)
Programmer, Grouse Software