I found it when I was testing various new regular expression algorithms against grep (which I used as the golden standard for this).
I used a random generator for regular expressions (using the QuickCheck framework) and then shrinking/delta debugging to automatically find the smallest failing test case.
BTW, if you are interested, I could do a larger more targeted effort stress testing grep like this and possibly find more test cases with unexpected behavior. I would need some guidance on where to put most effort in order to be as useful as this can be. I could find a MSc student to help out with that. Let me know if this sounds like an interesting thing to do!
kind regards,
/Koen