Norihiro Tanaka wrote:
> Could you try above cases?

Thanks.  You're observing a 2.7x speedup with macros on your platform
and your benchmark.  With the same patch, I observed only a 1.18x
speedup on the same benchmark.  As usual, I'm testing on an AMD
Phenom II X4 910e with GCC 4.9.0, Fedora 20, and the default (-O2)
optimization.  I'm curious why you're observing a much bigger
performance difference with macros.  What platform are you using?

Anyway, an 18% speedup is still a speedup, so I looked into it.  GCC
4.9.0 misses a non-obvious opportunity for function inlining.  I
installed a tweak (attached) that should make the inlining opportunity
obvious to current compilers.  On my platform this gave a 28% speedup,
i.e., a bit better than the 18% I measured with the macro-using patch.
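
For readers without the attachment, here is a rough, generic sketch of
the macro-versus-inline-function trade-off.  It is not the attached
tweak: the names IS_WORD_MACRO, is_word_inline and the two counting
helpers are hypothetical, and the use of 'static inline' is only an
assumption about one common way to make a small function an easy
inlining candidate.

```c
#include <stddef.h>

/* Hypothetical macro version: expanded textually at each use, so there
   is never a call, but the argument is evaluated more than once.  */
#define IS_WORD_MACRO(c) (((c) >= 'a' && (c) <= 'z') || (c) == '_')

/* Hypothetical function version: type-checked, evaluates its argument
   once, and small enough that an optimizing compiler can typically
   inline it when the definition is visible at the call site.  */
static inline int
is_word_inline (unsigned char c)
{
  return (c >= 'a' && c <= 'z') || c == '_';
}

/* Count matching bytes using the macro.  */
static size_t
count_word_chars_macro (char const *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    count += IS_WORD_MACRO ((unsigned char) s[i]);
  return count;
}

/* Count matching bytes using the inline function.  */
static size_t
count_word_chars_inline (char const *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    count += is_word_inline ((unsigned char) s[i]);
  return count;
}
```

Both helpers compute the same result; whether they also run at the
same speed depends on whether the compiler actually inlines the
function, which is the kind of difference the measurements above are
about.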