One possibility worth mentioning would be to add a files0-from=F option to dd,
like the --files0-from=F option that du, sort and wc already have.
Those tools have it because they need to operate on the complete input set,
for accumulation or sorting, and thus can't be split into separate runs
with xargs or similar.
dd could use it for a different reason: its operand syntax (if=, of=)
differs from the standard tools, so xargs can't simply hand it a list
of file names. The option would therefore give a general method
to efficiently read many files.
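
To make the idea concrete, here is a rough sketch in Python of what such
a mode would amount to: take a NUL-separated list of names from a file and
read each named file in turn within one process. This is only an
illustration, not an existing dd feature; the function names read_names0
and read_all are made up for the example, and silently skipping empty
names is a simplification.

```python
def read_names0(list_path):
    """Return the file names stored NUL-separated in list_path (as bytes)."""
    with open(list_path, "rb") as f:
        data = f.read()
    # Simplification for this sketch: drop empty entries, such as the one
    # that follows a trailing NUL.
    return [name for name in data.split(b"\0") if name]


def read_all(names, bufsize=1 << 20):
    """Read every named file in a single process; return total bytes read."""
    total = 0
    for name in names:
        with open(name, "rb") as f:
            while True:
                chunk = f.read(bufsize)
                if not chunk:
                    break
                total += len(chunk)
    return total
```

A list in that format can be produced with "find . -type f -print0", and
read_all(read_names0("list0")) would then read everything it names.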
A related consideration: the above would let a single process handle
everything, but it might be better to split the load into one process
per CPU.
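
As a rough sketch of the per-CPU variant, again in Python and again only
an illustration: deal the names out round-robin into one list per CPU and
read each list in its own worker process. The names split_round_robin,
read_files and read_parallel are made up for the example, and the "fork"
start method is an assumption that limits this to Unix-like systems.

```python
import multiprocessing
import os


def split_round_robin(names, nparts):
    """Deal names into nparts lists, the way cards are dealt."""
    return [names[i::nparts] for i in range(nparts)]


def read_files(names, bufsize=1 << 20):
    """Read every named file; return the total number of bytes read."""
    total = 0
    for name in names:
        with open(name, "rb") as f:
            while chunk := f.read(bufsize):
                total += len(chunk)
    return total


def read_parallel(names, nprocs=None):
    """Read the files using up to one worker process per CPU."""
    nprocs = nprocs or os.cpu_count() or 1
    # Drop empty parts so no worker is started with nothing to do.
    parts = [p for p in split_round_robin(names, nprocs) if p]
    if not parts:
        return 0
    # Assumption: "fork" is available (Unix-like systems only).
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=len(parts)) as pool:
        return sum(pool.map(read_files, parts))
```

Round-robin dealing is the simplest split; it ignores file sizes, so one
worker can still end up with most of the data.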
thanks,
Pádraig.