I was going to tell nop90 that emulators can't really be parallelized. Manycore code has to be written with that in mind from the beginning, and it only applies when you have lots of the same kind of calculation to do. It's only worth it for large amounts of data, where the overhead of copying data to VRAM and back becomes negligible in the long run and responsiveness is not a concern. Emulators, on the other hand, need to react quickly to constantly changing data (instructions).
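
To make the difference concrete, here is a rough sketch in Python. Everything in it is made up for illustration: `brighten`, `run`, and the three-opcode "CPU" are hypothetical and not taken from any real emulator. The first function is the kind of work that suits many cores, since every element is independent. The second is a toy interpreter loop where each step needs the register and program counter left behind by the previous step, so the steps can't simply be handed out to different cores.

```python
# Data-parallel case: every element is independent of the others, so the
# list could be split across many cores and processed in any order.
def brighten(pixels, amount):
    return [min(255, p + amount) for p in pixels]


# Emulator case: a made-up 3-instruction CPU (hypothetical, for illustration).
# Each step reads the accumulator and program counter written by the step
# before it, so step N cannot start until step N-1 has finished.
def run(program, max_steps=1000):
    acc, pc, steps = 0, 0, 0
    while pc < len(program) and steps < max_steps:
        op, arg = program[pc]
        if op == "ADD":
            acc += arg
            pc += 1
        elif op == "JNZ":  # jump to address `arg` if acc is not zero
            pc = arg if acc != 0 else pc + 1
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        steps += 1
    return acc


# Counts the accumulator down from 3 to 0; where the loop goes next is
# only known after the previous instruction has run.
program = [("ADD", 3), ("ADD", -1), ("JNZ", 1), ("HALT", 0)]
```

In `run`, the next instruction to fetch is only known once the current one has updated `pc`, which is the dependency that gets in the way of spreading the loop over many cores.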
...