I have long been interested in how games kludge their way out of hard problems (water movement, smoke/fog, light) and dodge computer limitations (selective rendering: watch next time a game loads distant geometry first and then stops drawing it because it is covered by a box; mipmapping; light again; sound echoes, and it was not so long ago that we first saw the Doppler effect), especially as I have seen the "3D era" from the start. While this interest has not really applied to psychology just yet (the question has been asked, though), I reckon I can give an answer.
There are a fair number of muscles in the face making up its movement (I get different answers depending on where I look, but suffice it to say more than 10 work the mouth/lips alone), and they all interact in a fairly "parallel" way (everything works on everything else). As you have asked for natural results, all of these will have to be handled, and handled well (I will leave eye movement, gestures and the like for another day, although they too are probably necessary). Human beings spend most of their lives staring at faces (you can probably drum up a few studies on isolated childhoods or those with hearing/visual impairments), and as such we are all quite good at reading them: see how you can tell a bad dub, a bad presentation or a bad actor, or how you can usually tell when someone is reading aloud rather than speaking freely, and now consider that it is usually worse than this for facial movement.
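That "parallel" interaction is usually approximated in games with blendshapes: a neutral mesh plus a weighted offset per "muscle", all summed at once rather than applied one at a time. A minimal sketch (the function name, the toy three-vertex mesh and the shape names are made up for illustration):

```python
def blend_face(neutral, deltas, weights):
    """Blendshape mix: every active shape contributes a weighted offset to
    every vertex simultaneously, so shapes interact additively."""
    out = list(neutral)
    for name, delta in deltas.items():
        w = weights.get(name, 0.0)  # inactive shapes default to weight 0
        for i, d in enumerate(delta):
            out[i] += w * d
    return out

# Toy example: three 1-D "vertices" and two hypothetical shapes.
neutral = [0.0, 0.0, 0.0]
deltas = {"jaw_open": [0.0, -1.0, 0.0], "smile": [0.5, 0.0, 0.5]}
mixed = blend_face(neutral, deltas, {"jaw_open": 0.5, "smile": 1.0})
```

Real rigs work the same way, just with thousands of 3-D vertices and dozens of shapes driven every frame.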
I could possibly even quote your own post as a reference here: word/syllable length is not that hard to determine, and you can pass it off as an open-close of the mouth over said length (and even that is a fairly exotic solution for a game).
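The open-close trick above can be sketched in a few lines: crudely count syllables as vowel groups, then emit one mouth open-close cycle per syllable across the line's duration (the syllable heuristic and function names are my own made-up placeholders, not anything from a real engine):

```python
import re

def estimate_syllables(word):
    """Very crude syllable count: one per run of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def lip_flap_keyframes(line, duration):
    """Return (time, openness) keyframes: one open-close cycle per syllable,
    closed (0.0) at each cycle boundary, fully open (1.0) mid-cycle."""
    syllables = sum(estimate_syllables(w) for w in line.split())
    cycle = duration / syllables
    keys = []
    for i in range(syllables):
        t = i * cycle
        keys.append((t, 0.0))              # mouth closed at cycle start
        keys.append((t + cycle / 2, 1.0))  # mouth open at cycle midpoint
    keys.append((duration, 0.0))           # closed when the line ends
    return keys
```

It looks wrong up close, but at a distance, or on a low-polygon head, it passes.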
The problem is then twofold:
A game is usually dubbed in at least 5 languages (and that number is ever increasing), and text scripts have long hit the several-megabyte mark; I am guessing your game of choice could usually wear the "story driven" label quite easily.
I have yet to see a good text-to-speech algorithm (the same goes for the reverse, which is probably needed too) for any language beyond the simplistic or scientifically constructed made-up ones. Factoring in accents/dialects and general voice differences is harder still, and then add in emotion or injury (part of any good story, no?). Text or speech to predicted facial muscle movement probably does not exist outside of medicine and serious film work (I would love to read an article on it, so if someone reading has one, a link would be great).
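If you did want a halfway house between lip flap and full muscle prediction, the usual one is mapping phonemes to visemes (distinct mouth shapes). A sketch, assuming you already have phonemes from somewhere; the table here is a tiny made-up subset, not a real standard:

```python
# Hypothetical reduced phoneme-to-viseme table; real tables cover
# a full phoneme set and distinguish far more mouth shapes.
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",   # lips pressed together
    "f": "lip_bite", "v": "lip_bite",              # teeth on lower lip
    "aa": "wide_open",                             # open vowel
    "iy": "narrow",                                # spread lips
    "uw": "rounded",                               # rounded lips
}

def visemes_for(phonemes):
    """Map each phoneme to a mouth shape, falling back to neutral."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]
```

Many-to-one is the point: the eye cannot tell "p" from "b" from "m" anyway, which is why a handful of shapes passes for speech.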
At this point I shall mention that we still see many "epic" games with massive budgets use text more often than not, or reserve speaking parts for big characters (or, worse, a bank of stock phrases). While the last few years have seen a slight shift, voice-over work may not pay all that well or be held in high regard, with all the connotations of such a thing.
Ignoring algorithms for a second, I suppose you could go for facial capture, similar to the body motion capture we have long seen used, but budget comes back into it: a recording studio is relatively inexpensive to set up, while facial capture is not quite as easy.
Question two here is how do I do this for a dragon, an elf, an orc, an alien...
Secondly, as hinted at above, that many muscles means a serious need for polygons. While you could probably tweak the mipmap and range-based level-of-detail algorithms to stop that many polygons being rendered all the time, the fact that you have to do it at all is going to drain resources faster.
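The range-based trick I mean is just a level-of-detail lookup: keep several face meshes and pick a coarser one as the camera gets further away. A minimal sketch (the thresholds and polygon counts are invented for illustration):

```python
# Hypothetical LOD table: (max_camera_distance, polygon_count),
# sorted by distance ascending, with an open-ended last entry.
FACE_LODS = [
    (2.0, 8000),            # close-up: full facial detail
    (10.0, 2000),           # conversation range: reduced mesh
    (float("inf"), 300),    # background: barely a face at all
]

def pick_face_lod(distance, lods=FACE_LODS):
    """Return the polygon budget for the first LOD covering this distance."""
    for max_dist, polys in lods:
        if distance <= max_dist:
            return polys
    return lods[-1][1]  # unreachable with an inf-terminated table
```

The saving is exactly the one described above: you only pay the full facial polygon cost for the handful of frames where the player is close enough to notice.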
Add in complaints about AI, level design, graphics, the script itself, the general level of interactivity with the world and its characters, and the number of objects in the world (I am greatly enjoying the return of procedural generation), plus simply getting the game to work properly, and I reckon this will remain one of the "make something passable" sort of things.