The Impossible Dream: Perfect Lip Sync

There is definitely plenty that can be done to improve lip sync.  Making it perfect, however, might not be possible.

Perhaps it would be best to start with a definition.  Lip sync is the synchronization of the sounds emerging from moving lips with the images of those moving lips.  No moving images, no lip-sync issues, per se.

There are many creation myths, and one associated with moving images and sound is that The Jazz Singer (1927) was the first sound movie.  It wasn’t.

It wasn’t even the first Warner Bros. Vitaphone synchronized-sound feature movie.  And it wasn’t the first “all-talking, all-singing” sound movie, not least because it wasn’t all talking or all singing.  Here’s a typical “silent” movie “intertitle,” from one of many non-talking sections of The Jazz Singer:

Jazz Singer slide

What the first sync-sound movie actually was is not obvious.  Scientific American suggested adding sound to 3D projected images in 1877, but those were to be still pictures.  Wordsworth Donisthorpe responded in Nature a few weeks later that he could do it with moving pictures.

It’s possible (based on recollections decades later) that some experimental apparatus was built around 1888.  Edison wrote in his fourth motion-picture patent caveat that “all movements of a person photographed will be exactly coincident with any sound made by him.”

There’s no question that Edison demonstrated sound-movie Kinetophones by 1893.  But, despite a contemporary report that the sound was in sync with the pictures, it’s possible that the sound merely started at the same time as the pictures.  And the Kinetophone was a one-viewer-at-a-time, short-duration system.

There’s also no question that a form of sync-sound movies was shown at the Phono-Cinéma-Théatre at the World’s Fair in Paris in 1900.  But the system was different from what we’re accustomed to in video production today.

First, the pictures were captured.  Then, watching the images of themselves on screen, the performers lip-synched to what they had done during a phonograph sound-recording session.

In presentation, the process was reversed.  The projectionist used a telephone receiver to listen to the sound (from a phonograph in the orchestra pit) and adjusted the cranking speed of the projector to maintain lip sync (or at least to attempt to maintain something pretty close to proper lip sync).

True lip sync, with sound and picture locked, was actually patented towards the end of the 19th century, and implemented no later than the first decade of the 20th.  More of the history may be found here:

From roughly the beginning of the 20th century to the introduction of digital video processing in the early 1970s, there was good lip sync.  But it wasn’t always automatic.

Movie sound was typically recorded separately from pictures.  A clapper atop the slate provided a sync point, and various mechanisms were used to make the camera and sound-recorder motors run in sync, but sound was manually synchronized to picture.  Video recorders captured both sound and picture together, but editors using early mechanical equipment had to take into consideration a considerable distance between the video and audio heads.

Then came that digital video processing.  The CVS 500 in 1973 could not only synchronize incoming feeds but also shrink them to a quarter of their size, something that seems trivial today but was near miraculous at the time.  Unfortunately, it also delayed the video by one field (half a frame).

In the grand scheme of things, half a frame is not a lot.  But multiple passes through video-delaying devices soon followed.  A feed to a network might get synchronized, and then the network’s feed to a station might get synchronized again.  One pass through a digital effects processor might have been used to shrink an image so it fits within a larger one, and another pass might have been used to push both images off the screen.

International standards converters intentionally used longer delays to help with their frame-rate conversion.  Today, there are also up- and down-converters to and from HDTV and 24p.

There was even a video delay caused, during a brief period of madness in U.S. television, by a different timing issue.  When NTSC color was introduced in 1953, there was no specified relationship between the phase of the color subcarrier and the horizontal sync pulse, because it didn’t matter.  When color recorders were introduced, however, that lack of specificity tended to increase the size of the horizontal blanking interval (the period between the end of video at the right edge of the picture and its start at the left).

After enough generations of re-recording and editing, the increase could violate FCC regulations (though it was almost never enough to be visible on a home TV).  So, after digital video effects units were introduced that could expand the picture, broadcasters began using them to conform to the regulations.  Pictures got blurry, and sound got out of sync, before the FCC announced that it wouldn’t demand the correction.

All of those video-delaying devices effectively advanced the sound relative to the picture, the worst possible lip-sync problem, because sound never arrives before light in nature.  And, initially, there were no matching audio delays.  Some news broadcasts (usually involving frame synchronizers and often adding standards conversion and video effects) started to look as non-synchronous as some Fellini movies.  Today, with audio delays available, there’s no longer any good excuse for lip-sync errors in production and post.
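To make the arithmetic concrete, here is a minimal Python sketch of how a matching audio delay might be worked out for a chain of video-delaying devices.  The device names and their per-device delays are hypothetical, chosen only for illustration; the one-field figure for a synchronizer follows the CVS 500 example above, and the NTSC color field rate of 60000/1001 per second is assumed.

```python
from fractions import Fraction

# Assumed NTSC color field rate: 60000/1001 fields per second (about 59.94).
NTSC_FIELD_RATE = Fraction(60000, 1001)


def matching_audio_delay_ms(video_delays_in_fields, field_rate=NTSC_FIELD_RATE):
    """Return the audio delay, in milliseconds, equal to the summed video delay."""
    total_fields = sum(video_delays_in_fields)
    return float(total_fields / field_rate * 1000)


# Hypothetical signal chain; every delay value here is illustrative only.
chain = {
    "network frame synchronizer": 1,  # one field, as in the CVS 500 example
    "station frame synchronizer": 1,
    "effects pass (shrink)": 2,
    "effects pass (push off screen)": 2,
}

delay_ms = matching_audio_delay_ms(chain.values())
print(f"{sum(chain.values())} fields of video delay -> delay audio by {delay_ms:.1f} ms")
```

The point of the sketch is only that the delays add: each pass is small, but the audio delay has to match the total, not any single device.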

Then there’s distribution, commonly involving MPEG bit-rate reduction.  Presentation time stamps (PTS) are used to lock audio and video together.  Unfortunately, decoders aren’t required to use them, and, if they don’t, lip sync can slip.  If your TV set, cable box, or satellite receiver has slipping lip sync, the best you can do (other than complaining) is change channels and come back; the signal interruption will usually cause the decoder to re-lock audio to video.  And, if you’ve been watching the same channel for a long time, it might be a good idea to change channels and return before settling in for a movie.
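As a rough illustration of what using the time stamps means, the Python sketch below measures lip-sync error from PTS values.  MPEG PTS values count ticks of a 90 kHz clock in a 33-bit field that wraps around; everything else here (the function names and the idea of logging the moment each unit actually comes out of the decoder) is a hypothetical setup for illustration, not how any particular decoder is built.

```python
PTS_CLOCK_HZ = 90_000      # PTS ticks per second
PTS_MODULUS = 1 << 33      # PTS is a 33-bit counter, so it wraps around


def pts_diff(a, b):
    """Signed difference a - b in PTS ticks, allowing for 33-bit wraparound."""
    d = (a - b) % PTS_MODULUS
    if d >= PTS_MODULUS // 2:
        d -= PTS_MODULUS
    return d


def av_sync_error_ms(audio_pts, audio_out_s, video_pts, video_out_s):
    """Lip-sync error in ms; positive means audio is late, negative means early.

    audio_out_s and video_out_s are the (hypothetically logged) times, in
    seconds, at which the audio and video units were actually presented.
    """
    intended_gap_s = pts_diff(audio_pts, video_pts) / PTS_CLOCK_HZ
    actual_gap_s = audio_out_s - video_out_s
    return (actual_gap_s - intended_gap_s) * 1000.0


# Audio stamped 3003 ticks (one 29.97 fps frame) after the video,
# but presented at the same instant: the audio is one frame early.
print(av_sync_error_ms(903_003, 10.0, 900_000, 10.0))
```

A decoder that honors the time stamps would hold this error near zero; one that ignores them has nothing to stop the error from drifting.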

After enough complaints or lost business, perhaps all decoders will someday keep and maintain lip sync.  And it’s certainly possible to make sure any full-picture video delays are matched by audio delays (imaging chips and displays sometimes introduce differential delays between the tops and bottoms of pictures, but they’re very brief).  But then there is space, the final frontier as far as lip sync is concerned.

Light travels so fast that it’s essentially instantaneous.  Sound is a lot slower.  Aircraft have traveled faster than sound; bullets do it a lot.  At nominal temperature and humidity, sound travels a little less than 37 feet in the course of one video frame.

If someone is singing 50 feet away from a microphone (as on an opera stage), the audio will be picked up more than a frame late.  If the sound is then heard in the back row of a movie theater, there will be still more delay.
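The figures above can be checked with a short Python sketch.  It assumes a 29.97 frames-per-second video rate and the common linear approximation for the speed of sound in dry air; the default temperature of 10 °C is an assumption, not something stated here, and the result shifts from roughly 36.9 feet per frame at 10 °C to roughly 37.6 feet at 20 °C.

```python
FRAME_RATE = 30000 / 1001   # assumed video frame rate (about 29.97 fps)
METRES_PER_FOOT = 0.3048


def speed_of_sound_ft_per_s(temp_c=10.0):
    """Approximate speed of sound in dry air (linear approximation)."""
    metres_per_s = 331.3 + 0.606 * temp_c
    return metres_per_s / METRES_PER_FOOT


def feet_per_frame(temp_c=10.0, frame_rate=FRAME_RATE):
    """Distance sound travels during one video frame."""
    return speed_of_sound_ft_per_s(temp_c) / frame_rate


def acoustic_delay_frames(distance_ft, temp_c=10.0, frame_rate=FRAME_RATE):
    """Delay, in frames, for sound to cover distance_ft."""
    return distance_ft / feet_per_frame(temp_c, frame_rate)


print(f"sound travels {feet_per_frame():.1f} ft per frame")
print(f"a singer 50 ft from the microphone is {acoustic_delay_frames(50):.2f} frames late")
```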

Inauguration of President Harding

There’s one way around this.  It’s called visual-acoustic perspective.  When we see someone speaking from a distance, we don’t expect the lip sync to be correct.  That’s why someone sitting three frames’ worth of sound travel away from the stage, hearing a singer another two frames behind the proscenium, doesn’t think there’s anything wrong.

Unfortunately, tight lenses can create close-ups, and close-ups make people want tight lip sync, even when it’s physically impossible.  There have already been cases when viewers of live transmissions to cinemas have complained of varying lip sync when all that was happening was cutting between wide shots and close-ups.

Directors of productions shown at large viewing distances should bear that problem in mind.  Otherwise, there’s not much that can be done about acoustic lip-sync issues.  Advancing the sound doesn’t help viewers in the front row.
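A small Python sketch shows why no single audio advance suits a whole cinema.  The seat distances and the 1.5-frame advance are hypothetical, the loudspeakers are assumed to be at the screen, and the rounded figure of 37 feet of sound travel per frame is used.

```python
def perceived_offset_frames(seat_distance_ft, audio_advance_frames, feet_per_frame=37.0):
    """Frames by which sound lags picture at a seat (negative: sound leads)."""
    return seat_distance_ft / feet_per_frame - audio_advance_frames


# Hypothetical seats, measured from loudspeakers assumed to be at the screen.
seats_ft = {"front row": 10, "middle": 60, "back row": 110}
advance = 1.5  # hypothetical audio advance, in frames

for name, distance in seats_ft.items():
    offset = perceived_offset_frames(distance, advance)
    direction = "sound leads picture" if offset < 0 else "sound lags picture"
    print(f"{name}: {offset:+.2f} frames ({direction})")
```

With these assumed numbers, an advance that narrows the error in the back row pushes the sound ahead of the picture in the front row.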

For everything else, just make sure all video delays are matched by audio delays.  And complain regularly about decoders not using time stamps.
