SVG Tech Insight: Developing an Intelligent Timing Model for Live Media Production in the Cloud


Introduction

In live media production, timing has two primary functions. First, it lets the production team track the contributions to the program so they can tell a logical, linear story. Second, it synchronizes the processing equipment used in the production chain so that team members can do things like seamlessly switch between video elements.

The traditional method for timing a production uses video frames as a standard unit of measure. But new models are required for distributed live productions in which part of the signal switching occurs in the cloud.

This paper focuses on live production, presenting a methodology for using new technology to keep each operator’s user experience coherent at their location while enabling individual contributions to align logically in the final production. Workflows that use cloud service providers for asset storage, content sharing, or program playout and emission rely far less on precise timing.

Timing Problems in Distributed Production

A common reason to adopt a cloud topology is to coordinate work across a geographically dispersed team with distributed processing. Problematically, geographic distance creates delay in information transmission. Even across the most advanced fiberoptic network connections available, data can’t travel faster than the speed of light. Therefore, individuals contributing to a production will have varying amounts of latency that depend on the physical distance between the operator and the data center where the processing occurs. The farther the separation, the longer the delay.
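
To make that physical limit concrete, the short sketch below (an illustration, not part of the original paper) estimates the minimum one-way delay imposed by distance alone, assuming light travels at roughly 200,000 km/s in optical fiber; the city-pair distances are approximate and purely illustrative.

    # Illustrative estimate of the hard lower bound that distance places on latency,
    # before any network equipment is considered.
    FIBER_SPEED_KM_PER_S = 200_000  # approximate speed of light in glass fiber

    def propagation_delay_ms(distance_km: float) -> float:
        """Return the minimum one-way delay, in milliseconds, over a fiber path."""
        return distance_km / FIBER_SPEED_KM_PER_S * 1000

    for city_pair, km in [("New York -> Los Angeles", 3_940),
                          ("London -> Sydney", 17_000)]:
        print(f"{city_pair}: at least {propagation_delay_ms(km):.1f} ms one way")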

In reality, connections that approach light-speed performance aren’t available over long distances, and those that come close are generally very expensive. Instead, wide area networks are typically employed, but network equipment introduces additional delay. Individuals connecting over the internet have little control over the number of hops their data may take before arriving at its destination, further increasing path-dependent latency. As a result, the times for receiving and returning information will vary for each member of a geographically distributed production team.

While some aspects of a production may be accomplished in parallel, many steps in creating a live program feed must be sequential to maintain the thread of the creative process. This requires preserving, for each team member, the perception that their contributions are part of a real-time sequence, regardless of when those contributions are received and returned. It also requires that all contributions to the program be correctly aligned before the program is distributed to the audience.

Why Proposed Solutions Don’t Solve the Problems

Signal standardization and content creation introduce latency
Most signals within a broadcast facility must be processed. Each processing step creates additional delay.1 For example:

  • Initial synchronization of a “wild feed” = 1-2 frames
  • Conversion of program inputs to production format = 2 frames
  • Creation of video effects in production switcher = 2 frames
  • Conversion of program to transport formats and distribution codecs = 2+ frames

Traditionally, the signals within the system are adjusted, or “back-timed,” relative to a master clock to maintain alignment. In the example above, if any one signal followed the outlined path, all of the signals in the facility would need to be back-timed by a minimum of 8 frames. But what is the master clock in a distributed system? It is possible to create a global master clock: modern network technology is built on NTP and the higher-precision PTP, clock protocols that trace back to atomic clocks and enable synchronization on a global scale. But even a worldwide clock must respect causality; some actions must follow others in sequence.
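
As a simple illustration of back-timing (the per-signal delays below are hypothetical, loosely based on the example path above), each signal is delayed so that every path matches the longest processing chain:

    # Hypothetical per-signal processing delays, in frames, for several sources.
    path_delay_frames = {
        "camera_1": 8,    # full chain: sync + conversion + effects + transport
        "camera_2": 4,    # skips effects and transport encode
        "graphics": 2,
        "wild_feed": 8,
    }

    # Back-time everything to the slowest path: the extra delay each signal needs
    # so that all sources stay aligned to the facility master clock.
    slowest = max(path_delay_frames.values())
    for name, delay in path_delay_frames.items():
        print(f"{name}: add {slowest - delay} frame(s) of delay to stay aligned")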

Transit times introduce additional variable latency
Because every contribution in a distributed system has a different transit time from the operator’s workstation to the processing location, the system would have to treat each one as a wild feed, requiring a compensating adjustment for every network path and every output of that source.

If all timestamps are forced to line up sequentially using a combination of the operator’s workstation clock and a master clock, the production will experience significant and growing delay over the course of the program, as team members are forced to wait while other members return their contributions. The total time it takes to produce the content grows longer and longer.
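
A toy calculation (the operator latencies below are invented for illustration) shows how those sequential waits accumulate when each step must complete a full round trip to the processing location before the next can begin:

    # Hypothetical one-way latencies (ms) between each operator and the data center.
    one_way_ms = {"director": 15, "replay_op": 45, "graphics_op": 70, "audio_mixer": 30}

    # Each sequential step waits for the previous contribution's full round trip.
    total_ms = 0.0
    for step, operator in enumerate(["director", "replay_op", "graphics_op", "audio_mixer"], 1):
        round_trip = 2 * one_way_ms[operator]
        total_ms += round_trip
        print(f"step {step} ({operator}): wait {round_trip} ms, cumulative {total_ms:.0f} ms")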

How Do We Provide a Solution That Works?

To provide a timing solution that works, we have to return to our fundamental reasons for timing:

  • Enabling the production team to tell a linear story.
  • Synchronizing the equipment in the production chain.

Enabling the production team to tell the best story
Our standard for system response times should match actual human realities. The de facto standard of measuring time with a frame-based clock at 30 (25) frames per second, roughly 33 (40) msec per frame, comes from an era when analog color subcarrier frequencies required this timing to ensure accurate delivery of color to the television set. This time base far exceeds what humans, and modern TVs, actually need.

Studies show that the fastest a person can react to an outside stimulus without any kind of pre-cue is about 180 milliseconds. As shown in the timeline illustration, a minimum response time to receive a stimulus, take action, and recognize the new state is about 240 milliseconds. The fastest replay operator in the world would take 240 milliseconds to see the action on the monitor, press the mark-in button, and recognize that the clip recording has started.

Figure 1: Human reaction timeline

More important than the actual time it takes to respond is the time it takes to notice a difference. People perceive differences in rates of change much faster than they perceive the change itself, provided the rates differ by more than about 20%. An ideal observer will not be able to perceive the difference between 100 and 120 milliseconds of delay.

Finally, the synchronicity of interrelated stimuli must also be taken into account. Humans are most sensitive to offsets between audio and video and least sensitive to offsets between video and video. As an example, consider a multiviewer on a monitor wall with six video windows. If all six windows change within 120 milliseconds (100 ms plus 20%) of the cue, they will be perceived as in time; but if one changes first, the remainder will be perceived as late. Audio and video must change within 80 milliseconds of each other for the offset to be imperceptible, which is why lip-sync errors are so noticeable.
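
These tolerances can be expressed directly in software. The sketch below (hypothetical function names, with the tolerance values taken from the discussion above) checks whether events at an operator’s position still feel coherent:

    VIDEO_VIDEO_WINDOW_MS = 120   # 100 ms plus the ~20% perceptual margin
    AUDIO_VIDEO_WINDOW_MS = 80    # audio/video (lip-sync) tolerance

    def multiviewer_coherent(change_times_ms: list[float]) -> bool:
        """True if all windows change within the video-to-video window of the first change."""
        return max(change_times_ms) - min(change_times_ms) <= VIDEO_VIDEO_WINDOW_MS

    def lip_sync_ok(audio_ms: float, video_ms: float) -> bool:
        """True if audio and video events fall within the audio-to-video tolerance."""
        return abs(audio_ms - video_ms) <= AUDIO_VIDEO_WINDOW_MS

    print(multiviewer_coherent([0, 40, 60, 90, 110, 115]))  # True: all within 120 ms
    print(lip_sync_ok(audio_ms=210, video_ms=100))          # False: 110 ms apart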

It is important to remember that the user experience only needs to feel live to the operator. At the local workstation, audio, video, monitoring, intercom, and control must all align within the tolerance ranges discussed above.

Unlike previous operational models, the local response times can be independent of any clock time. To feel live, they simply need to be coherent with one another. If the system manages the differential latency of the essences arriving at the operator’s location, then back-timing sources is not required.

All creative decisions made by the operator and their associated processing time can be tracked relative to the operator’s time. The order and local timing of the decisions are maintained. The operator experiences the phase-aligned environment they are used to. Yet the total environment is time-shifted relative to the source.

To maintain linear storytelling, the final result of the operator’s work is time-stamped with whatever offset best synchronizes that work across the production chain.
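
A minimal sketch of that idea (the data model and offset value are hypothetical) records decisions against the operator’s local, coherent timeline and then re-stamps them for the program timeline, preserving their order and relative spacing:

    from dataclasses import dataclass

    @dataclass
    class Decision:
        action: str
        local_time_ms: float  # when the operator made the decision, in their own timeline

    # Decisions captured in the operator's phase-aligned local environment.
    decisions = [Decision("cut_to_cam_2", 1_000.0),
                 Decision("roll_replay", 3_250.0)]

    # Offset chosen by the system to align this operator's work with the program timeline.
    chain_offset_ms = 480.0

    program_timeline = [(d.action, d.local_time_ms + chain_offset_ms) for d in decisions]
    print(program_timeline)  # order and relative spacing of the decisions are preserved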

Synchronizing the production chain

To provide a unified timing model across the production chain, modern technology should follow a common design strategy:

a) For each audio, video, or other essence stream entering the system, identify an essence landmark, such as the video top-of-frame or an audio time stamp.
b) Align all common essences based on their landmarks, with an established relationship between the different essence types.
c) As editorial decisions are made, time-align the decisions with the essence.
d) Process the essence as orchestrated by the editorial decisions. The editor, or processing function, can be located anywhere.
e) Time stamp the final output based on any user-defined clock.
f) If required, a final NTP or PTP time stamp may be added.

Using these steps, the relative latency values of all essences are known, and differential adjustments can be calculated. A common time base is required only as essences are aligned, and it may be any time base that is mutually suitable for all essences about to be processed.
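
The differential-adjustment step might look like the sketch below (the landmark arrival times are invented for illustration): each essence is buffered just long enough to meet the latest-arriving one, rather than being back-timed to a facility-wide master clock.

    # Arrival time (ms) of each essence's landmark at the processing point.
    landmark_arrival_ms = {"video_cam_1": 122.0, "video_cam_2": 140.0,
                           "audio_mix": 95.0, "graphics_key": 130.0}

    # Align to the latest-arriving essence; that instant is the mutually suitable time base.
    align_to = max(landmark_arrival_ms.values())
    for name, arrival in landmark_arrival_ms.items():
        print(f"{name}: buffer by {align_to - arrival:.0f} ms before processing")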

Unchaining individual workstations from external time is possible because today we operate faster than real time, using technologies that did not exist when frames-per-second timing was adopted. Frame synchronizers are replaced by memory buffers whose depth is adjusted to match the timing offset required for each essence.
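
As a rough illustration of that buffering idea (a toy class, not a production implementation), a memory buffer’s depth stands in for the timing offset an essence needs:

    from collections import deque

    class DelayBuffer:
        """Holds items for `depth` pushes before releasing them, emulating a timing offset."""
        def __init__(self, depth: int):
            self.depth = depth
            self._queue = deque()

        def push(self, item):
            """Insert a new frame; return the frame that has aged past the buffer depth, if any."""
            self._queue.append(item)
            if len(self._queue) > self.depth:
                return self._queue.popleft()
            return None

    # An essence that needs three frame-times of offset is simply buffered three slots deep.
    buf = DelayBuffer(depth=3)
    for n in range(6):
        released = buf.push(f"frame_{n}")
        print(f"in: frame_{n}  out: {released}")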

Following this design strategy, any live production task can be carried out in what feels like real time and assembled in a linear fashion to create programming that exceeds audience expectations. Even with complicated production tasks, the total execution time is a few seconds. Compare this with today’s traditional live broadcasts which, in the best of circumstances, still take as much as 50 seconds to reach final emission delivery to the home.
