SVG Tech Insight: Increasing Value of Sports Content - Machine Learning for Up-Conversion HD to UHD
This fall SVG will be presenting a series of White Papers covering the latest advancements and trends in sports-production technology. The full series of SVG’s Tech Insight White Papers can be found in the SVG Fall SportsTech Journal.
Following the height of the 2020 global pandemic, live sports are starting to re-emerge worldwide — albeit predominantly behind closed doors. For the majority of sports fans, video is the only way they can watch and engage with their favorite teams or players. This means the quality of the viewing experience itself has become even more critical.
With UHD being adopted by both households and broadcasters around the world, viewers now have marked expectations of visual quality. To meet these expectations in the near term, it will be necessary for some years to up-convert from HD to UHD when creating 4K UHD sports channels and content.
This is not so different from the early days of HD, when SD sports content had to be up-converted to HD. In the intervening years, however, machine learning has progressed sufficiently that an ML-based up-converter designed specifically for TV content is a serious contender to outperform more conventional techniques.
Ideally, we want to process HD content into UHD with a simple black box arrangement.
The problem with conventional up-conversion, though, is that it does not offer improved resolution, so it does not fully meet the expectations of the viewer at home watching on a UHD TV. The question, therefore, becomes: can we do better for the sports fan? If so, how?
Traditional approaches to up-conversion
UHD is a progressive scan format, with the native TV formats being 3840×2160, known as 2160p59.94 (usually abbreviated to 2160p60) or 2160p50. The corresponding HD formats, with the frame/field rates set by region, are either progressive 1280×720 (720p60 or 720p50) or interlaced 1920×1080 (1080i30 or 1080i25).
Conversion from HD to UHD for progressive images at the same rate is fairly simple. It can be achieved using spatial processing only. Traditionally, this might use a bi-cubic interpolation filter (a two-dimensional interpolation commonly used for photographic image scaling). This uses a grid of 4×4 source pixels and interpolates intermediate locations in the center of the grid. The conversion from 1280×720 to 3840×2160 requires a 3x scaling factor in each dimension and is almost the ideal case for an upsampling filter.
These types of filters can only interpolate. The resulting image is better than one produced by nearest-neighbor or bi-linear interpolation, but it does not have the appearance of a higher resolution.
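As a rough illustration of this kind of spatial filter, the sketch below up-scales a small image by an integer factor using the Keys cubic convolution kernel, applied separably so that each output pixel draws on a 4×4 neighborhood of source pixels. The kernel parameter a = -0.5, the edge clamping, and the function names are assumptions made for the example, not details taken from any particular broadcast up-converter.

```python
import math

def cubic_weight(t, a=-0.5):
    """Keys cubic convolution kernel; a = -0.5 is a commonly used value (assumed here)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t**3 - (a + 3) * t**2 + 1
    if t < 2:
        return a * t**3 - 5 * a * t**2 + 8 * a * t - 4 * a
    return 0.0

def upscale_row(row, factor):
    """Up-scale one row of samples by an integer factor using four source taps."""
    n = len(row)
    out = []
    for i in range(n * factor):
        # Position of output sample i in source coordinates (pixel centers aligned).
        x = (i + 0.5) / factor - 0.5
        base = math.floor(x)
        acc = 0.0
        for k in range(base - 1, base + 3):
            src = row[min(max(k, 0), n - 1)]  # clamp at the image edges
            acc += src * cubic_weight(x - k)
        out.append(acc)
    return out

def upscale_image(img, factor):
    """Apply the 1-D filter along rows, then along columns (4x4 support overall)."""
    rows = [upscale_row(r, factor) for r in img]
    cols = [upscale_row(list(c), factor) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# 720p to 2160p needs a factor of 3 in each dimension (3840 / 1280 = 2160 / 720 = 3).
small = [[5.0, 5.0], [5.0, 5.0]]
large = upscale_image(small, 3)
```

Because the filter weights sum to one, a flat area stays flat: the filter redistributes existing information and cannot add detail.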
Machine learning and image scaling
Machine Learning (ML) is a technique whereby a neural network learns patterns from a set of training data. Images are large, and it becomes infeasible to create neural networks that process a whole image as a single set of inputs. So, a different structure is used for image processing, known as a Convolutional Neural Network (CNN). CNNs are structured to extract features from images by successively processing subsets of the source image, and then to process the features rather than the raw pixels.
The inbuilt non-linearity, in combination with feature-based processing, means CNNs can invent data not in the original image. In the case of up-conversion, we are interested in the ability to create plausible new content that was not present in the original image, but that doesn’t modify the nature of the image too much. The CNN used to create the UHD data from the HD source is known as the Generator CNN.
When input source data needs to be propagated through the whole chain, possibly with scaling involved, then a specific variant of a CNN — known as a Residual Network (ResNet) — is used. A ResNet has a number of stages, each of which includes a contribution from a bypass path that carries the input data. For this study, a ResNet with scaling stages towards the end of the chain was used as the Generator CNN.
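The bypass idea can be sketched in a few lines. The example below is a one-dimensional toy, with hypothetical function names and hand-picked filter taps, and it omits the scaling stages mentioned above; it only shows how each stage adds its processed output to the input carried by the bypass path.

```python
def filter_same(signal, kernel):
    """Apply a small filter with zero padding so the output length matches the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = i + j - half
            if 0 <= k < len(signal):
                acc += w * signal[k]
        out.append(acc)
    return out

def relu(xs):
    """Simple non-linearity: negative values are set to zero."""
    return [max(0.0, x) for x in xs]

def residual_stage(x, kernel_a, kernel_b):
    """One stage: filter, non-linearity, filter, then add the bypassed input."""
    y = filter_same(relu(filter_same(x, kernel_a)), kernel_b)
    return [xi + yi for xi, yi in zip(x, y)]

# With all-zero filters the stage passes its input through unchanged,
# which is what lets source data propagate through a long chain of stages.
unchanged = residual_stage([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

In a trained network the filter taps are learned, so each stage contributes only a correction on top of the data it receives.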
For the Generator CNN to do its job, it must be trained with a set of known data (patches of reference images), with a comparison made between its output and the original. For training, the originals are a set of high-resolution UHD images. These are down-sampled to produce HD source images, which are then up-converted and finally compared to the originals.
The difference between the original and synthesized UHD images is calculated by the compare function, with the error signal fed back to the Generator CNN. Progressively, the Generator CNN learns to create an image with features more similar to those of original UHD images.
The training process is dependent on the data set used for training, and the neural network tries to fit the characteristics seen during training onto the current image. This is intriguingly illustrated in Google’s AI Blog [1], where a neural network presented with a random noise pattern introduces shapes like the ones used during training. It is therefore important that a diverse, representative content set is used for training. Patches from about 800 different images were used for training in MediaKind’s research.
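The preparation of training pairs described above might look like the following sketch. The 2×2 box average used for down-sampling is an assumption chosen for brevity (the article does not say which down-sampling filter was used), and the function names are hypothetical.

```python
def downsample_2x(patch):
    """Average each 2x2 block; the patch height and width are assumed to be even."""
    height, width = len(patch), len(patch[0])
    return [
        [
            (patch[y][x] + patch[y][x + 1] + patch[y + 1][x] + patch[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]

def make_training_pair(uhd_patch):
    """Return (HD source patch, UHD reference patch) for one training example."""
    return downsample_2x(uhd_patch), uhd_patch

uhd_patch = [
    [1.0, 3.0, 0.0, 0.0],
    [5.0, 7.0, 0.0, 8.0],
]
hd_patch, reference = make_training_pair(uhd_patch)
```

During training the Generator CNN would up-convert hd_patch, and the compare function would score its output against reference.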
The compare function
The compare function affects the way the Generator CNN learns to process the HD source data. A simple choice is the sum of absolute differences between the original and synthesized images, which is easy to calculate. However, this causes an issue due to training set imbalance: real pictures have large areas with relatively little fine detail, so the training is biased towards regenerating a result like that, which is very similar to the output of a bicubic interpolation filter.
This doesn’t really achieve the objective of creating plausible fine detail.
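For reference, a sum of absolute differences over two equally sized patches can be written as below. This is a minimal sketch with a hypothetical function name; a real training setup would compute it inside an ML framework so that the error can be fed back to the network.

```python
def sum_abs_diff(original, synthesized):
    """Sum of absolute pixel differences between two equally sized 2-D patches."""
    return sum(
        abs(o - s)
        for row_o, row_s in zip(original, synthesized)
        for o, s in zip(row_o, row_s)
    )

original = [[1.0, 2.0], [3.0, 4.0]]
synthesized = [[1.0, 0.0], [5.0, 4.0]]
error = sum_abs_diff(original, synthesized)  # 0 + 2 + 2 + 0
```

Every pixel counts equally in this measure, so the many pixels in low-detail areas outweigh the relatively few that carry fine detail.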
Generative Adversarial Neural Networks
Generative Adversarial Neural Networks (GANs) are a relatively new concept [2], in which a second neural network, known as the Discriminator CNN, is trained alongside the Generator CNN. The Discriminator CNN learns to detect the difference between features that are characteristic of original UHD images and those of synthesized UHD images. During training, the Discriminator CNN sees either an original UHD image or a synthesized UHD image, with the detection correctness fed back to the Discriminator and, if the image was a synthesized one, also fed back to the Generator CNN.
Each CNN is attempting to beat the other: the Generator by creating images that have characteristics more like originals, while the Discriminator becomes better at detecting synthesized images.
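One common way to express this contest numerically is with binary cross-entropy, sketched below. The article does not state which adversarial loss was used, so the formulation here (including the widely used "non-saturating" generator loss) and the function names are assumptions for illustration.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one decision; prob is the Discriminator's
    estimated probability that the image is an original."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def discriminator_loss(p_original, p_synthesized):
    """Low when originals are scored near 1 and synthesized images near 0."""
    return bce(p_original, 1) + bce(p_synthesized, 0)

def generator_adversarial_loss(p_synthesized):
    """Low when the Discriminator mistakes a synthesized image for an original."""
    return bce(p_synthesized, 1)

# A confident, correct Discriminator has a low loss...
d_good = discriminator_loss(0.9, 0.1)
d_poor = discriminator_loss(0.6, 0.4)
# ...while the Generator's loss falls as it fools the Discriminator more often.
g_fooling = generator_adversarial_loss(0.9)
g_caught = generator_adversarial_loss(0.1)
```

The two losses pull in opposite directions on the same Discriminator output for synthesized images, which is the adversarial part of the arrangement.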
The result is the synthesis of feature details that are characteristic of original UHD images.
Hybrid GAN approach
With a GAN approach, there is no real constraint on the ability of the Generator CNN to create new detail everywhere. This means the Generator CNN can create images that diverge from the original image in more general ways. A combination of both compare functions (the pixel-difference measure and the Discriminator CNN) can offer a better balance, retaining the detail regeneration but also limiting divergence. This produces results that are subjectively better than conventional up-conversion.
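A hybrid compare function of this kind can be sketched as a weighted sum of a pixel-difference term and an adversarial term. The weights below are purely illustrative assumptions, as are the function name and the use of a mean absolute difference; the article does not give the balance that was actually used.

```python
import math

def hybrid_loss(original, synthesized, p_synthesized,
                pixel_weight=1.0, adversarial_weight=0.01):
    """Weighted sum of a pixel term (limits divergence from the original)
    and an adversarial term (rewards detail that looks like real UHD).

    p_synthesized is the Discriminator's estimated probability that the
    synthesized image is an original. Both weights are assumed values.
    """
    diffs = [
        abs(o - s)
        for row_o, row_s in zip(original, synthesized)
        for o, s in zip(row_o, row_s)
    ]
    pixel_term = sum(diffs) / len(diffs)
    adversarial_term = -math.log(max(p_synthesized, 1e-12))
    return pixel_weight * pixel_term + adversarial_weight * adversarial_term

reference = [[1.0, 2.0], [3.0, 4.0]]
faithful = hybrid_loss(reference, reference, p_synthesized=1.0)
divergent = hybrid_loss(reference, [[4.0, 3.0], [2.0, 1.0]], p_synthesized=1.0)
```

Raising pixel_weight pushes the result towards a conventional interpolation look, while raising adversarial_weight gives the Generator more freedom to invent detail.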
What about interlace?
Conversion from 1080i60 (1080i30 when counted in frames rather than fields per second) to 2160p60 is necessarily more complex than from 720p60. Starting from 1080i, there are three basic approaches to up-conversion:
- Process only from the corresponding field
- De-interlace and process from the frame
- Process from multiple fields directly
Training data is required here too, and it must come from 2160p video sequences. This enables a set of down-sampled fields to be created, with each field coming from one frame in the original 2160p sequence, so the fields are not temporally co-located.
Surprisingly, results from field-based up-conversion tended to be better than those from de-interlaced frame conversion, despite the use of sophisticated motion-compensated de-interlacing: the frame-based conversion was dominated by artifacts from the de-interlacing process. However, it is clear that potentially useful data from the opposite fields did not contribute to the field-based result, which therefore missed data that could have produced a better result.
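The construction of interlaced training fields from progressive frames can be sketched as follows. The sketch assumes each frame has already been down-sampled to the HD frame size and that even-numbered frames supply the top field; both choices, and the function name, are assumptions for the example.

```python
def frames_to_fields(frames):
    """Take alternate lines from successive progressive frames.

    frames: list of frames, each a list of picture lines, already
    down-sampled to the HD frame height (assumed). Frame n supplies one
    field, so consecutive fields come from different instants in time.
    """
    fields = []
    for n, frame in enumerate(frames):
        parity = n % 2  # 0: top field (lines 0, 2, ...), 1: bottom field (lines 1, 3, ...)
        fields.append(frame[parity::2])
    return fields

# Two tiny four-line "frames"; each line is a list of pixel values.
frames = [
    [[0], [1], [2], [3]],
    [[10], [11], [12], [13]],
]
fields = frames_to_fields(frames)
```

Each 60 Hz progressive frame yields one field, so a 2160p60 sequence provides training data at the 60 fields-per-second rate of the interlaced source.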
Hybrid GAN with multiple fields
A solution to this is to feed data from multiple fields directly into a modified Generator CNN, letting the GAN learn how best to perform the de-interlacing function. This approach was adopted, and the network was re-trained with a new set of video-based data in which adjacent fields were also provided.
This led to both high visual spatial resolution and good temporal stability. These are, of course, best viewed as a video sequence; however, an example of one frame from a test sequence shows the comparison.
Up-conversion using a hybrid GAN with multiple fields was effective across a range of content, and it is especially relevant to the viewing experience of sports consumers. It offers a realistic means of creating content that has more of the appearance of UHD from both progressive and interlaced HD sources, which in turn can enable an improved experience for the fan at home watching a UHD sports channel.
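Grouping each field with its neighbors, ready to be presented together to the Generator CNN, might be done as in the sketch below. The number of adjacent fields (one on each side here) and the repetition of the first and last fields at the ends of the sequence are assumptions; the article does not specify either.

```python
def field_windows(fields, radius=1):
    """For each field, collect it together with `radius` neighbors on each side.

    At the start and end of the sequence the nearest available field is
    repeated, so every window has the same length (an assumed policy).
    """
    last = len(fields) - 1
    return [
        [fields[min(max(i + d, 0), last)] for d in range(-radius, radius + 1)]
        for i in range(len(fields))
    ]

# Labels stand in for field data to keep the example readable.
windows = field_windows(["f0", "f1", "f2"])
```

Each window would form one multi-field input to the modified Generator CNN, with the middle entry being the field currently being up-converted.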
[1] A. Mordvintsev, C. Olah and M. Tyka, “Inceptionism: Going Deeper into Neural Networks,” 2015. [Online]. Available: https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
[2] I. Goodfellow et al., “Generative Adversarial Nets,” Neural Information Processing Systems Proceedings, vol. 27, 2014.