Video Compression

VideoNerd

Content

Sensitivity to low-frequency components

Visual Cortex, Lateral Geniculate Nucleus etc.

HVS is capable of perceiving the light approximately at contrast ratio of 10^5:1

There are two types of photoreceptors – rods and cones

Why practical profiles of most video compression standards require YUV (or YCbCr) 4:2:0?

Visibility of coding artifacts AROUND a scene cut

Texture Masking

Scene Cut Masking

The human eye does not absorb the entire visual stimulus at the same resolution

Patch Redundancy

The color we assign an object depends …

Temporal Silencing

Brightness is not always proportionate to the intensity of light entering the eye

Critical Fusion Frequency (CFF)

Field of View

Chromatic Sensitivity

JND (Just Noticeable Distortion)

Fovea

Visual Attention

 

End-to-end visual communication system:

taken from “A Survey on Perceptually Optimized Video Coding”, YUN ZHANG et al.,2022

 

1. It is well known that the human visual system (HVS) is less sensitive to distortion of high-frequency components than to distortion of low-frequency components. Video coding methods exploit this property: the quantization step size applied to a DCT coefficient increases with the frequency of that coefficient. HEVC and H.264 support custom quantization matrices, which can be chosen so that high-frequency DCT coefficients are quantized coarsely and low-frequency coefficients are quantized more finely.
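To make the idea concrete, here is a minimal sketch of frequency-dependent quantization on an 8x8 block of DCT coefficients. The matrix rule (`base_step`, `slope`) and the function names are my own assumptions for illustration; this is not a default matrix from H.264 or HEVC.

```python
# Illustrative frequency-dependent quantization of an 8x8 block of DCT coefficients.
# The matrix rule below is hypothetical; it is NOT a default matrix from any standard.

N = 8

def make_quant_matrix(base_step=8, slope=2):
    """Step size grows with the frequency index u + v (hypothetical rule)."""
    return [[base_step + slope * (u + v) for v in range(N)] for u in range(N)]

def quantize(coeffs, qm):
    """Divide each coefficient by its step size and round with Python's round()."""
    return [[round(coeffs[u][v] / qm[u][v]) for v in range(N)] for u in range(N)]

def dequantize(levels, qm):
    """Reconstruct coefficients by multiplying the levels by the step sizes."""
    return [[levels[u][v] * qm[u][v] for v in range(N)] for u in range(N)]

qm = make_quant_matrix()
# Toy block: every coefficient has the same magnitude, so only the matrix
# decides which coefficients survive quantization.
coeffs = [[15.0] * N for _ in range(N)]
levels = quantize(coeffs, qm)
recon = dequantize(levels, qm)

print("step at DC:", qm[0][0], "step at highest frequency:", qm[7][7])
print("level at DC:", levels[0][0], "level at highest frequency:", levels[7][7])
```

With equal input magnitudes, the low-frequency coefficient keeps a nonzero level while the highest-frequency one is quantized to zero, which is exactly the behaviour a perceptually tuned matrix aims for.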

 

 

2. Visual processing in primates is mainly performed by the visual cortex and the lateral geniculate nucleus (both parts of the brain). The eyes are sensors with simple preprocessing functions; all preprocessing (e.g. edge enhancement) is executed by ganglion cells in the retina. Incidentally, preprocessing by ganglion cells is an excellent example of distributed computation.
Recognition of primitive shapes (e.g. circle, triangle) is carried out in the brain, in columnar cells of the cerebral cortex. Each column of cells is responsible for detecting its own shape; within a column the processing is serial.
HVS-related video compression is based on eliminating information that the human retina eliminates anyway. If we remove more than the retina eliminates, visual impairments are observed. If we remove less, the compression ratio is low and hence the compression is ineffective.

 

 

3. HVS is capable of perceiving light at a contrast ratio of approximately 10^5:1 simultaneously in one scene. This range is far beyond the dynamic range that the majority of existing capture and display devices can provide. Presently, the vast majority of consumer cameras and display devices support Low Dynamic Range (LDR) video content with a contrast ratio of approximately 100:1 to 1000:1.
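One way to compare these ratios is to express them as a number of luminance doublings (photographic "stops"), i.e. the base-2 logarithm of the contrast ratio. The small sketch below does only this arithmetic; the function name `stops` is my own.

```python
import math

def stops(contrast_ratio):
    """Number of luminance doublings spanned by a contrast ratio."""
    return math.log2(contrast_ratio)

for label, ratio in [("HVS, single scene", 1e5),
                     ("LDR, low end", 100),
                     ("LDR, high end", 1000)]:
    print(f"{label}: {ratio:g}:1 is about {stops(ratio):.1f} stops")
```

The HVS figure corresponds to roughly 16.6 stops, against about 6.6 to 10 stops for LDR content.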

 

 

4. There are two types of photoreceptors: rods and cones. Rods are sensitive to low light levels; they are unable to distinguish color and are predominant in the periphery. Cones, on the other hand, are sensitive to higher light levels of long, medium, and short wavelengths. They form the basis of color perception. Cone cells are mostly concentrated in the center region of the retina, called the fovea. The number of the rods, about 100 million, is higher by more than an order of magnitude compared to the number of cones, which is about 6.5 million. 

However, the most popular chroma subsampling is 4:2:0 (for every 4 luma samples there are 2 chroma samples, one Cb and one Cr), so the ratio of luma samples to chroma samples is 2:1. Why is a format such as 8:2:0, with a luma/chroma ratio of 4:1, not popular?

 

 

5. Why do practical profiles of most video compression standards require YUV (or YCbCr) 4:2:0? The human visual system (HVS) is more sensitive to structure and pattern (i.e. to luminance) than it is to color. Thus, it makes sense to keep luma pixels at a higher fidelity than chroma ones, and chroma subsampling is applied for this reason. The most common chroma subsampling scheme subsamples the chroma channels by a factor of two in each dimension: the 4:2:0 format (one Cb and one Cr sample for every four luma samples).
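The sample counting behind 4:2:0 can be sketched as follows. The 2x2 box-filter averaging and the toy 4x4 planes are assumptions for illustration; real encoders may use other downsampling filters and chroma sample positions.

```python
def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (simple box filter, an assumption).

    Assumes even width and height.
    """
    h, w = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]

def count(plane):
    """Number of samples in a plane."""
    return sum(len(row) for row in plane)

# Toy 4x4 planes with 8-bit sample values.
Y = [[16 * (x + 4 * y) for x in range(4)] for y in range(4)]
Cb = [[100, 102, 110, 112],
      [104, 106, 114, 116],
      [120, 120, 130, 130],
      [120, 120, 130, 130]]
Cr = [[128] * 4 for _ in range(4)]

cb_sub = subsample_420(Cb)
cr_sub = subsample_420(Cr)

samples_444 = count(Y) + count(Cb) + count(Cr)
samples_420 = count(Y) + count(cb_sub) + count(cr_sub)
print("4:4:4 samples:", samples_444, "4:2:0 samples:", samples_420)
print("subsampled Cb:", cb_sub)
```

Luma is untouched, each chroma plane shrinks to a quarter of its size, and the total number of samples is halved before any actual compression takes place.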

 

6. According to the paper “Visual masking at video scene cuts” by W.J. Tam et al., the visibility of coding artifacts AROUND a scene cut is significantly reduced (masked), both in the first subsequent frame and in the previous frame. The reduction in the visibility of visual impairments after a scene cut is called “forward masking” (a similar effect is observed in audio perception too).

In addition to forward masking, another unexpected phenomenon called “backward masking” is observed at scene cuts: the visibility of coding artifacts in the frame before a scene cut is significantly reduced (a similar backward masking is observed in audio perception as well). Backward masking may be explained as the result of variation in the latency of neural signals in the visual system.

 

7. Texture Masking

Many coding artifacts in complex regions such as tree leaves are less visible than those in uniform regions such as the sky. The same amount of random noise is noticed differently when it is added to backgrounds with different frequency distributions. Noise added to a flat (low-frequency) background is much more visible than noise added to a textured (high-frequency) background:

taken from the paper “A Human Visual System-Based Objective Video Distortion Measurement System”, Zhou Wang and Alan C. Bovik

Note: distortion in regions with a regular pattern, such as parallel lines, is more perceivable than distortion in chaotic textured regions, such as grass, according to the paper “A Survey on Perceptually Optimized Video Coding”, YUN ZHANG et al., 2022

 

 

8. Scene Cut Masking. The ability of the human visual system to notice coding artifacts is significantly reduced after a scene cut (i.e. at abrupt temporal decorrelation). The first pictures of the new scene can be quantized more harshly without compromising visual quality.


According to the paper “Visual masking at video scene cuts” by W.J. Tam et al. (which is itself based on earlier reports), the visibility of coding artifacts AROUND a scene cut is significantly reduced (masked), both in the first subsequent frame and in the previous frame. The reduction in the visibility of visual impairments after a scene cut is called “forward masking” (a similar effect is observed in audio perception too). In addition to forward masking, another unexpected phenomenon called “backward masking” is observed at scene cuts: the visibility of coding artifacts in the frame before a scene cut is significantly reduced (a similar backward masking is observed in audio perception as well). Backward masking may be explained by video frames being buffered in some way; otherwise it would contradict causality, since the scene cut occurs after the backward frame and nevertheless affects the perception of that frame.

According to the M.A. thesis “Visual Temporal Masking at Video Scene Cuts” by Carol English, 1997, visual masking is observed at three frames on each side of a scene cut, but the masking strength was found to vary with image content. Moreover, forward masking was found to conceal more noise than backward masking. The strongest masking effects were observed in the first frame after a scene cut and in the last frame before a scene cut. In other words, the frames adjacent to a scene cut can be degraded severely without affecting perceived image quality.
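An encoder could exploit these findings by raising the quantization parameter (QP) for the frames around a detected scene cut. The sketch below is only an illustration of that idea: the offset values, the names `FORWARD_OFFSETS` and `BACKWARD_OFFSETS`, and the function `qp_offset` are hypothetical and are not taken from the cited works. They merely respect the reported ordering (three frames on each side, strongest at the cut, forward stronger than backward).

```python
# Hypothetical QP offsets; illustrative values, not measured ones.
FORWARD_OFFSETS = [6, 4, 2]   # 1st, 2nd, 3rd frame after the cut
BACKWARD_OFFSETS = [4, 2, 1]  # last, 2nd-last, 3rd-last frame before the cut

def qp_offset(frame_idx, cut_idx):
    """Return the QP increase for a frame; cut_idx is the first frame of the new scene."""
    d = frame_idx - cut_idx
    if 0 <= d < len(FORWARD_OFFSETS):
        return FORWARD_OFFSETS[d]           # forward masking
    if -len(BACKWARD_OFFSETS) <= d < 0:
        return BACKWARD_OFFSETS[-d - 1]     # backward masking
    return 0                                # far from the cut: no masking assumed

base_qp = 30
cut = 10
qps = [base_qp + qp_offset(i, cut) for i in range(5, 15)]
print(qps)
```

Note that using the backward offsets requires the encoder to know about the cut before coding the preceding frames, i.e. some lookahead is assumed.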

 

9. The human eye does not absorb the entire visual stimulus at the same resolution. That part of the stimulus which is imaged on the fovea has the highest resolution and regions which are imaged farther away have lower resolution.
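If the gaze position were known (for example from an eye tracker, which is an assumption here), this property could be turned into a per-block QP map that spends fewer bits away from the gaze point. The linear rule, the parameter values and the name `foveated_qp_map` below are hypothetical; this is a sketch of the principle, not a model of actual retinal acuity.

```python
import math

def foveated_qp_map(width_blocks, height_blocks, gaze,
                    base_qp=26, slope=1.5, max_offset=10):
    """Build a per-block QP map.

    The QP offset grows linearly with the distance (in block units) from the
    gaze position and is capped at max_offset. The rule is hypothetical.
    """
    gx, gy = gaze
    qp_map = []
    for by in range(height_blocks):
        row = []
        for bx in range(width_blocks):
            dist = math.hypot(bx - gx, by - gy)
            offset = min(max_offset, round(slope * dist))
            row.append(base_qp + offset)
        qp_map.append(row)
    return qp_map

# 8x5 grid of blocks, gaze at block column 3, row 2.
qp_map = foveated_qp_map(8, 5, gaze=(3, 2))
for row in qp_map:
    print(row)
```

The block under the gaze point keeps the base QP, and the QP rises towards the edges of the picture.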

 

10. Patch Redundancy.  Natural images tend to contain repetitive visual content. In particular, small (e.g., 5 × 5) image patches in a natural image tend to redundantly recur many times inside the image, within the same scale.

 

11. The color we assign to an object depends not only on the particular spectrum of light reflected from it but also on the light reflected from surrounding objects.

 

12. Temporal Silencing. This phenomenon is triggered by the presence of large temporal image flows: objects changing in hue, luminance, size, or shape appear to stop changing. For details, see the paper “Motion Silences Awareness of Visual Change” by Jordan W. Suchow and George A. Alvarez, Department of Psychology, Harvard University, 2011.

 

13. Brightness is not always proportional to the intensity of light entering the eye. The perceived brightness of an object depends not only on the light intensity of the object itself but also on its surrounding background:

    

The left patch appears brighter than the right one due to dark surrounding

 

 

14. Critical Fusion Frequency (CFF)

Critical Fusion Frequency is the frame rate at which we perceive continuity between frames. For laptops a CFF of 60 fps suffices; for the cinema 24 fps suffices.

 

15.  Field of View

The Field of View (FoV) of human eyes covers 200° in width and 135° in height. Visual acuity is not evenly distributed over the FoV. The photoreceptors and ganglion cells are distributed extremely densely at the center, the retinal fovea, whose radius is about 1.5 mm. The fovea covers about 1% of the retina and is the most sensitive visual area. The densities of the photoreceptors and ganglion cells decrease rapidly from the fovea to the periphery; consequently, visual acuity progressively decreases as the distance from the fovea increases.

The pupil diameter also affects visual acuity: it varies from 3 mm in daylight to 9 mm in night vision, and visual acuity decreases from day to night.

 

16. Chromatic Sensitivity

HVS senses light with wavelengths between 380 nm and 800 nm. There are three kinds of cones (S-, M- and L-cones), and each cone type is sensitive to a specific range of wavelengths: S to blue (maximum at 437 nm), M to green (maximum at 533 nm) and L to red (maximum at 564 nm) light. The three types of cones explain why the RGB representation is useful.

 

17. JND (Just Noticeable Distortion)

Due to visual sensitivity and masking effects in HVS, not every distortion is perceivable. The minimum visibility threshold of pixel intensity change is denoted as JND, or in other words: JND refers to the maximum distortion that HVS cannot perceive. 

JND depends on many factors such as average brightness, contrast, colorfulness, temporal activity etc. For example, HVS sensitivity to error is generally higher in smooth regions and lower in textured (highly detailed) regions.
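The dependence on local content can be illustrated with a toy threshold that grows with the activity of a block. The formula (a base value plus a gain times the standard deviation), the constants and the function names are hypothetical; published JND models are considerably more elaborate.

```python
import statistics

def toy_jnd_threshold(block, base=3.0, gain=0.5):
    """Hypothetical JND: a base value plus a term proportional to the
    standard deviation of the block (more texture, higher threshold)."""
    pixels = [p for row in block for p in row]
    return base + gain * statistics.pstdev(pixels)

def is_below_jnd(original, distorted):
    """True when no pixel changes by more than the block's toy threshold."""
    t = toy_jnd_threshold(original)
    return all(
        abs(o - d) <= t
        for row_o, row_d in zip(original, distorted)
        for o, d in zip(row_o, row_d)
    )

def add_error(block, err):
    """Add the same error to every pixel (a crude stand-in for coding distortion)."""
    return [[p + err for p in row] for row in block]

smooth = [[120] * 4 for _ in range(4)]
textured = [[60, 180, 60, 180],
            [180, 60, 180, 60],
            [60, 180, 60, 180],
            [180, 60, 180, 60]]

for name, block in [("smooth", smooth), ("textured", textured)]:
    print(name, "threshold:", toy_jnd_threshold(block),
          "error of 8 hidden:", is_below_jnd(block, add_error(block, 8)))
```

Under this toy model the same error exceeds the threshold in the smooth block but stays below it in the textured one.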

 

18. Fovea

The cone photoreceptors of the human retina are most densely packed in a small circular region called the fovea, which is located on the visual axis. The part of the scene projected onto the fovea (the center of our gaze) is therefore perceived in high resolution. The fovea covers only about 2-5 degrees of our visual field.

 

19. Visual Attention

Visual attention is a complex cognitive process and is therefore challenging to model.

Human faces tend to attract visual attention, as do moving objects (this is an evolutionarily acquired trait).
