Video takes up a lot of space: one minute of uncompressed video takes up about 60 GB. Because of that, video must be compressed. Usually lossy compression is more useful than lossless compression, because it can discard a relatively large amount of data before you start noticing a difference. However, there are many factors to take into consideration, such as the video quality, the amount of data needed, the robustness to data losses and errors, and the complexity of the encoding and decoding algorithms.
There are several techniques used to reduce the temporal and spatial redundancy of a video sequence:
- Reduce the number of bits used for quantization.
- Reduce the color resolution, typically with 4:2:2 or 4:2:0 chroma subsampling.
- Compress the image in the frequency domain using the DCT.
- Apply motion compensation techniques.
- Use variable-length codes based on the probability of occurrence of each possible value of the source symbol.
- Use predictive coding techniques, exploiting the similarity of adjacent data.
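The variable-length coding mentioned above can be illustrated with a classic Huffman code, where frequent symbols get short codewords and rare symbols get long ones. This is a minimal sketch, not the exact code tables used by any video standard:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table from a sequence of symbols.

    Frequent symbols end up with short codewords, rare ones with
    longer codewords: the idea behind variable-length coding.
    """
    freq = Counter(symbols)
    # Heap entries: (frequency, tie-breaker, {symbol: partial codeword})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:  # degenerate case: a single distinct symbol
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing 0 and 1.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
# 'a' is the most frequent symbol, so it gets the shortest codeword.
assert len(codes["a"]) < len(codes["b"]) <= len(codes["c"])
```

Real codecs use predefined code tables rather than building the tree per stream, but the length/probability trade-off is the same.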
Every frame is compressed in the frequency domain; the sequence of frames is then further compressed using predictive coding and motion compensation techniques. The steps to encode each frame can be summarized as:
- The frame is divided into blocks, usually 8×8 pixel blocks.
- Every block is separately encoded in the frequency domain using the DCT, a domain where the representation is more compact. The result is an 8×8 transform coefficient array in which the top-left element is the DC component (zero frequency) and entries with increasing vertical and horizontal index values represent higher vertical and horizontal spatial frequencies.
- Every 8×8 coefficient array is quantized using quantization tables. The quantization step for each transform coefficient is chosen according to visual perception: coarser steps are used where the eye is less sensitive. As a result, the high-frequency area (bottom-right) contains many coefficients set to zero, which can be compressed very efficiently.
- The 8×8 quantized blocks are scanned using zigzag scanning, which groups the zero coefficients into long runs.
- The coefficients are transmitted using run-length encoding combined with Huffman coding.
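The whole per-block pipeline can be sketched in a few functions. This is a toy illustration: the DCT is the naive O(N⁴) formula, and the quantization table (step growing with frequency) is made up for the example, not taken from any standard:

```python
import math

N = 8

def dct2(block):
    """2-D DCT-II of an N x N block (direct formula, not the fast version)."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
            cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
            out[u][v] = cu * cv * s
    return out

def quantize(coeffs, base_step=16):
    """Divide each coefficient by a step that grows with frequency,
    so most high-frequency (bottom-right) coefficients round to zero.
    The step table is illustrative, not from any standard."""
    return [[round(coeffs[u][v] / (base_step * (1 + u + v)))
             for v in range(N)] for u in range(N)]

def zigzag(block):
    """Scan an N x N block in zigzag order: DC first, then by rising
    frequency, alternating direction along each anti-diagonal."""
    order = sorted(((u, v) for u in range(N) for v in range(N)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [block[u][v] for u, v in order]

def run_length(seq):
    """(value, run) pairs; the long zero runs after zigzag compress well."""
    pairs, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        pairs.append((seq[i], j - i))
        i = j
    return pairs

# A flat 8x8 block: only the DC coefficient survives quantization,
# so the zigzag scan collapses to one value plus a run of 63 zeros.
flat = [[100] * N for _ in range(N)]
scan = zigzag(quantize(dct2(flat)))
rle = run_length(scan)
assert rle == [(50, 1), (0, 63)]
```

The flat-block example shows why the pipeline works: one (value, run) pair replaces 64 coefficients.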
Video sequences present a large amount of temporal redundancy. Usually neighboring images (the previous and the next frame) are almost identical, so a frame can be predicted from the closest frames. Two prediction modes are used:
- Intra mode prediction: the image is coded on its own, using spatial coding with DCT coefficients.
- Inter mode prediction: the frame is predicted from previously decoded images by estimating the motion in the scene.
The motion estimation process consists of:
- Dividing the image into macroblocks of 16×16 pixels.
- Searching for each macroblock in the reference image. This process is known as block matching. In forward prediction the reference is a previous frame, so pixels that have just appeared in the scene may have no match in the reference; in backward prediction the reference is a future frame, which can provide a match for such newly appearing pixels.
- Extracting the motion vectors and the DCT error coefficients. Motion vectors provide an offset from the coordinates of the macroblock in the decoded picture to its coordinates in the reference frame. The difference between the macroblock in the decoded picture and its reference is coded using the DCT; the result is the DCT error coefficients.
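A minimal block-matching sketch makes the motion vector idea concrete. It uses an exhaustive search with the sum of absolute differences (SAD) as the matching cost; real encoders use 16×16 macroblocks and faster search patterns, while the 4×4 block and ±2 search window here are just to keep the example small:

```python
def sad(a, b):
    """Sum of absolute differences: the usual block-matching cost."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def extract_block(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]

def full_search(current_block, ref_frame, top, left, size=4, radius=2):
    """Exhaustive block matching over a small search window.

    Returns the motion vector (dy, dx) minimising the SAD between the
    current block at (top, left) and candidate blocks in the reference.
    """
    best = None
    h, w = len(ref_frame), len(ref_frame[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= h - size and 0 <= x <= w - size:
                cost = sad(current_block,
                           extract_block(ref_frame, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
    return best[1]

# Reference frame with a bright 4x4 square at (2, 2)...
ref = [[0] * 10 for _ in range(10)]
for r in range(2, 6):
    for c in range(2, 6):
        ref[r][c] = 255
# ...that has moved to (3, 4) in the current frame.
cur_block = [[255] * 4 for _ in range(4)]
mv = full_search(cur_block, ref, top=3, left=4)
# The vector points back from (3, 4) to the block's old position (2, 2).
assert mv == (-1, -2)
```

Encoding the vector plus the (here zero) residual instead of the raw block is what removes the temporal redundancy.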
All video coding standards, such as MPEG-2 or MPEG-4, combine temporal and spatial coding with more advanced coding techniques. The most widely used video standards are detailed in the next post.