Data Compression (1)
Hai Tao
Department of Computer Engineering, University of California at Santa Cruz

Why compression?
- Storing or transmitting multimedia data requires large amounts of storage or bandwidth.
- One hour of 44K-sample/sec, 16-bit stereo (two-channel) audio takes 3600 x 44000 x 2 x 2 = 633.6 MB, which fills one CD (650 MB). MP3 compression can reduce this by a factor of about 10.
- A 500x500 color image takes 750 KB without compression (JPEG can reduce this by a factor of 10 to 20).
- One minute of real-time, full-size color video takes 60 x 30 x 640 x 480 x 3 = 1.659 GB, so a two-hour movie requires about 200 GB. MPEG-2 compression can bring this down to 4.7 GB (DVD).

Compression methods
- Entropy coding: run-length coding, Huffman coding, arithmetic coding
- Source coding:
  - Prediction: DPCM, DM
  - Transformation: FFT, DCT
  - Layered coding: bit position, sub-sampling, sub-band coding
  - Vector quantization
- Hybrid coding: JPEG, MPEG, H.261, DVI RTV, DVI PLV

Run-length coding
Example: a scanline of a binary image is
00000 00000 00000 00000 00010 00000 00000 01000 00000 00000
a total of 50 bits. However, runs of consecutive 0s or 1s can be represented more efficiently:
0(23) 1(1) 0(12) 1(1) 0(13)
If each run is stored as a 1-bit value plus a 5-bit count, the five runs take 5 + 5x5 = 30 bits instead of 50, a 40% saving. A minimal sketch of such an encoder follows.
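A short sketch of the run-length idea above. The (value, count) pairs are exactly the slide's notation; the 1-bit-value-plus-5-bit-count layout is the assumption that makes the 30-bit total work out, and note that a 5-bit count caps runs at 31, so longer runs would have to be split.

```python
def run_length_encode(bits):
    """Collapse a bit string into (value, run_length) pairs."""
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1                      # extend the current run
        runs.append((bits[i], j - i))
        i = j
    return runs

scanline = "0" * 23 + "1" + "0" * 12 + "1" + "0" * 13   # the 50-bit example
runs = run_length_encode(scanline)
print(runs)  # [('0', 23), ('1', 1), ('0', 12), ('1', 1), ('0', 13)]

# Assumed layout: 1 bit for the run value + 5 bits for the count, per run
encoded_bits = len(runs) * (1 + 5)
print(encoded_bits, "bits instead of", len(scanline))   # 30 bits instead of 50
```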
Huffman coding
Example: a language has 4 letters: A, B, S, Z. To encode each letter uniquely with a fixed-length code, we need two bits:
A = 00, B = 01, S = 10, Z = 11
The message AAABSAAAAZ is then encoded with 20 bits. Now assign instead
A = 0, B = 100, S = 101, Z = 11
and the same message can be encoded using 15 bits. The basic idea behind the Huffman coding algorithm is to assign shorter codewords to more frequently used symbols.

Huffman coding - problem statement
Given a set of N symbols S = {s_i, i = 1, ..., N} with probabilities of occurrence P_i, i = 1, ..., N, find the optimal encoding of the symbols, i.e., the one achieving the minimum transmission rate (bits/symbol).
Example: five symbols A, B, C, D, E with probabilities P(A) = 0.16, P(B) = 0.51, P(C) = 0.09, P(D) = 0.13, P(E) = 0.11. Without Huffman coding, 3 bits are needed for each symbol.

Huffman Coding - Algorithm
- Each symbol is a leaf node of a tree.
- Combine the two symbols (or composite symbols) with the smallest probabilities into a new composite parent symbol whose probability is the sum of the two. Assign bits 0 and 1 to the two links.
- Continue this process until all symbols are merged into a single root node.
- For each symbol, the sequence of 0s and 1s on the path from the root node to its leaf is its codeword.

Huffman Coding - Example
[Tree-building figure: Step 1 merges C (0.09) and E (0.11) into CE (0.20); Step 2 merges D (0.13) and A (0.16) into AD (0.29); Step 3 merges CE and AD into ACDE (0.49); Steps 4 and 5 merge ACDE with B (0.51) into the root ABCDE (1.0), with each pair of links labeled 0 and 1.]
The resulting codewords are A = 000, B = 1, C = 011, D = 001, E = 010.
Expected code length: 3 x (0.16 + 0.09 + 0.13 + 0.11) + 1 x 0.51 = 3 x 0.49 + 1 x 0.51 = 1.98 bits/symbol.
The saving relative to the 3-bit fixed-length code is (3 - 1.98)/3 = 1.02/3 = 34%. A sketch of the construction follows.
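The merge procedure above maps directly onto a priority queue. A minimal sketch: the 0/1 labels chosen at each merge are arbitrary, so the exact codewords can differ from the slide's (e.g., A may come out as 011 rather than 000), but the code lengths and the 1.98 bits/symbol expected length are the same.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code from {symbol: probability}; return {symbol: codeword}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    # Heap entries: (subtree probability, tiebreak id, {symbol: partial codeword})
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # smallest probability
        p1, _, codes1 = heapq.heappop(heap)   # second smallest
        # Merge the two subtrees, prepending the link bit to every codeword below
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.16, "B": 0.51, "C": 0.09, "D": 0.13, "E": 0.11}
code = huffman_code(probs)
print(code)                                   # B gets the 1-bit codeword
expected = sum(probs[s] * len(c) for s, c in code.items())
print(round(expected, 2), "bits/symbol")      # 1.98
```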
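And a small check of the earlier four-letter example: encoding AAABSAAAAZ with the variable-length code gives 15 bits, and because no codeword is a prefix of another, a greedy bit-by-bit scan decodes it unambiguously.

```python
code = {"A": "0", "B": "100", "S": "101", "Z": "11"}  # the slide's prefix-free code

def encode(message, code):
    return "".join(code[ch] for ch in message)

def decode(bits, code):
    """Greedy decoding: in a prefix-free code, the first codeword match is correct."""
    inverse = {c: s for s, c in code.items()}
    symbols, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            symbols.append(inverse[buf])
            buf = ""
    return "".join(symbols)

bits = encode("AAABSAAAAZ", code)
print(len(bits))            # 15, versus 20 with the fixed 2-bit code
print(decode(bits, code))   # AAABSAAAAZ
```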