Compression text email

From: [Permission pending]

Subject: Re: Compression and mathematics

Hello Dave,

>The main idea is just to notice that some letters or letter sequences

>(e.g. 'the') occur more frequently than others (e.g. 'q'), so it makes

>sense to store them with a non-ASCII convention that uses fewer bits

>to store the common strings and more bits for the uncommon ones. For

>example, you can stick to 1-letter sequences only, but arrange the

>letters in order of frequency. Take enough letters from the top of the

>list to total a frequency of about 1/2; assign them codes which begin

>with the bit '0' and assign the others codes which begin with '1'.

>Divide up each half likewise to decide on the second bit, and so on.

>In this way, each letter gets a unique string, but the common letters

>get shorter strings, and so the total length of the stored sequence is

>less than using a 7-bit code for every letter.

Thanks for replying. I do have a question, though. (I assume I understand

the above text correctly. If it appears that that's not the case, please

tell me!)

As you assign more letters to one bit (like 'a,e,i,o,u' for first bit=0),

how do you know *which* character you mean?

Or if one character can be represented by 2 bits, and the other with 5

bits, how does the decompress-program know when a new character starts?

Thanks for the help!

[Permission pending]

==============================================================================

Date: Thu, 20 Jun 96 12:56:00 CDT

From: rusin (Dave Rusin)

To: [Permission pending]

Subject: Re: Compression and mathematics

>As you assign more letters to one bit (like 'a,e,i,o,u' for first bit=0),

>how do you know *which* character you mean?

>

>Or if one character can be represented by 2 bits, and the other with 5

>bits, how does the decompress-program know when a new character starts?

Right, you have to make sure that the bit strings used for the encoding

are not only distinct but also never initial segments (prefixes) of one

another. But this is easily done in practice.
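
The condition can be checked mechanically. Here is a minimal sketch in Python; the function name `is_prefix_free` and the two tiny code sets are made up for illustration and are not part of the scheme discussed in this thread.

```python
from itertools import permutations

def is_prefix_free(words):
    """Return True if no code word is an initial segment of another.

    Duplicate words also count as a failure, since a word is trivially
    a prefix of an identical copy of itself.
    """
    return not any(b.startswith(a) for a, b in permutations(words, 2))

good = ["0", "10", "11"]   # illustrative: decodable without separators
bad = ["0", "01", "11"]    # illustrative: "0" is a prefix of "01"

print(is_prefix_free(good))
print(is_prefix_free(bad))
```

With the second set a decoder that has just read a 0 cannot tell whether the letter is finished, which is exactly the ambiguity asked about.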

I don't have the real data to hand so let's make up some numbers. Suppose

we have a 16-letter alphabet whose letters appear in a text with the

following frequencies:

a    b    c    d    e    f    g
.18  .13  .12  .11  .09  .07  .06

h    i    j    k    l    m    n    o    p
.05  .04  .03  .03  .03  .02  .02  .01  .01

success: you'd use all 4-bit codes with no wasted bits (normal ASCII is

pretty inefficient, using 7 or 8 bits just for 26 distinct letters, or

about 100 distinct characters). Let's compare to the proposed alternative.

We'd give a, b, c, and d codes 0*, and the rest 1*. Iterating the construction

in the first set suggests

a -> 000

b -> 001

c -> 010

d -> 011

In the second set, I guess e,f,g,h get second digit 0, the rest 1.

Naturally I can just use

e -> 1000

f -> 1001

g -> 1010

h -> 1011

for the first bunch, but for the rest I notice i,j,k take up about

half the remaining distribution, so I give them a 0 next and the rest

a 1. That is, I use

i -> 1100

j -> 11010

k -> 11011

and then split off l and m from the rest:

l -> 11100

m -> 11101

followed by

n -> 11110

o -> 111110

p -> 111111
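
For comparison, here is a rough Python sketch of the same halving idea done automatically. It is an assumption-laden illustration, not the procedure used by hand above: the name `split_codes` is hypothetical, frequencies are given as whole percentages to avoid rounding trouble, and the cut is always placed where the two halves are closest in total weight. That strict rule agrees with the hand-made codes for a through d, but it can cut elsewhere further down (for example, it need not keep h together with e, f, g), so the remaining codes may differ from the table above.

```python
def split_codes(items, prefix=""):
    """Assign bit strings by repeatedly halving a frequency-sorted list.

    items: list of (letter, weight) pairs, most frequent first.
    Returns a dict mapping each letter to its code.
    """
    if len(items) == 1:
        return {items[0][0]: prefix or "0"}
    total = sum(weight for _, weight in items)
    running = 0
    best_cut, best_gap = 1, None
    for cut in range(1, len(items)):
        running += items[cut - 1][1]
        gap = abs(2 * running - total)   # distance from an even split
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    codes = split_codes(items[:best_cut], prefix + "0")
    codes.update(split_codes(items[best_cut:], prefix + "1"))
    return codes

# Frequencies from the made-up 16-letter alphabet, as percentages.
PERCENT = [("a", 18), ("b", 13), ("c", 12), ("d", 11), ("e", 9), ("f", 7),
           ("g", 6), ("h", 5), ("i", 4), ("j", 3), ("k", 3), ("l", 3),
           ("m", 2), ("n", 2), ("o", 1), ("p", 1)]

codes = split_codes(PERCENT)
for letter, code in codes.items():
    print(letter, "->", code)
```

Because every letter ends up at a different leaf of the same binary tree, no code produced this way can be a prefix of another.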

Now let me try a random (honest!) string and try to decode:

010010111010100101101010101001011011011100101001000

Among the characters starting with 0 I see two with a 1 next, so I

look to the third digit: 010 can only be c. Then I look at the

remaining string 010111... and see another c. Peeling that off as

well we continue:

010 010 11101 010 010 11010 1010 1001 011 011 011 1001 010 010 00
 c   c    m    c   c    j    g    f    d   d   d   f    c   c

The last two digits don't correspond to any letter's code.
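
The peeling-off procedure is easy to automate. The following Python sketch uses the code table from this message; the function name `decode` is just an illustrative choice. It reads one bit at a time and emits a letter as soon as the bits collected so far match a code, which is safe only because no code is a prefix of another.

```python
CODES = {
    "a": "000", "b": "001", "c": "010", "d": "011",
    "e": "1000", "f": "1001", "g": "1010", "h": "1011",
    "i": "1100", "j": "11010", "k": "11011",
    "l": "11100", "m": "11101", "n": "11110",
    "o": "111110", "p": "111111",
}

def decode(bits):
    """Return (decoded letters, leftover bits that match no code)."""
    lookup = {code: letter for letter, code in CODES.items()}
    letters = []
    pending = ""
    for bit in bits:
        pending += bit
        if pending in lookup:        # a complete code word has arrived
            letters.append(lookup[pending])
            pending = ""
    return "".join(letters), pending

text, leftover = decode("010010111010100101101010101001011011011100101001000")
print(text, leftover)
```

On the random string above this gives the same fourteen letters as the hand decoding, with the final 00 left over.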

Incidentally, consider the expected length of an encoded message. Here again

are the frequencies, now with the bit length of the code used:

a    b    c    d    e    f    g    h
.18  .13  .12  .11  .09  .07  .06  .05
3    3    3    3    4    4    4    4

i    j    k    l    m    n    o    p
.04  .03  .03  .03  .02  .02  .01  .01
4    5    5    5    5    5    6    6

The expected number of bits used per character is thus

3(.54) + 4(.31) + 5(.13) + 6(.02) = 3.63 -- better

than the 4.0 you would need for a fixed 4-bit code.
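
The weighted average can be recomputed directly from the frequencies and code lengths listed above. This is a small Python sketch; the frequencies are entered as whole percentages (an implementation convenience, not something from the original message) so that the sum is exact integer arithmetic.

```python
# Frequency of each letter in percent, and the bit length of its code.
PERCENT = {"a": 18, "b": 13, "c": 12, "d": 11, "e": 9, "f": 7, "g": 6,
           "h": 5, "i": 4, "j": 3, "k": 3, "l": 3, "m": 2, "n": 2,
           "o": 1, "p": 1}
BITS = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4, "g": 4, "h": 4,
        "i": 4, "j": 5, "k": 5, "l": 5, "m": 5, "n": 5, "o": 6, "p": 6}

# Expected code length = sum over letters of (probability x length).
weighted_total = sum(PERCENT[ch] * BITS[ch] for ch in PERCENT)
average_bits = weighted_total / 100

print(average_bits)   # 3.63 bits per character, versus 4 for a fixed code
```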

In natural languages the letters tend not to be used equally often, which

permits this kind of savings. When you look at two-or-more letter

combinations, the savings are even better because the distributions are

even more skewed when you already know one letter (e.g. many letters are

used with similar frequencies, but when you just had a "b" you're _much_

more likely to have a vowel than a consonant, except for a few

consonants of medium likelihood [b, l, r, s].) When looking at longer

strings in this way one can achieve compression rates you see quoted

in comparisons (circa 50%). Various other tricks can be used too, e.g. adjusting codes mid-file, since certain words might be used for a

few paragraphs, then dropped. Smart programs also try to recognize

file-type and adjust compression scheme accordingly (for text, executable,

image, sound, etc.)
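
To see what "the distribution once you already know one letter" means in practice, one can tabulate which letters follow which. The Python sketch below does this for an arbitrary sample string; the string "abracadabra" and the function name `following_counts` are placeholders chosen for illustration, and the counts say nothing about real language statistics.

```python
from collections import Counter, defaultdict

def following_counts(text):
    """For each character, count the characters that come right after it."""
    table = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        table[first][second] += 1
    return table

table = following_counts("abracadabra")
for first in sorted(table):
    print(first, dict(table[first]))
```

In this toy sample a "b" is always followed by "r", so after a "b" the next letter costs essentially nothing to store; a compressor that keeps one code table per preceding letter exploits exactly this kind of skew.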

Of course, if your file is just a million random bits, then you can't

really compress much. It's a fundamental logical limitation: no

lossless compression scheme can hope to have a compression ratio of

better than 1:1 on all strings of a given length (there is no

one-to-one function from a set of cardinality 2^n to a set of

cardinality 2^(n-1) or smaller.)
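
The counting behind this can be spelled out with a few lines of Python. This sketch (the function name `shorter_strings` is illustrative) counts every bit string strictly shorter than n bits, including the empty one, and confirms there is always one fewer of them than there are strings of exactly n bits.

```python
def shorter_strings(n):
    """Number of distinct bit strings of length 0, 1, ..., n-1."""
    return sum(2 ** k for k in range(n))

for n in (1, 8, 20):
    inputs = 2 ** n                  # strings of exactly n bits
    outputs = shorter_strings(n)     # all possible shorter strings
    print(n, inputs, outputs, inputs - outputs)
```

Since there are 2^n inputs but only 2^n - 1 shorter outputs, any scheme that shortens every n-bit string must send two different inputs to the same output, and then it cannot be undone.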

Lossy compression schemes (e.g. JPEG image compression) are another

matter, of course. Practical programs also must address the ability to

limit data loss in cases of transmission failure (e.g. notice that one

missing bit, above, can ruin the decoding of what follows). Other topics include

encoding with distinctive character sets (UUENCODE-style), encryption

(possible e.g. with PKZIP) and of course decryption, program efficiencies

and techniques (e.g. self-compressing programs), and natural language

analysis.
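
As a rough illustration of the transmission-failure point, the Python sketch below encodes the fourteen letters from the worked example with the code table from this message, drops the very first bit, and decodes what is left. The helper names `encode` and `decode` are illustrative. How far the damage spreads depends on the data; in this particular run the first six letters come out wrong before the decoder happens to fall back into step.

```python
CODES = {
    "a": "000", "b": "001", "c": "010", "d": "011",
    "e": "1000", "f": "1001", "g": "1010", "h": "1011",
    "i": "1100", "j": "11010", "k": "11011",
    "l": "11100", "m": "11101", "n": "11110",
    "o": "111110", "p": "111111",
}

def encode(text):
    """Concatenate the code words for each letter of text."""
    return "".join(CODES[ch] for ch in text)

def decode(bits):
    """Return (decoded letters, leftover bits that match no code)."""
    lookup = {code: letter for letter, code in CODES.items()}
    letters, pending = [], ""
    for bit in bits:
        pending += bit
        if pending in lookup:
            letters.append(lookup[pending])
            pending = ""
    return "".join(letters), pending

message = "ccmccjgfdddfcc"
bits = encode(message)
damaged = bits[1:]                 # simulate losing the first bit

print(decode(bits)[0])
print(decode(damaged)[0])
```

Nothing in the bit stream marks where a letter begins, so the decoder has no way to notice that it has slipped out of alignment.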

dave
