Text Representation

1.2 Text, Sound and Images
Representation of text - Character Sets
- Text is a collection of characters that can be represented in binary, which is the language that
computers use to process information.
- To represent text in binary, a computer uses a character set.
- A character set is a collection of all the characters and symbols that can be represented by a
computer system, together with the corresponding binary codes that represent them. Each
character and symbol is assigned a unique value.
- In other words, it is a list of characters that has been defined by computer hardware and
software. A method of coding is necessary so that the computer can understand human
characters.
- The most commonly used character sets are:
i. The standard ASCII character set, which assigns a unique 7-bit binary code to each
character, including uppercase and lowercase letters, digits, punctuation marks and control
characters, e.g. the ASCII code for the uppercase letter 'A' is 01000001 (65 in denary), while
the code for the character '?' is 00111111 (63 in denary). Both codes are shown here padded to a
full byte with a leading 0.
ii. Extended ASCII character set, which uses 8-bit codes. This allows for characters in
non-English alphabets and for some graphical characters to be included.
- ASCII has limitations in terms of the number of characters it can represent, and it does not
support characters from many languages other than English, for example Chinese characters.
iii. Unicode, which addresses these limitations by allowing for a greater range of characters and
symbols, including different languages and emojis. It is supported by many operating systems,
search engines and internet browsers used globally.
- Unicode assigns a unique code to each character. That code can be stored in binary form using a
variable-length encoding scheme that takes one or more bytes per character (ASCII uses 1 byte
whilst Unicode can use up to 4 bytes), e.g. the Unicode code for the heart symbol is U+2665, which
is stored in the UTF-8 encoding as 11100010 10011001 10100101.
- As Unicode can require more bits per character than ASCII, it can result in larger file sizes and
slower processing times when working with text-based data.
- There is an overlap with standard ASCII code, since the first 128 (English) characters are the same,
but Unicode can support over a million different characters in total.
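The character codes described above can be checked with a short Python sketch. It relies only on the built-in `ord`, `format` and `str.encode`; the helper name `to_bits` is made up for this illustration, and UTF-8 is assumed as the Unicode encoding.

```python
def to_bits(data: bytes) -> str:
    """Return the bytes as space-separated 8-bit binary groups."""
    return " ".join(format(byte, "08b") for byte in data)

# Standard ASCII: every character has a code in the range 0 to 127.
code_a = ord("A")                      # denary code of 'A'
code_q = ord("?")                      # denary code of '?'
print(code_a, format(code_a, "07b"))   # the code as a 7-bit pattern
print(code_q, format(code_q, "07b"))

# Unicode: the heart symbol has the code U+2665.
heart = "\u2665"
print(hex(ord(heart)))                 # the code in hexadecimal
print(to_bits(heart.encode("utf-8")))  # the bytes UTF-8 (assumed encoding) uses to store it
```

Running it prints the denary and binary codes for 'A' and '?', followed by the bytes that UTF-8 uses for the heart symbol.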
Similarity between ASCII and Unicode

- Both represent each character using a unique code.
- Unicode contains all the characters that ASCII contains // ASCII is a subset of Unicode.
Differences between ASCII and Unicode
- Unicode can use up to 32 bits per character (4 bytes per character), whereas ASCII uses 7 or 8
bits per character (1 byte).
- Many different languages can be represented using Unicode; ASCII is designed mainly for English.
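The similarity and the difference can both be seen in a small Python sketch. It assumes UTF-8 and UTF-32 as example Unicode encodings (these notes do not name a specific one), and the sample characters are chosen only for illustration.

```python
# ASCII is a subset of Unicode: the first 128 codes are identical.
for code in range(128):
    char = chr(code)                               # Unicode character with this code
    assert char.encode("ascii") == bytes([code])   # its ASCII byte
    assert char.encode("utf-8") == bytes([code])   # the same single byte in UTF-8

# Characters outside ASCII need more storage.
samples = ["A", "\u00e9", "\u2665", "\U0001F600"]  # letter, accented letter, heart, emoji
sizes = {}
for char in samples:
    utf8_len = len(char.encode("utf-8"))       # 1 to 4 bytes, depending on the character
    utf32_len = len(char.encode("utf-32-be"))  # always 4 bytes (32 bits)
    sizes[char] = (utf8_len, utf32_len)
    print(f"U+{ord(char):04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")
```

The loop at the top passes silently, which is the point: every ASCII character has the same code in Unicode.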
Representing Images
An image is made up of tiny dots called pixels.

Pixels are small dots of colour that are arranged in a grid. Each pixel can be represented by a
binary code which can be processed by a computer.
If an image is created using black and white, each pixel would either be black or white. The
binary value 1 could be used to represent the colour black and the binary value 0 could be used
to represent the colour white.
If each pixel is converted to its binary value, a data set such as the following could be created.
001111100010000010100000001100101001100000001101000101100
111001010000010001111100
If the computer is informed that the image that should be created using this data is 9 pixels
wide and 10 pixels high, it can set each pixel to black or white and create the image (as below).
- Data that provides information about an image, such as its dimensions and resolution, is
called metadata.
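A minimal Python sketch of how a computer could rebuild the image from the data set and its metadata. The function name `render` and the use of '#' for black and '.' for white are choices made for this illustration. The data set as printed in these notes holds 81 bits, which fills 9 rows of 9 pixels, so the sketch works out the number of rows from the data instead of assuming it.

```python
# The data set from the notes, joined into one string of bits.
bits = (
    "001111100010000010100000001100101001100000001101000101100"
    "111001010000010001111100"
)
width = 9  # metadata: number of pixels in each row

def render(bits: str, width: int) -> list[str]:
    """Split the bits into rows and draw 1 as '#' (black) and 0 as '.' (white)."""
    rows = [bits[i:i + width] for i in range(0, len(bits), width)]
    return ["".join("#" if bit == "1" else "." for bit in row) for row in rows]

picture = render(bits, width)
print("\n".join(picture))
print(len(bits), "bits,", len(picture), "rows")
```

Without the width from the metadata, the same bits could be arranged into a grid of a different shape and the picture would be lost.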
Most images use a lot more colours than black and white. Each colour has its own binary values.
Colours are created by computer screens using the Red Green Blue (RGB) colour system. This
system mixes the colours red, green and blue in different amounts to achieve each colour.
Each image has a resolution and a colour depth.
Image resolution is the number of pixels in an image, i.e. the number of pixels wide by the number
of pixels high.
Colour depth is the number of bits used to represent each colour, e.g. each colour could be represented
using 8-bit, 16-bit or 32-bit binary numbers.
The greater the number of bits, the greater the range of colours that can be represented. If the colour
depth of an image is reduced, the quality of the image is often reduced.
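The link between colour depth and the range of colours can be sketched in Python: n bits give 2^n different values. The function name `colours_available` is made up for this illustration, and the RGB example assumes 8 bits for each of the red, green and blue parts (24 bits in total), which is a common layout but is not stated in these notes.

```python
def colours_available(colour_depth_bits: int) -> int:
    """Number of different colours that fit in the given number of bits."""
    return 2 ** colour_depth_bits

for depth in (1, 8, 16, 24):
    print(f"{depth:>2}-bit colour depth: {colours_available(depth):,} colours")

# One RGB colour, assuming 8 bits per channel (values 0 to 255).
red, green, blue = 255, 165, 0               # an orange shade
pixel = (red << 16) | (green << 8) | blue    # pack the three channels into one 24-bit number
print(format(pixel, "024b"))                 # the binary value stored for this pixel
```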
If the image resolution or the colour depth of an image is changed, this will have an effect on the size of
the image file.
If the resolution is increased, the image will be created using more pixels, so more data will need to be
stored.
If the colour depth of the image is increased, each pixel will need more data to display a greater range of
colours so more data will need to be stored. Both will result in a larger file size for the image.
Representation of Sound
Converting sound to binary
Sound waves are vibrations in the air. The human ear senses these vibrations and interprets them as sound.
Each sound wave has a frequency, wavelength and amplitude. The amplitude specifies the loudness of the
sound.
Sound waves vary continuously. This means that sound is analogue. Computers cannot work with
analogue data, so sound waves need to be sampled in order to be stored in a computer. Sampling means
measuring the amplitude of the sound wave at set time intervals while the sound is being recorded.
This is done using an analogue to digital converter (ADC).
The sample rate (or sampling rate) is the number of samples taken in a second. Sample rate is measured in
hertz (Hz); 1 Hz is equal to 1 sample per second. A common sample rate is 44.1 kHz, which requires 44 100
samples to be taken each second. That is a lot of data. If the sample rate is increased, the amount of data
required for the recording is increased. This increases the size of the file that stores the sound.
The sample resolution is the number of bits used to represent each sample. For example, a common sample
resolution is 16-bit.
The higher the sample resolution, the greater the variations in amplitude that can be stored for each
sample. This means that aspects such as the loudness of the sound can be recorded more accurately. This will
also increase the amount of data that needs to be stored for each sample, which increases the size of the
file that stores the sound.
So how is sampling used to record a sound clip?

» the amplitude of the sound wave is first determined at set time intervals (given by the sampling rate)
» this gives an approximate representation of the sound wave
» each sample of the sound wave is then encoded as a series of binary digits.
Using a higher sampling rate or larger sampling resolution will result in a more faithful representation of
the original sound source. However, the higher the sampling rate and/or sampling resolution, the
greater the file size.
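The three steps can be sketched in Python using a sine wave as a stand-in for a real sound. The function name `sample_wave`, its parameters and the very small example values (8 samples per second, 4 bits per sample) are assumptions chosen to keep the output readable; a real ADC works on an electrical signal, not on a formula.

```python
import math

def sample_wave(frequency_hz, sample_rate_hz, resolution_bits, duration_s):
    """Sample a sine wave at set intervals and store each sample as a whole number."""
    levels = 2 ** resolution_bits              # amplitude values the resolution allows
    count = int(sample_rate_hz * duration_s)   # total number of samples taken
    samples = []
    for n in range(count):
        t = n / sample_rate_hz                                 # time of this sample (s)
        amplitude = math.sin(2 * math.pi * frequency_hz * t)   # between -1.0 and 1.0
        level = round((amplitude + 1) / 2 * (levels - 1))      # between 0 and levels - 1
        samples.append(level)
    return samples

samples = sample_wave(frequency_hz=1, sample_rate_hz=8, resolution_bits=4, duration_s=1)
encoded = [format(level, "04b") for level in samples]   # each sample as binary digits
print(samples)
print(encoded)
```

Raising `sample_rate_hz` gives more samples per second, and raising `resolution_bits` gives more possible values per sample; both make the stored data a closer match to the wave and both increase the amount of data.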
Measurement of data storage

A bit is the basic unit of all computing memory storage terms and is either 1 or 0. The word
comes from binary digit. The byte is the smallest unit of memory in a computer. 1 byte is 8 bits.
A 4-bit number is called a nibble – half a byte.
Memory size using denary values

Name of memory size      Equivalent denary value
1 kilobyte (1 KB)        1 000 bytes
1 megabyte (1 MB)        1 000 000 bytes
1 gigabyte (1 GB)        1 000 000 000 bytes
1 terabyte (1 TB)        1 000 000 000 000 bytes
1 petabyte (1 PB)        1 000 000 000 000 000 bytes
1 exabyte (1 EB)         1 000 000 000 000 000 000 bytes
The above system of numbering is now only used for some storage devices and is technically
inaccurate. It is based on the SI (base 10) system of units, where 1 kilo is equal to 1000.
A 1 TB hard disk drive would allow the storage of 1 × 10^12 bytes according to this system.
However, since memory size is actually measured in terms of powers of 2, another system has
been adopted by the IEC (International Electrotechnical Commission) that is based on the binary
system.
IEC memory size system
Name of memory size      Number of bytes    Equivalent denary value
1 kibibyte (1 KiB)       2^10               1 024 bytes
1 mebibyte (1 MiB)       2^20               1 048 576 bytes
1 gibibyte (1 GiB)       2^30               1 073 741 824 bytes
1 tebibyte (1 TiB)       2^40               1 099 511 627 776 bytes
1 pebibyte (1 PiB)       2^50               1 125 899 906 842 624 bytes
1 exbibyte (1 EiB)       2^60               1 152 921 504 606 846 976 bytes
This system is more accurate. Internal memories (such as RAM and ROM) should be measured
using the IEC system. A 64 GiB RAM could, therefore, store 64 × 2^30 bytes of data
(68 719 476 736 bytes).
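The two systems of units can be compared with a short Python sketch. The dictionary names and the function `to_bytes` are made up for this illustration; the factors come from the two tables.

```python
# Denary (SI, base 10) units and IEC (base 2) units, each as a number of bytes.
DENARY_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
                "TB": 10**12, "PB": 10**15, "EB": 10**18}
IEC_UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
             "TiB": 2**40, "PiB": 2**50, "EiB": 2**60}
UNITS = {**DENARY_UNITS, **IEC_UNITS}

def to_bytes(amount: int, unit: str) -> int:
    """Convert an amount in the given unit to a number of bytes."""
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit}")
    return amount * UNITS[unit]

print(to_bytes(1, "TB"))                       # a 1 TB hard disk drive
print(to_bytes(64, "GiB"))                     # a 64 GiB RAM
print(to_bytes(1, "TB") / to_bytes(1, "TiB"))  # 1 TB expressed in TiB (less than 1)
```

The last line shows why the distinction matters: a drive sold as 1 TB holds less than 1 TiB.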
Calculation of file size

In this section we will look at the calculation of the file size required to hold a bitmap image and
a sound sample.
The file size of an image is calculated as:
image resolution (in pixels) × colour depth (in bits)
The size of a mono sound file is calculated as:
sample rate (in Hz) × sample resolution (in bits) × length of sample (in seconds)
For a stereo sound file, you would then multiply the result by two.
Both formulas give a size in bits; divide by 8 to get the size in bytes.
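Both formulas can be worked through in Python. The function names, the `channels` parameter and the example figures (a 1024 by 768 image with 24-bit colour, and one minute of sound at 44.1 kHz and 16 bits) are assumptions chosen for illustration. The results cover only the raw pixel or sample data and ignore any metadata stored in the file.

```python
def image_file_size_bits(width_px: int, height_px: int, colour_depth_bits: int) -> int:
    """Image resolution (width x height in pixels) x colour depth (in bits)."""
    return width_px * height_px * colour_depth_bits

def sound_file_size_bits(sample_rate_hz: int, sample_resolution_bits: int,
                         length_s: int, channels: int = 1) -> int:
    """Sample rate x sample resolution x length; channels=2 doubles it for stereo."""
    return sample_rate_hz * sample_resolution_bits * length_s * channels

image_bits = image_file_size_bits(1024, 768, 24)
print(image_bits, "bits =", image_bits // 8, "bytes =", image_bits / 8 / 2**20, "MiB")

mono_bits = sound_file_size_bits(44_100, 16, 60)
stereo_bits = sound_file_size_bits(44_100, 16, 60, channels=2)
print(mono_bits // 8, "bytes (mono)")
print(stereo_bits // 8, "bytes (stereo)")
```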
Data Compression
Data compression is a method that uses an algorithm to reduce the size of a file.
Sound and image files can be very large. It is therefore necessary to reduce (or compress) the
size of a file for the following reasons:
- to save storage space on devices such as the hard disk drive/solid state drive
- it will take less time to transmit the file from one device to another
- it will be quicker to upload or download the file
- to use less network bandwidth when transferring files across the network / internet
- reduced file size also reduces costs. For example, when using cloud storage, the cost is
based on the size of the files stored. Also, an internet service provider (ISP) may charge a
user based on the amount of data downloaded.
There are two types of compression that can be used: lossy and lossless compression.
Lossy compression
- Uses a compression algorithm that finds and permanently removes unnecessary and redundant data
in the file. This means the original file cannot be reconstructed once it has been compressed.
- Mainly used on an image file or sound file.
- In an image, it may reduce the resolution and/or the bit/colour depth. Unnecessary data in the
image that can be removed includes colours that the human eye cannot distinguish. The number of
pixels used to create the image (image resolution) can also be reduced.
- In a sound file, it may reduce the sampling rate and/or the resolution. Unnecessary data in
the sound file that can be removed includes sounds that cannot be heard by the human ear.
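Why lossy compression cannot be undone can be shown with a deliberately simplified Python sketch that reduces the sample resolution of a few sound samples from 16 bits to 8 bits. The function names and the sample values are hypothetical, and this is only a stand-in for the idea of discarding data: real lossy formats use far more elaborate methods.

```python
def reduce_resolution(samples, from_bits, to_bits):
    """Keep only the most significant bits of each sample; the rest are discarded."""
    shift = from_bits - to_bits
    return [sample >> shift for sample in samples]

def restore_resolution(samples, from_bits, to_bits):
    """Scale samples back up; the discarded bits can only be filled with zeros."""
    shift = to_bits - from_bits
    return [sample << shift for sample in samples]

original = [12345, 12346, 40000, 65535]           # hypothetical 16-bit samples
compressed = reduce_resolution(original, 16, 8)   # each sample now needs only 8 bits
restored = restore_resolution(compressed, 8, 16)

print(compressed)
print(restored)
print(restored == original)   # False: the removed data is gone for good
```

The first two samples were different before compression and identical afterwards, so no algorithm can tell them apart again.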
