
INFORMATION

RETRIEVAL TECHNIQUES
Lecture # 14

• Index Construction

ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following
sources
1. “Introduction to information retrieval” by Prabhakar Raghavan,
Christopher D. Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C.
Bell
3. “Modern information retrieval” by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon,
Marco Brambilla
Outline
• Scaling index construction
• Memory Hierarchy
• Hard Disk Tracks and Sectors
• Hard Disk Blocks
• Hardware basics

Scaling index construction
• In-memory index construction does not scale
• Can’t stuff entire collection into memory, sort, then write back
• How can we construct an index for very large collections?
• Taking into account the hardware constraints we just learned
about . . .
• Memory, disk, speed, etc.
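One common answer to this question, sketched below under assumed details (the block size, file handling, and helper names are illustrative, not a prescribed implementation), is blocked, sort-based construction: build postings for one block of documents at a time in memory, write each block's sorted run to disk, then merge the runs.

```python
import heapq
import itertools
import os
import tempfile

def build_index(docs, block_size=2):
    """Blocked, sort-based index construction (illustrative sketch):
    accumulate (term, doc_id) pairs for a block of documents in memory,
    write each block's sorted postings to a disk run, then k-way merge
    the runs into one inverted index."""
    run_files = []
    for start in range(0, len(docs), block_size):
        # Collect and sort the (term, doc_id) pairs for this block.
        pairs = sorted(
            {(term, doc_id)
             for doc_id, text in docs[start:start + block_size]
             for term in text.lower().split()})
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{t}\t{d}\n" for t, d in pairs)
        run_files.append(path)

    def read_run(path):
        # Stream one sorted run back from disk.
        with open(path) as f:
            for line in f:
                term, doc_id = line.rstrip("\n").split("\t")
                yield term, int(doc_id)

    # Merge all runs and group consecutive postings by term.
    index = {}
    merged = heapq.merge(*(read_run(p) for p in run_files))
    for term, group in itertools.groupby(merged, key=lambda pair: pair[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    for p in run_files:
        os.remove(p)
    return index

docs = [(1, "new home sales top forecasts"),
        (2, "home sales rise in july"),
        (3, "increase in home sales in july")]
index = build_index(docs)
print(index["home"])  # [1, 2, 3]
```

Because each run is written sequentially and merged sequentially, disk access stays block-friendly, which is exactly what the hardware constraints below reward.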
Memory Hierarchy

Moore’s Law

Hard Disk Tracks and Sectors

Hard Disk Blocks

Disk Access Time
• Access time = (seek time) + (rotational delay) + (transfer time)
• Seek time – moving the head to the right track
• Rotational delay – wait until the right sector comes below the head
• Transfer time – read/transfer the data

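The formula above can be checked with a back-of-the-envelope calculation. The numbers here are assumptions (a typical 7200 RPM drive, so the expected rotational delay is half a rotation; a 100 MB/s transfer rate), not figures from the slide:

```python
# Rough disk access time estimate with assumed, but typical, parameters.
seek_time = 5e-3            # 5 ms to move the head to the right track
rotational_delay = 4.17e-3  # half a rotation at 7200 RPM: 0.5 * 60/7200 s
transfer_rate = 100e6       # 100 MB/s sustained transfer (assumed)
bytes_to_read = 64 * 1024   # one 64 KB block

access_time = seek_time + rotational_delay + bytes_to_read / transfer_rate
print(f"{access_time * 1e3:.2f} ms")  # prints "9.83 ms"
```

Note that the seek and rotational delay dominate: the actual data transfer contributes well under a millisecond here.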
Hardware basics
• Access to data in memory is much faster than access to data on disk.
• Disk seeks: No data is transferred from disk while the disk head is
being positioned.
• Therefore: Transferring one large chunk of data from disk to memory
is faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of entire blocks (as
opposed to smaller chunks).
• Block sizes: 8 KB to 256 KB.
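The "one large chunk beats many small chunks" claim follows directly from the fact that every separate read pays a seek. A quick comparison, using assumed parameter values in the same range as the hardware table later in the lecture:

```python
seek_time = 5e-3                # 5 ms per seek (assumed)
per_byte_transfer = 2e-8        # 0.02 µs per byte (assumed)
total_bytes = 10 * 1024 * 1024  # 10 MB to read either way

# One contiguous read: a single seek, then stream all the data.
one_chunk = seek_time + total_bytes * per_byte_transfer

# 1000 scattered 10 KB reads: a seek before every chunk.
n_chunks = 1000
many_chunks = n_chunks * (seek_time
                          + (total_bytes / n_chunks) * per_byte_transfer)

print(f"one chunk: {one_chunk:.3f} s, many chunks: {many_chunks:.3f} s")
```

The transfer time is identical in both cases; the 999 extra seeks make the scattered version roughly 24 times slower under these assumptions.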
Inverted Index

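As a reminder of the structure being built, here is a minimal inverted index: a dictionary mapping each term to a sorted postings list of document IDs, plus the standard two-pointer intersection for Boolean AND queries. The toy documents are illustrative only:

```python
from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

# Dictionary: term -> sorted postings list of doc IDs.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect(index["sales"], index["july"]))  # [2, 3]
```

Postings lists are kept sorted by doc ID precisely so that this linear-time merge intersection works.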
Hardware basics (Cont.)
• Servers used in IR systems now typically have several GB of main
memory, sometimes tens of GB.
• Available disk space is several (2–3) orders of magnitude larger.
• Fault tolerance is very expensive: It’s much cheaper to use many
regular machines rather than one fault tolerant machine.
Distributed computing
• Distributed computing is a field of computer science that studies
distributed systems. A distributed system is a software system in
which components located on networked computers communicate
and coordinate their actions by passing messages. (Wikipedia)

Hardware assumptions
symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10⁻³ s
b        transfer time per byte                              0.02 μs = 2 × 10⁻⁸ s
         processor’s clock rate                              10⁹ s⁻¹
p        low-level operation (e.g., compare & swap a word)   0.01 μs = 10⁻⁸ s
         size of main memory                                 several GB
         size of disk space                                  1 TB or more
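These symbols can be plugged directly into cost estimates. For example, using s and b from the table (the 100 MB file size is an assumption for illustration), a sequential pass over a postings file costs one seek plus the per-byte transfer, while p prices the equivalent amount of in-memory work:

```python
s = 5e-3   # average seek time (from the table)
b = 2e-8   # transfer time per byte (from the table)
p = 1e-8   # time per low-level operation, e.g. compare & swap (from the table)

# Time to read an assumed 100 MB postings file in one sequential pass:
n_bytes = 100 * 10**6
read_time = s + n_bytes * b     # one seek, then stream the bytes

# Compare: performing 100 million low-level in-memory operations.
compute_time = n_bytes * p

print(f"read: {read_time:.3f} s, compute: {compute_time:.3f} s")
```

Even a single sequential disk pass costs about twice as much as 10⁸ in-memory operations here, which is why index construction algorithms minimize disk passes above all else.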
Resources for today’s lecture
• Chapter 4 of IIR
• MG Chapter 5
• Original publication on MapReduce: Dean and Ghemawat (2004)
• Original publication on SPIMI: Heinz and Zobel (2003)
