
Multiprocessing the Sieve of Eratosthenes

Shahid H. Bokhari University of Engineering and Technology Lahore, Pakistan

A parallel version of the two-thousand-year-old algorithm for finding prime numbers tests the capabilities of a multiprocessor.

More than two thousand years ago, Eratosthenes of Cyrene described a procedure for finding all prime numbers in a given range. This straightforward algorithm, known as the Sieve of Eratosthenes, is to this day the only procedure for finding prime numbers.1 In recent years it has been of interest to computer scientists and engineers because it serves as a convenient benchmark against which to measure some aspects of a computer's performance. Specifically, the Sieve tests the power of a machine (or of a compiler) to access a very large array in memory rapidly and repeatedly. This power is clearly influenced by memory access time, the speed at which indexing is done, and the overhead of looping. Over the last decade, we have seen, in both professional journals and popular computer magazines,2,3 numerous instances of the Sieve's use as a performance benchmark.

I have implemented a parallel version of the Sieve on the Flex/32 shared-memory multiprocessor at NASA Langley Research Center. In this article, I describe this implementation and discuss its performance. I show that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of this algorithm, there is no advantage in using more than four or five processors unless dynamic load balancing is employed. I describe the performance of the algorithm with and without load balancing, and compare the performance with theoretical lower bounds and simulated results. There is good agreement between simulated results and observed run times, which indicates that the performance parameters used in the simulation are an accurate measure of the machine's performance.

I feel that the parallel version of the Sieve is very useful as a test of some of the capabilities of a parallel machine. The parallel algorithm is straightforward, and so is the process for checking the final results. However, the efficient implementation of the algorithm on a real parallel machine, especially in the dynamic load-balancing case, requires thoughtful design. The basic operations required by this algorithm are very simple. This simplicity of basic operations means that the slightest overhead shows up prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor; it is also an interesting challenge for the programmer.

The Sieve of Eratosthenes


Suppose I wish to find all prime numbers between 2 and N. Eratosthenes' algorithm proceeds as follows. The numbers 2 to N are first written down. I start with the first number (that is, 2) and, stepping through the list, cross out all of its multiples (in this case, 4, 6, 8, ...). Having
COMPUTER 0018-9162/87/0400-0050$01.00 © 1987 IEEE

exhausted the list, I return to the starting point, look at the next uncrossed number on the list (which is 3), and cross out its multiples (6, 9, 12, ...). Coming back again to the beginning of the list, the next uncrossed number I encounter is 5 (4 having been crossed out during the first pass). I repeat the crossing-out sweep through the list with the multiples of 5, and so on. The uncrossed numbers left at the end of this process (which is illustrated in Figure 1 for N = 50) are the primes.

Several issues need to be clarified. First, I explicitly eliminate all multiples of 2 from the list, even though I could have started with a list of odd numbers only and reduced the total work considerably. I do not exclude even numbers because my purpose is not to generate prime numbers efficiently, but to use the Sieve as a measure of a computer's performance. Second, I need to perform the crossing-out sweeps only up to √N, because a composite number less than N cannot have all its factors greater than √N. For example, if N = 100, I need only sweep through four times to remove the multiples of 2, 3, 5, and 7. There is no point in removing the multiples of 11, since 22 will have been removed during the sweep with 2, 33 will have been removed during the sweep with 3, and so on.

It is interesting that the first of these issues (listing only the odd numbers) was known to Eratosthenes and is mentioned in the ancient text Introduction to Arithmetic by Nicomachus,4 which is the oldest reference to the Sieve commonly available in translation. Eratosthenes does not, however, state the notion of restricting the sweeps to less than √N. This notion first appears explicitly in the works of Leonardo of Pisa (who is also known as Fibonacci) more than 1400 years later.5

A parallel version of the Sieve

E. Shapiro6 proposes the following parallel version of the Sieve. Eratosthenes starts off with a large number of slaves and instructs the first one to write down the integers 2, 3, 4, ... on the ground in a field. A second slave is dispatched with instructions to cross out all multiples of 2. Eratosthenes now moves down the "list" of numbers and dispatches a new slave whenever an uncrossed number is encountered. Thus, the third slave is sent off with instructions to cross out the multiples of 3, the fourth crosses out all multiples of 5, and so on. The uncrossed numbers that remain at the end of this process are the primes.

It is amusing to speculate if this is actually how Eratosthenes sieved prime numbers, whether he had an unlimited supply of slaves, and, if not, how many slaves he used to sieve what range of numbers and in how much time. If this really happened, it probably represents one of the earliest instances of the implementation of a parallel algorithm. However, this "parallel" version of the Sieve is not alluded to in any of the standard reference works on classical mathematics.5-7

For the purposes of this article, it suffices to note that the Sieve is easily parallelized and that the scenario described above can be programmed on a real shared-memory multiprocessor. There are, however, some subtle points that need to be brought out. Note first of all that synchronization is implicit in the above description. Eratosthenes politely waits until all the slaves he has dispatched have moved at least one number away from him before he moves forward himself and examines the next number. This ensures that he does not uselessly dispatch a slave to cross out the multiples of a non-prime. For example, when Eratosthenes looks at 6, the slaves crossing out multiples of 2 and 3 should already have passed over it. If neither of them has, then a new slave will needlessly be dispatched to cross out multiples of 6. This will not affect the correctness of the algorithm, but it will definitely reduce the algorithm's efficiency. Furthermore, slaves who are moving rapidly through the field, because they are fast runners, or because they are striking out multiples of large primes, or for both reasons, must not overrun the slave performing the initialization, who is writing down the integers from 2 to N.

April 1987

Figure 1. Sieving the first 50 numbers. The original list (first column) contains all integers from 2 to 50. The first sweep removes all multiples of 2; the second sweep removes all multiples of 3; and so on. It is not necessary to continue the sweep-out process beyond √50 (see main text for explanation). The numbers that remain in the last column (that is, those left after the sweep-out process) are the primes.


Figure 2. Architecture of the Flex/32 multiprocessor.
The second point to appreciate is that the total amount of work done by any slave is inversely proportional to the size of the prime number whose multiples he is eliminating. When slaves are sieving a finite range of numbers, I expect that those handling the last few primes will finish their work quickly and be idle for most of the time. This phenomenon has a major impact on the efficiency of the parallel algorithm, as I discuss below.
The Flex/32
The Flex/32 currently being installed at NASA Langley Research Center is made up of 20 processors that are based on the National Semiconductor 32032 microprocessor.8 Each processor has about 1M byte of local memory, and all processors can access about 2M bytes of shared memory through a global bus.

Figure 3. The basic parallel algorithm for the Sieve. (Variables in shared memory have their names spelled wholly in uppercase letters for the convenience of the reader. Access to these variables is by means of Lock/Unlock commands whenever the possibility of synchronization problems exists; for the sake of clarity, these commands are not shown in the listing. No synchronization is needed when accessing LIST, since all slaves are writing the same value, that is, " ".)

Figure 2 illustrates the current configuration of the machine. Pairs of processors reside on each of 10 local buses,

and all 20 machines must access the shared memory by going through the local and then the global bus. The processors are numbered from 1 to 20; at Langley, processors 1 and 2 operate under a Unix operating system and are used (a) for program development and (b) for loading and booting up the remaining processors when a parallel program is to be run. Thus, under normal circumstances, 18 processors are available for parallel processing-these run the MMOS concurrent operating system. Amongst these 18, only processor 3 has a console attached to it. It has been my experience at Langley that the overhead of handling console I/O can significantly impact the running times of processes on this processor. I have therefore excluded this processor from my experiments, which report timings for only 17 processors. At the time the research reported in this

article was carried out, the machine was operating in an experimental mode, and not all of the facilities envisaged for its ultimate form were available on it. Of greatest importance is the fact that the facility for locking shared-memory locations was implemented in software rather than in hardware and resulted in heavy overhead.

The basic parallel algorithm

In my parallel implementation of the Sieve of Eratosthenes, there is one master process that orchestrates all work. I use one slave process per processor and timeshare the master process with one of the slaves. (The reason for this is discussed below.) I use an array of dimension N to sieve N numbers. This array is of type character, and its elements are all initialized to "*". Locations whose indices are multiples of primes are then set to " " (blank) as the algorithm proceeds. In the end, the indices of the locations that are still "*" are the prime numbers. The initialization is done in parallel by all available slaves. Thus, if there are p processors, each initializes a subrange of length approximately N/p. This is in contrast with the scenario described in the section entitled "A parallel version of the Sieve" (above); in that scenario, one slave does the initialization.

Once initialization is complete, the master moves forward through the array, dispatching a slave whenever it encounters a location that is set to "*". Slaves move through the array, setting to " " locations that are multiples of prime numbers. Should no idle slaves be available, the master waits until a working slave has finished its sweep and signaled its availability. Figure 3 lists this basic algorithm, and Figure 4 illustrates its execution for N = 100 and p = 3.

Figure 4. Execution of the parallel Sieve algorithm for N = 100, p = 3. Time advances from top to bottom. All operations are assumed to access memory and take one unit of time. Spawning time is not shown in this figure. An idle processor is represented by "-". Some sequences of entries (indicated "...") have been omitted.

Details of implementation. Examination of Figure 3 reveals that the bulk of the work in this algorithm is the setting to " " of locations in shared memory. This is perhaps the simplest memory operation that one can conceive. It effectively precludes the use of any synchronization to ensure that the master processor does not overshoot a slave; the overhead would be many times the amount of useful work done. I have therefore not used any such synchronization. As I discussed in the section entitled "A parallel version of the Sieve" (above), this omission of synchronization does not invalidate the algorithm, but it does cause an overhead because some slaves are occasionally redispatched to strike out multiples of non-primes. This overhead is several orders of magnitude less than the overhead of synchronization, and I am happy to live with it.

Figure 5. When the master proceeds very rapidly as compared with the slaves, the possibility of "false primes" arises. In the figure, some slaves are recorded as needlessly striking off multiples of 4, 6, and 8. This occurs because the signaling time in this example is five times the memory access time. In general, false primes can occur whenever the master progresses faster than a slave. I minimize these false primes by always timesharing the master with one of the slaves.

To minimize the incidence of the above-mentioned "false primes," I timeshare the master process with another slave. This reduces the rate of progress of the master and almost eliminates the incidence of overruns. If I were not to do this, there would be a large amount of extra work caused by false primes, as illustrated in Figure 5.

When running this algorithm with p processors, I spawn all p slaves at the beginning of the program. The master signals each slave to proceed with a prime number. Each slave signals back to the master when it has finished the work allotted to it. A different approach would be to spawn a slave whenever one is required and to let it die upon completing its work. While this alternative results in a more elegant algorithm, the overhead of spawning a new process on most conventional multiprocessors is too high to permit this to be an efficient implementation. This issue is discussed by Jordan9 for other applications.

The algorithm was implemented in C augmented with a few parallel-processing primitives (see Figure 3). The parallel-programming primitives are:

* when(condition). (This causes the calling process to wait until the specified condition is true.)
* process(procedure-name(parameters), processor-number). (This is used to start up the specified procedure on the specified processor.)
* lock(shared-variable). (An attempt is made to lock a variable in shared memory. If the variable is not currently locked by another process, the lock will be granted.)
* unlock(shared-variable). (In other words, release a locked variable.)

I was able to take advantage of C's access to machine registers and to obtain very good array-access times, as indicated by the timings that are discussed in the following sections.


Observed timings for the basic algorithm. Figure 6 shows the times measured. Note that there is a significant spread in these times, because of the uncertainty inherent in parallel processing. To appreciate this uncertainty, note that the "when(condition)" statement is implemented on a conventional machine as a call to an operating-system primitive. At periodic intervals, the operating system checks to see if the condition has become true. There is no guarantee that a process will get its "when" signal at precisely the same instant every time the program is executed. Because of this, the signal to a slave to proceed may be sent slightly later during one run than in another. The master may assign a slightly heavier load to one slave than to another. This spread of times does not occur for runs with one processor, because there is essentially no parallel processing in that case. Nor does it occur for runs with very large numbers of processors, since there is not enough parallelism in the problem to keep all the slaves busy; an idle slave is always waiting to be dispatched with a new prime whenever the master so requires.

The time to run the algorithm decreases as the number of processors increases. However, there is no decrease in the running time beyond about six processors. This is because the slaves sieving the first few primes dominate the total computation time. This is illustrated in Figure 7 for a problem with seven processors. It can be seen that the slave sieving 2 goes on for the longest time, while the others finish their work quickly and remain idle.

Figure 7. An execution of the parallel Sieve in which N = 200 and p = 7. This execution illustrates poor utilization of processors. After time = 104, only one processor is active. (The notation "-" represents an idle processor. Some sequences of entries have been omitted for brevity.)

Load balancing

To improve the utilization of processors and thus reduce the running time, I developed a load-balancing algorithm and measured its performance. The load-balancing strategy is for each idle slave to take half the load of the most heavily loaded slave. This load-balancing phase is not started until the master signals the completion of the dispatching phase (which is to say, until the master has swept through the first √N numbers). The load-balancing step is executed by each processor as it becomes idle; Figure 8 shows one instance of load balancing, in which an idle processor takes off half of the long "tail" seen in Figure 7. (I show just one load-balancing step so as to avoid a congested diagram.)

There is a stiff cost for load balancing. This arises because each processor's limit and current location must be accessible to every other processor and thus must be stored in shared memory. This is in contrast with the basic algorithm (see Figure 3), in which each processor's location ("place") and limit ("localLimit") can be kept in machine registers.

Figure 9 compares the observed times of the two algorithms. It can be seen that the runtime of the load-balancing algorithm is initially higher than that of the basic algorithm, but that the difference decreases as the number of processors increases. After about 10 processors are employed, the benefit of load balancing exceeds the cost of using shared memory, and the running time of the new algorithm drops below that of the old. It is also noteworthy that there is a significant "spread" of runtimes for all numbers of processors greater than 1. This is because the load-balancing algorithm has greater parallelism (with the attendant uncertainty) than the basic algorithm. This is shown in detail in Figure 10, which shows portions of plots for 0.5, 1, and 2 million numbers.

Comparison of observed and predicted timings

In order to verify my understanding of the various factors that influence the performance of the basic algorithm, I first obtained times for several fundamental performance parameters of the Flex and then ran a simulation of the parallel Sieve. I felt that the extent to which simulated results agreed with actual experiments would indicate how well I understood the performance of the Flex.

My timing experiments yielded the following numbers. The time to access a location in shared memory in a "for" loop is 10 microseconds. The time for one process to signal another is 2 milliseconds (this signaling capability is currently implemented in software). The time to spawn a process is 13 milliseconds. This last number appears excessive, but it has a minor impact on the experiments because I spawn each slave only once during the lifetime of the program.

These performance characteristics were used to develop a simulation of the basic algorithm. I also used the memory access times to obtain lower bounds on the time required to sieve. This was done by running a conventional serial Sieve of Eratosthenes algorithm, counting the number of times any memory location was referenced, and dividing by the number of processors. Figure 11 compares the simulated timings and lower bounds with the averages of 15 runs each of the basic algorithm and of the load-balancing algorithm. These curves are for a problem of size 1 million numbers sieved. The measured timings are the averages of the timings shown in Figure 9. The following subsections summarize the results.
Comparison with simulated results. It is seen in Figure 11 that the simulated and observed run times of the basic algorithm agree completely for p > 7. For 1 < p < 7, the observed time is somewhat higher than the simulated time; this difference is caused by the overhead of process switching (the master process is timeshared with one of the slaves) and decreases smoothly as the number of processors increases. This occurs because as the number of processors increases, the disruption caused by process switching on one processor becomes a smaller and smaller fraction of the total work done by all processors.

A lower bound on the time to sieve. The lowermost curve in Figure 11 plots the lower bound on the time to sieve 1 million numbers. This curve was obtained by counting the number of a serial Sieve algorithm's accesses to memory and dividing by the number of processors. Clearly, no parallel algorithm can do better than this lower bound. It can be seen that this curve is very close to the simulation curve at p = 1, but that the two diverge as the number of processors increases. This is, of course, because the simulation takes signaling and spawning time into account, while the lower bound only counts the number of memory accesses. Beyond p = 6, the simulation curve flattens out because of the lack of parallelism.

Comparing the lower-bound curve with the load-balancing curve shows that the two have roughly the same shape. At p = 1, the load-balancing algorithm is about 1.65 times greater than the lower bound. This factor increases smoothly to 2.4 for p = 17. The initial difference at p = 1 arises because the lower bound assumes register-indexed memory access, while the load-balancing algorithm uses indirection through a shared-memory location to access the shared array. The overhead of task switching also contributes to this factor, as it does in the algorithm that does not incorporate load balancing. This factor increases because of the increasing overhead of load balancing as the number of processors increases. It may be possible to implement a more sophisticated load-balancing algorithm that uses registers for the initial computation and shared-memory locations during a second phase. Such an algorithm would be able to approach the lower bound more closely.

Figure 9. Comparison of observed times of the basic algorithm and the load-balancing algorithm for N = 1M (N = total numbers sieved).

Figure 10. Detail of plots of observed times of the basic algorithm and the load-balancing algorithm for N = 0.5M, 1M, 2M (N = total numbers sieved).

I have described a parallel implementation of the Sieve of Eratosthenes on the Flex/32 multiprocessor and discussed its performance as a function of the number of processors employed. The time to sieve a given range of numbers with the basic algorithm decreases as the number of processors utilized increases. However,

Figure 11. Averages of observed times of the two algorithms for N = 1M, as compared with simulation and lower-bound results (N = total numbers sieved).

this decrease flattens out at about six processors. The load-balancing algorithm, which is more complex, is initially slower, but beats the basic algorithm after about 10 processors. I have also compared the performance of the basic algorithm with a simulation. There is good agreement between the two, indicating that the machine's performance is well enough understood that running times can be predicted with fair accuracy. In a graph, I also compared some plots of performance with a plot of the theoretical lower bound for sieving a range of numbers. The proximity of the experimental runtimes and the theoretical bound is a measure of the machine's efficiency.

I feel that the parallel algorithm for the Sieve will serve as a useful test of some aspects of multiprocessor performance. Its implementation on other commercially available multiprocessors will provide interesting comparisons.

Acknowledgments

This research was supported by NASA Contract No. NAS1-18107 while the author was in residence at the Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, Virginia. I am indebted to R.G. Voigt for his unceasing encouragement of this research through some very difficult times. T. Crockett, G. Wright, and S. Bostic were generous with their time and patience in helping me use the Flex. Discussions with J.H. Saltz, D.M. Nicol, and H. Jordan were very useful, and V.K. Naik continued in his role as "C reference of last resort."

Shahid H. Bokhari has been on the faculty of the Dept. of Electrical Engineering at the Pakistan University of Engineering and Technology, Lahore, since 1980; he is currently an associate professor. During 1978-79 and again during 1984-86, he was a visiting scientist at the Institute for Computer Applications in Science and Engineering, NASA Langley Research Center. His research interests include parallel and distributed systems, performance evaluation, and computer architecture. He received the Best Presentation Award at the 1981 International Conference on Parallel Processing, which was sponsored by the Computer Society of the IEEE and held in Bellaire, Michigan. Bokhari is a member of the Association for Computing Machinery and the Computer Society of the IEEE, and is a senior member of the IEEE. He received the BSc degree in electrical engineering from the Pakistan University of Engineering and Technology, Lahore, in 1974. He received the MS and PhD degrees in electrical and computer engineering from the University of Massachusetts, Amherst, in 1976 and 1978, respectively.

References

1. D. Hawkins, "Mathematical Sieves," Scientific American, Vol. 199, No. 6, Dec. 1958, pp. 105-112.
2. J.R. Edwards and G. Hartwig, "Benchmarking the Clones," Byte, Vol. 10, No. 11, Nov. 1985, pp. 195-201.
3. D.A. Patterson, "A Performance Evaluation of the Intel 80286," Computer Architecture News (ACM SIGArch Newsletter), Vol. 10, No. 5, Sept. 1982, pp. 16-18.
4. M.L. D'Ooge (translator), Nicomachus of Gerasa: Introduction to Arithmetic, Macmillan, New York, 1926.
5. G.A. Miller, Historical Introduction to Mathematical Literature, Macmillan, New York, 1916.
6. E. Shapiro, "Concurrent Logic Programming Techniques," tutorial presented at the 1985 International Conference on Parallel Processing, St. Charles, Ill., Aug. 23, 1985.
7. D.E. Smith, History of Mathematics, Vol. 1, Ginn, Boston, 1923.
8. N. Matelan, "The Flex/32 MultiComputer," Proc. 12th Int'l Symp. Computer Architecture, June 17-19, 1985, Computer Society Press, Los Alamitos, Calif., pp. 209-213.
9. H.F. Jordan, "Structuring Parallel Algorithms in an MIMD, Shared Memory Environment," Parallel Computing, Vol. 3, No. 2, May 1986, pp. 93-110.

Readers may write to Shahid Bokhari at the University of Engineering and Technology, Dept. of Electrical Engineering, Lahore-31, Pakistan.
