
Vedic division methodology for high-speed very large scale integration applications

Prabir Saha1, Deepak Kumar2, Partha Bhattacharyya3, Anup Dandapat1
1 Department of Electronics and Communication Engineering, National Institute of Technology, Shillong, Meghalaya 793 003, India
2 Department of Computer Science and Engineering, National Institute of Technology, Shillong, Meghalaya 793 003, India
3 Department of Electronics and Telecommunication Engineering, Bengal Engineering and Science University, Shibpur, Howrah 711 103, India
E-mail: anup.dandapat@gmail.com

Published in The Journal of Engineering; Received on 19th November 2013; Accepted on 7th January 2014

Abstract: This Letter reports a transistor-level implementation of a division methodology based on ancient Vedic mathematics. The 'Dhvajanka (on top of the flag)' formula of Vedic mathematics was adopted to implement such a divider for practical very large scale integration applications. Instead of dividing by the actual divisor, the method divides by half of the divisor bits and adds a subtraction and a few multiplications. Stage reduction through the Vedic division methodology significantly reduced the propagation delay and dynamic power consumption of the divider circuitry. The functionality of the division algorithm was checked, and performance parameters such as propagation delay and dynamic power consumption were calculated, with spice spectre in 90 nm complementary metal oxide semiconductor technology. The resulting (32 ÷ 16) bit divider circuitry had a propagation delay of only ∼300 ns and consumed ∼32.5 mW of power for a layout area of 17.39 mm². Combining Boolean arithmetic with ancient Vedic mathematics eliminated a substantial number of iterations. Compared with the most commonly used architectures (digit-recurrence, Newton–Raphson and Goldschmidt), the design reduced delay by ∼47, ∼38 and ∼34% and power by ∼34, ∼21 and ∼18%, respectively.

1 Introduction
Division is a fundamental operation in many scientific and engineering applications, such as arithmetic computation, signal processing, artificial intelligence and computer graphics [1–3]. Division is generally computed sequentially. This makes it costlier in propagation delay (latency) than other arithmetic operations such as addition, subtraction and multiplication [4].
Various researchers have carried out a substantial amount of work on high-speed dividers [1–15]. Examples include the digit-recurrence (DR) methodology (restoring [1, 3, 5] and non-restoring [2, 6, 9]), division by convergence (the Newton–Raphson (N–R) method [10–12]) and division by series expansion (the Goldschmidt (G–S) algorithm [13, 14]). Division architectures can generally be classified into two categories: (i) iteration based and (ii) multiplication based. Iterative divisions, such as radix-2 restoring and non-restoring division, consist of shift-and-subtract operations and generate one quotient bit in each iteration. After each subtraction cycle, the circuit must check whether the resulting remainder is less than the divisor or negative. DR algorithms [1–3, 5, 6, 9] have low computational complexity, but they need a large number of iterations, so their latency is high. Some researchers use higher-radix implementations of the DR algorithm [6, 7, 10] to reduce the number of iterations. This improves latency over earlier reports [1–3, 5, 9], but these schemes increase hardware complexity. Other attractive ideas are based on functional iteration, such as the N–R [10–12] and G–S [13–15] algorithms. These use multiplication together with series expansion, so the number of quotient bits obtained doubles with each iteration. These methods converge quadratically towards the quotient, yet their latency remains high. Each iteration of the N–R and G–S methods involves two dependent multiplications, because the product of the first multiplication is one of the operands of the second. An iteration therefore cannot be optimised like a parallel multiplier [13]. A further drawback of these methods is that the operands must be normalised beforehand, their main primitive is multiplication, and the remainder is not obtained directly.
At the algorithmic and structural levels, many division techniques have been developed to reduce the propagation delay and power consumption of divider circuitry by reducing the number of iterations. All of them aim at high-speed operation, but the principle behind them is the same in every case. Vedic mathematics [16] is an ancient system of mathematics with unique computation techniques based on 16 sutras (formulae). We recently reported [17] a Vedic divider based on the 'Nikhilam Navatascaramam Dasatah' sutra for specific number systems, in which the divisor is chosen very close to the base of operation. That implementation reduces the number of iterations only if the divisor is close to the base; otherwise the number of iterations increases, which is a serious bottleneck of the algorithm. In this Letter, we report a division technique and its transistor-level implementation based on the 'Dhvajanka' formula of the same ancient mathematics. 'Dhvajanka' is a Sanskrit term meaning 'on top of the flag' and is adopted from the Vedas. In this approach, the divider implementation becomes a small division (instead of division by the actual divisor), a subtraction and a few multiplications. This reduces the number of iterations and hence substantially reduces the propagation delay. The transistor-level (application-specific integrated circuit (ASIC)) implementation of the division circuitry combines Boolean arithmetic with Vedic mathematics. Performance parameters of the proposed method, such as propagation delay and dynamic switching power consumption, were calculated using spice spectre in 90 nm complementary metal oxide semiconductor (CMOS) technology. They were compared with other designs, namely DR- [9], N–R- [11] and G–S [15]-based implementations. The results revealed that the (32 ÷ 16) bit divider circuitry has a propagation delay of ∼300 ns with ∼32.53 mW of dynamic switching power for a layout area of 17.39 mm².
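The following Python sketch models the radix-2 restoring DR baseline described above as a point of reference. Each iteration shifts in one dividend bit and tries a subtraction. The result is kept only if it is non-negative, so an n-bit dividend needs n iterations. This is a generic textbook model written for illustration, not the circuit of [1, 3, 5], and the function name `restoring_divide` is hypothetical.

```python
def restoring_divide(dividend: int, divisor: int, n_bits: int) -> tuple[int, int]:
    """Radix-2 restoring division: one quotient bit per shift-and-subtract step."""
    if divisor <= 0 or not 0 <= dividend < (1 << n_bits):
        raise ValueError("need divisor > 0 and 0 <= dividend < 2**n_bits")
    remainder, quotient = 0, 0
    for i in range(n_bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # shift in next bit
        trial = remainder - divisor                            # tentative subtraction
        if trial >= 0:
            remainder = trial                 # keep the difference, quotient bit 1
            quotient = (quotient << 1) | 1
        else:
            quotient <<= 1                    # restore (keep old remainder), bit 0
    return quotient, remainder


print(restoring_divide(132, 11, 8))  # (12, 0)
```

The loop count is fixed by `n_bits`, which is why DR latency grows linearly with operand width.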

J Eng 2014, doi: 10.1049/joe.2013.0213
This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)

2 Vedic division methodology
The gifts of ancient Indian mathematics to the world history of mathematical science are not well recognised. The contributions of 'Sri Bharati Krsna Thirthaji Maharaja', a mathematician in the field of number theory, are significant for calculations; they take the form of Vedic sutras (formulae) [16]. He explored the mathematical potential of Vedic primers and showed that mathematical operations can be carried out mentally to produce fast answers using the sutras. In this Letter, we report only the ‘Dhvajanka’ formula, which is used to implement the division algorithm and its architecture.
2.1 Numerical example of ‘Dhvajanka’ sutra
Two examples, shown in Fig. 1, have been considered for illustration, and Table 1 describes their implementation procedure. In Fig. 1a, the dividend is 38 982 (a five-digit number) and the divisor is 73 (a two-digit number). In Fig. 1b, the dividend is 135 791 and the divisor is 1632. The working divisor is 7 in Fig. 1a and 16 in Fig. 1b.
Fig. 1 Illustration of ‘Dhvajanka’ sutra
a Small divisor with exact division (remainder '0')
b Large divisor (remainder ≠ 0)
Table 1 Chart implementation procedure
Step 1.
Fig. 1a: Out of divisor 73, we put down only the first digit (i.e. 7) in the divisor column and put the other digit (i.e. 3) 'on top of the flag'. One digit of the divisor has been put on top, so we allot one place at the right end of the dividend to the remainder portion of the answer. This place is marked off from the other digits by a vertical line.
Fig. 1b: Two digits of the divisor (i.e. 32) have been put on top and 16 remains in the divisor column. We therefore allot two places at the right end of the dividend to the floating-point portion of the answer and mark them off from the other digits by a vertical line.
Step 2.
Fig. 1a: 38 is divided by the most significant digit (MSD) of the divisor (i.e. 7). The quotient is 5 and the remainder is 3; this remainder is used in the next division step.
Fig. 1b: 135 is divided by the MSD part of the divisor (i.e. 16). The quotient is 8 and the remainder is 7; this remainder is used in the next division step.
Step 3.
Fig. 1a: The gross dividend is 39. It is reduced by the previous quotient digit (i.e. 5) multiplied by the least significant digit (LSD) of the divisor (i.e. 3): 39 − (5 × 3) = 24. This result is divided by 7, giving a quotient of 3 and a remainder of 3.
Fig. 1b: The gross dividend is 77. It is reduced by the previous quotient digit (i.e. 8) multiplied by the LSD part of the divisor (i.e. 3): 77 − (8 × 3) = 53. This result is divided by 16, giving a quotient of 3 and a remainder of 5.
Step 4.
Fig. 1a: In the third stage, the gross dividend is 38. As in the previous step, it is reduced by (3 × 3): 38 − 9 = 29. This result is divided by 7, giving a quotient of 4 and a remainder of 1.
Fig. 1b: In the third stage, the gross dividend is 59. It is reduced by the cross-multiplication of the LSD part of the divisor and the quotient digits obtained so far: 59 − (3 × 3 + 8 × 2) = 34. This result is divided by 16, giving a quotient of 2 and a remainder of 2.
Step 5.
Fig. 1a: In the fourth stage, the gross dividend is 12. It is reduced by (4 × 3): 12 − 12 = 0. The quotient is therefore 534 and the remainder is 0. This is the final stage for this example.
Fig. 1b: In the fourth stage, the gross dividend is 21. It is reduced by the cross-multiplication of the quotient digits and the LSD part of the divisor (2 × 3 + 3 × 2 = 12), giving 9. Dividing 9 by 16 gives a quotient of 0 and a remainder of 9.
Step 6.
Fig. 1b: The process continues until the required number of iterations is reached; with the two allotted places, the result becomes 83.20.
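The steps of Table 1 can be replayed with a short Python sketch. It illustrates the procedure under stated assumptions and is not the authors' implementation. The function name `dhvajanka_divide` and its `flag_len` parameter (the number of divisor digits placed on top of the flag) are hypothetical.
- Quotient digits are taken by floor division, so a negative gross dividend yields a negative digit instead of an explicit borrow.
- The scan starts at the first dividend digit, which only adds leading zero digits.
- For brevity, the remainder is recovered as A − Q·D followed by ±1 corrections, instead of replaying the flag corrections after the vertical bar.
- The digit recurrence stays well behaved when the working divisor is large compared with the flag digits, as in both examples.

```python
def dhvajanka_divide(dividend: int, divisor: int, flag_len: int):
    """Decimal 'on top of the flag' division (illustrative sketch).

    The last `flag_len` digits of the divisor go on top of the flag; the
    remaining leading digits form the small working divisor. Returns the raw
    quotient digits (one per dividend digit; those after the vertical bar are
    the floating-point digits), the integer quotient and the remainder.
    """
    d = str(divisor)
    if dividend < 0 or not 0 < flag_len < len(d):
        raise ValueError("need dividend >= 0 and 0 < flag_len < digits of divisor")
    main = int(d[:-flag_len])                 # digits kept in the divisor column
    flags = [int(c) for c in d[-flag_len:]]   # digits placed on top of the flag
    digits = [int(c) for c in str(dividend)]
    n_int = max(len(digits) - flag_len, 0)    # digits left of the vertical bar

    q_digits, rem = [], 0
    for a in digits:
        # Gross dividend: previous remainder prefixed to the next digit, minus the
        # cross-products of the flag digits with the latest quotient digits.
        cross = sum(f * prev for f, prev in zip(flags, reversed(q_digits)))
        gross = rem * 10 + a - cross
        q = gross // main                     # floor division keeps 0 <= rem < main
        rem = gross - q * main
        q_digits.append(q)

    quotient = 0
    for q in q_digits[:n_int]:                # value of the digits before the bar
        quotient = quotient * 10 + q
    # Sketch shortcut: remainder as A - Q*D, then +/-1 corrections of Q if needed.
    remainder = dividend - quotient * divisor
    while remainder < 0:
        quotient, remainder = quotient - 1, remainder + divisor
    while remainder >= divisor:
        quotient, remainder = quotient + 1, remainder - divisor
    return q_digits, quotient, remainder


print(dhvajanka_divide(38982, 73, 1))     # ([0, 5, 3, 4, 0], 534, 0)
print(dhvajanka_divide(135791, 1632, 2))  # ([0, 0, 8, 3, 2, 0], 83, 335)
```

In the second call, the digits after the bar (2, 0) are the floating-point digits of Fig. 1b, giving 83.20, while the exact integer remainder is 335.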

2.2 Algebraic proof of ‘Dhvajanka’ sutra
Fig. 2 shows the algebraic proof of the formula. Consider the dividend f(x) = a₃x³ + a₂x² + a₁x + a₀ and the divisor g(x) = b₁x + b₀, where x is the radix. Then

$$f(x) = Q(x)\,g(x) + R \qquad (1)$$

where Q(x) is the quotient and R is the remainder. We have to compute f(x)/g(x) with the help of the 'on top of the flag' sutra, taking b₁ as the divisor digit and b₀ as the flag digit. Write q₂ = a₃/b₁, q₁ = (a₂ − (a₃/b₁)b₀)/b₁ and q₀ = (a₁ − q₁b₀)/b₁. Then (a₃x³ + a₂x² + a₁x + a₀)/(b₁x + b₀) can be represented as (4) and (5), which reduce to (6):

$$\frac{f(x)}{g(x)} = \frac{q_2x^2(b_1x+b_0) + q_1x(b_1x+b_0) + q_0(b_1x+b_0) + (a_0 - q_0b_0)}{b_1x+b_0} \qquad (4)$$

$$\frac{f(x)}{g(x)} = \frac{(q_2x^2 + q_1x + q_0)(b_1x+b_0)}{b_1x+b_0} + \frac{a_0 - q_0b_0}{b_1x+b_0} \qquad (5)$$

$$\frac{f(x)}{g(x)} = q_2x^2 + q_1x + q_0 + \frac{a_0 - q_0b_0}{b_1x+b_0} = Q(x) + \frac{R}{g(x)} \qquad (6)$$

Fig. 2 Algebraic proof of the formula
2.3 Mathematical modelling of ‘Dhvajanka’ sutra
Let $A = \sum_{i=0}^{n-1} a_i x^i$ be the dividend and $B = \sum_{i=0}^{m-1} b_i x^i$ the divisor, where x is the radix of the number. The exponents below correspond to a divisor with half as many digits as the dividend (m = n/2). Separating the first partial product, A can be expressed in terms of B as

$$A = \frac{a_{n-1}}{b_{m-1}}\,x^{n/2}\left(b_{m-1}x^{m-1} + b_{m-2}x^{m-2} + \cdots + b_0\right) + \sum_{i=n/2}^{n-2}\left(a_i - \frac{a_{n-1}}{b_{m-1}}\,b_{i-n/2}\right)x^{i} + \sum_{i=0}^{n/2-1} a_i x^{i} \qquad (2)$$

and repeating this step on the corrected dividend gives

$$\frac{A}{B} = \sum_{k=0}^{n/2} q_k x^{k} + \frac{R}{B},\qquad q_k = \frac{1}{b_{m-1}}\left(a_{k+m-1} - \sum_{j=k+1}^{\min(n/2,\,k+m-1)} q_j\,b_{k+m-1-j}\right),\qquad R = \sum_{i=0}^{m-2}\left(a_i - \sum_{j=0}^{i} q_j\,b_{i-j}\right)x^{i} \qquad (3)$$

where the quotient terms are computed from k = n/2 downwards. For k = n/2 the inner sum is empty, so q_{n/2} = a_{n−1}/b_{m−1}. Each quotient term therefore requires division only by the leading divisor digit b_{m−1}, after subtracting the cross-products of the remaining ('flag') divisor digits with the quotient terms already found. For n = 4 and m = 2, (3) gives exactly the terms q₂, q₁, q₀ and the remainder a₀ − q₀b₀ of (6).
2.4 Illustration of Dhvajanka sutra
To understand the steps of Fig. 1a, consider dividing 38 982 by 73. Algebraically, the dividend is 38x³ + 9x² + 8x + 2 and the divisor is 7x + 3, where x stands for 10. Let us proceed with the division in the usual manner.
- Dividing 38x³ by 7x gives the first quotient term 5x². Multiplying the divisor by 5x² gives 35x³ + 15x², and subtracting it leaves 3x³ + 9x² − 15x². As x = 10, 3x³ equals 30x², so this remainder is 30x² + 9x² − 15x² = 24x².
- The first-step remainder (24x²) plus 8x is the second-step dividend. Dividing 24x² by 7x gives the second quotient term 3x. Multiplying the divisor by 3x gives 21x² + 9x, and subtracting it leaves 3x² − x.
- Now 3x² equals 30x, which together with −x + 2 gives 29x + 2 as the last-step dividend. Multiplying the divisor by 4 gives 28x + 12, and subtracting it leaves x − 10.
As x is 10, the remainder vanishes, and the quotient is 5x² + 3x + 4 = 534.
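Equations (1)–(6) describe polynomial long division in which each step divides only by the leading divisor coefficient and subtracts the cross-products of the remaining (flag) coefficients. The Python sketch below implements the recurrence of (3) with exact rational arithmetic (the standard `fractions.Fraction`). The names `dhvajanka_poly` and `evaluate` are hypothetical, and coefficient lists are written highest degree first. Unlike the digit procedure above, the sketch performs no carries such as 3x³ = 30x². The quotient coefficients for (38x³ + 9x² + 8x + 2)/(7x + 3) are therefore fractions, but evaluating at x = 10 still recovers 38 982 = Q(10) × 73 + R(10).

```python
from fractions import Fraction


def dhvajanka_poly(a, b):
    """Quotient and remainder coefficients of A(x)/B(x) via the recurrence of (3)."""
    n, m = len(a), len(b)
    if m == 0 or b[0] == 0 or n < m:
        raise ValueError("need a non-zero leading divisor coefficient and len(a) >= len(b)")
    lead = Fraction(b[0])                     # divisor digit b_{m-1}
    flags = [Fraction(c) for c in b[1:]]      # flag digits b_{m-2}, ..., b_0
    q = []
    for k in range(n - m + 1):
        # dividend coefficient minus flag cross-products with quotient terms found so far
        gross = Fraction(a[k]) - sum(f * qj for f, qj in zip(flags, reversed(q)))
        q.append(gross / lead)
    r = []
    for k in range(n - m + 1, n):             # low-order coefficients form the remainder
        cross = sum(flags[t] * q[k - 1 - t]
                    for t in range(len(flags)) if 0 <= k - 1 - t < len(q))
        r.append(Fraction(a[k]) - cross)
    return q, r


def evaluate(coeffs, x):
    """Horner evaluation of a highest-degree-first coefficient list."""
    acc = Fraction(0)
    for c in coeffs:
        acc = acc * x + c
    return acc


q, r = dhvajanka_poly([38, 9, 8, 2], [7, 3])
print(q)                                      # [Fraction(38, 7), Fraction(-51, 49), Fraction(545, 343)]
print(evaluate(q, 10) * 73 + evaluate(r, 10))  # 38982
```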

3 Divider implementation technique
Fig. 4 shows the proposed divider implementation technique. Using the algebraic identity (6), the quotient and remainder can be rewritten as

$$Q(x) = \frac{a_3}{b_1}x^2 + \frac{a_2 - (a_3/b_1)b_0}{b_1}x + \frac{a_1 - \big((a_2 - (a_3/b_1)b_0)/b_1\big)b_0}{b_1},\qquad R = a_0 - \frac{a_1 - \big((a_2 - (a_3/b_1)b_0)/b_1\big)b_0}{b_1}\,b_0$$

and the architecture has been implemented via (3). For simplicity, an (8 ÷ 4) bit divider example has been considered; dividers with more bits can be implemented in a similar manner.
3.1 Implementation procedure
Assume that the dividend is longer than the divisor. The divisor is broken into two parts: the most significant part (L) and the least significant part (R). L is compared with an equal number of dividend bits taken from the most significant bit (MSB) side. If these dividend bits are greater than or equal to L, they are divided directly by L; otherwise they are concatenated with the next significant bit of the dividend. The divide procedure is implemented through a subtractor whose first borrow is assumed to be '0'. The difference acts as the remainder, and the borrow acts as the selector input of a multiplexer. If the borrow is '0', the multiplexer sets the quotient bit (Q_n) to '1' and passes the difference on as the remainder; otherwise the quotient bit is '0'. The remainder is then concatenated with the next MSB of the dividend. The cross-multiplication of the quotient bits with the least significant part of the divisor is subtracted from this value. If the result is negative, the quotient is reduced by '1'; otherwise the result is promoted to the next stage. Fig. 3 shows the flowchart of the algorithm, and Table 2 illustrates its operation with two examples: Example 1 for perfect division (remainder = 0) and Example 2 for imperfect division (remainder ≠ 0).
Formally, consider the number $A = \sum_{i=0}^{n-1} a_i 2^i$ to be divided by $B = \sum_{i=0}^{m-1} b_i 2^i$, where $a_i, b_i \in \{0, 1\}$ and the dividend is longer than the divisor. The division proceeds in the following steps:
Step 1: Consider the most significant part of the dividend, $\sum_{i=n-(m/2)}^{n-1} a_i 2^i$, and of the divisor, $\sum_{i=m/2}^{m-1} b_i 2^i$.
Step 2: Determine $\sum_{i=n-(m/2)}^{n-1} a_i 2^i - \sum_{i=m/2}^{m-1} b_i 2^i$, which gives the quotient bit $Q_n$ and the remainder R.
Step 3: Determine $Q_n \times b_{(m/2)-1}$. Concatenate R and $a_{n-(m/2)-1}$, subtract $Q_n \times b_{(m/2)-1}$ and divide again as in step 1. Set the quotient bit $Q_{n-1}$ and the remainder R.
Step 4: Determine $Q_{n-1} \times b_{(m/2)-1} + Q_n \times b_{(m/2)-2}$. Concatenate R and $a_{n-(m/2)-2}$, subtract this sum and divide again as in step 1. Set the quotient bit $Q_{n-2}$ and the remainder R.
The remaining dividend bits are processed in the same manner.
3.2 Latency of the divider
In this section, the hardware cost of the architecture is computed from the number of complex operations in its critical path, from which the total propagation delay can be estimated. The reported architecture (Fig. 5) consists of five stages. The total latency is therefore the sum of the propagation delays of the individual stages:

$$t_{pd} = t_{stage1} + t_{stage2} + t_{stage3} + t_{stage4} + t_{stage5} \qquad (7)$$

where t_stage1, …, t_stage5 are the propagation delays of stages 1–5. The critical path of a full adder is assumed to be 2 XOR gate delays, so an n-bit parallel adder requires n × 2 XOR gate delays. The critical path of a 1-bit subtractor is assumed to be 3 XOR gate delays, so an n-bit subtractor requires 3 × n XOR gate delays.
- Stage 1 contains only a comparator [18]. An m-bit divisor requires an m/2-bit comparator, implemented through two parallel-adder stages and two XOR stages. The critical path of an m/2-bit parallel adder is (m/2) × 2 = m XOR gate delays, so this stage takes (2m + 2) XOR gate delays.
- Stage 2 contains only an m/2-bit parallel subtractor. Its critical path may be estimated as (m/2) × 3 = 3m/2 XOR gate delays.
- Stage 3 contains only a parallel adder and an n-bit subtractor in the feedback path, with at most n iterations (for imperfect division). It takes 2n + 3n XOR gate delays.
- Stage 4 contains an m/2-bit multiplier, implemented in three stages: (i) partial product generation, (ii) partial product addition and (iii) final addition [18]. Partial product generation requires m/2 XOR gate delays, assuming equal XOR and AND gate delays; the maximum depth of a partial-product column is m/2. Partial product addition may require (m/(2 × 3)) × 2 = m/3 XOR gate delays in its first stage, m/6 in the second stage, and so on. The multiplier delay may thus be approximated as m + (m/2) = 3m/2 XOR gate delays.
- Stage 5 contains an m/2-bit subtractor, whose critical path is (m/2) × 3 = 3m/2 XOR gate delays.
The total propagation delay of each iteration may therefore be approximated as

$$t_{pd} = (2m + 2) + \frac{3m}{2} + 2n + 3n + \frac{3m}{2} + \frac{3m}{2} = 5n + \frac{13m}{2} + 2 \qquad (8)$$

XOR gate delays, and n iterations may consume n(5n + (13m/2) + 2) XOR gate delays.
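The procedure of Section 3.1 and the delay model of (7) and (8) can be sketched together in Python. This is a behavioural model, not the transistor-level circuit, and the names `vedic_divide_bin` and `latency_xor_delays` are hypothetical. Quotient bits are computed by floor division, so a negative partial result lowers the running quotient; this is our reading of 'the quotient is reduced by 1'. The remainder is recovered as A − Q·B with a final ±1 correction. The divisor is split with `flag_bits` = m/2, as in Table 2, where B = 1011 gives L = 10 and R = 11.

```python
def vedic_divide_bin(dividend: int, divisor: int, flag_bits: int) -> tuple[int, int]:
    """Radix-2 'on top of the flag' division (behavioural sketch).

    The divisor is split into its upper bits L (working divisor) and its lower
    `flag_bits` bits R (flag bits); the dividend is consumed MSB first.
    Returns (quotient, remainder).
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("need dividend >= 0 and divisor > 0")
    if not 0 < flag_bits < divisor.bit_length():
        raise ValueError("flag_bits must leave at least one bit in L")
    main = divisor >> flag_bits                                              # L
    flags = [(divisor >> (flag_bits - 1 - t)) & 1 for t in range(flag_bits)]  # R, MSB first
    bits = [int(c) for c in bin(dividend)[2:]]
    n_int = max(len(bits) - flag_bits, 0)   # bits processed as quotient positions

    q_bits, rem, quotient = [], 0, 0
    for bit in bits[:n_int]:
        # concatenate the remainder with the next dividend bit, then subtract the
        # cross-multiplication of the flag bits with the latest quotient bits
        cross = sum(f * prev for f, prev in zip(flags, reversed(q_bits)))
        gross = 2 * rem + bit - cross
        qb = gross // main            # usually 0 or 1; negative when gross < 0
        rem = gross - qb * main       # 0 <= rem < L
        q_bits.append(qb)
        quotient = 2 * quotient + qb  # a negative qb lowers the running quotient
    # Remainder recovered directly; +/-1 steps fix any residual quotient error.
    remainder = dividend - quotient * divisor
    while remainder < 0:
        quotient, remainder = quotient - 1, remainder + divisor
    while remainder >= divisor:
        quotient, remainder = quotient + 1, remainder - divisor
    return quotient, remainder


def latency_xor_delays(n: int, m: int) -> float:
    """Per-iteration critical path of (7) and (8), in XOR-gate delays."""
    stages = [
        2 * m + 2,       # stage 1: m/2-bit comparator (two adder + two XOR stages)
        3 * m / 2,       # stage 2: m/2-bit subtractor, 3 XOR delays per bit
        2 * n + 3 * n,   # stage 3: n-bit adder and n-bit feedback subtractor
        3 * m / 2,       # stage 4: m/2-bit multiplier (approximation in the text)
        3 * m / 2,       # stage 5: m/2-bit subtractor
    ]
    return sum(stages)


print(vedic_divide_bin(0b10000100, 0b1011, 2))  # Example 1: (12, 0)
print(vedic_divide_bin(0b10101010, 0b1111, 2))  # Example 2: (11, 5)
print(latency_xor_delays(32, 16))               # 266.0
```

For the (32 ÷ 16) case, the model gives 5 × 32 + 13 × 16/2 + 2 = 266 XOR-gate delays per iteration, consistent with (8).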

Fig. 3 Flowchart representation of divider using dhvajanka formula
4 Results and discussion
Appropriate modifications at the device, circuit and architectural levels of the design hierarchy have been implemented to reduce the energy delay product (EDP) and power delay product (PDP) of the proposed design. The advantages of CMOS transmission gate (TG) logic over conventional CMOS and complementary pass transistor logic (CPL) are well established [19, 20]. A CMOS TG consists of one p-channel MOSFET (PMOS) and one n-channel MOSFET (NMOS) connected in parallel, so its 'ON' resistance is smaller than that of even a single NMOS. TGs are therefore used in the different modules for faster operation and better logic transformation. A dual threshold voltage (VT) operating mode was considered in simulation to determine the performance parameters. The choice of threshold voltage for each transistor in the circuit follows these rules:
(i) Placing high-VT transistors on the leakage path directly between supply and ground reduces the subthreshold leakage current and hence the static power.

(ii) Placing low-VT transistors on the signal propagation path from the input node to the output improves the performance substantially.
(iii) Rules (i) and (ii) must both be satisfied, which requires an optimised choice that leads to the minimum EDP.
The entire algorithm in this Letter was simulated, and its functionality was verified, with the spice spectre simulator. Input data were applied in a regular pattern for the experiments. Vedic dividers such as (4 ÷ 4), (4 ÷ 8), (4 ÷ 16), (8 ÷ 4), (8 ÷ 8) and (8 ÷ 16) bit were implemented, with all individual modules (subtractor, adder, cross-multiplier etc.) built from TGs to make the circuit faster. The propagation delay, dynamic switching power consumption, EDP and PDP were first computed for each circuit module. The final simulation was then carried out with all the modules, and the performance parameters were calculated. The EDP (10⁻²¹ J s) and PDP (10⁻¹² J) are quantitative measures of efficiency and of the compromise between speed and power dissipation; they are particularly important when high-speed operation is needed. The delay, average power, EDP and PDP of the different architectures were measured at a 1 V supply voltage in 90 nm CMOS technology and are tabulated in Table 3. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing. The propagation delay and switching power are the worst-case delay and power over all possible bit combinations.
The implementation methodologies were taken from references [9, 11, 15] and implemented in the same technological environment (spice spectre with standard 90 nm CMOS technology) before the performance parameters were compared. Propagation delay and dynamic power consumption were calculated using standard 90 nm CMOS technology with a 1 V power supply, operated at 250 MHz. Table 3 compares the different architectures with the proposed architecture for (4 ÷ 4), (4 ÷ 8), (4 ÷ 16), (8 ÷ 4), (8 ÷ 8), (8 ÷ 16) etc. bit dividers. Table 3 shows that the (32 ÷ 16) bit divider requires ∼300 ns to propagate a signal and consumes ∼32.53 mW of power for a layout area of ∼17.39 mm². Applying the Vedic division methodology reduces the number of iterations, and with it the propagation delay and dynamic switching power consumption. The proposed architecture was ∼47, ∼38 and ∼34% faster (in propagation delay) than the DR [9], N–R [11] and G–S [15] architectures, respectively. The corresponding reductions in power consumption were ∼34, ∼21 and ∼18%, respectively.
Table 2 Illustration of flowchart with the help of the examples. Example 1 (complete division, remainder = 0): A = 10000100, B = 1011, L = 10, R = 11. Example 2 (incomplete division, remainder ≠ 0): A = 10101010, B = 1111, L = 11, R = 11.
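The EDP and PDP figures follow directly from delay and power: PDP = P × t_d and EDP = PDP × t_d = P × t_d². As an illustration, the Python sketch below applies these definitions to the rounded (32 ÷ 16) figures quoted above (∼300 ns, ∼32.53 mW). It gives a PDP of about 9.76 × 10³ (×10⁻¹² J) and an EDP of about 2.93 × 10⁶ (×10⁻²¹ J s). Because the delay is rounded, these are approximations rather than the tabulated values. The helper name `pdp_edp` is hypothetical.

```python
def pdp_edp(delay_s: float, power_w: float) -> tuple[float, float]:
    """Return (PDP in J, EDP in J s) for one operation."""
    pdp = power_w * delay_s   # power-delay product: energy per operation
    edp = pdp * delay_s       # energy-delay product: energy weighted by delay
    return pdp, edp


pdp, edp = pdp_edp(300e-9, 32.53e-3)    # rounded (32 ÷ 16) figures from the text
print(f"PDP ~ {pdp / 1e-12:.0f} x 10^-12 J, EDP ~ {edp / 1e-21:.3g} x 10^-21 J s")
```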

Fig. 4 Hardware implementation of divider using dhvajanka formula
Fig. 5 Latency analysis of divider using dhvajanka formula

Table 3 Performance parameters as a function of the input number of bits: propagation delay (ns), dynamic switching power consumption (mW), EDP (10⁻²¹ J s), PDP (10⁻¹² J), and % savings in propagation delay and dynamic switching power consumption of the proposed methodology compared with each architecture. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing
Columns: input no. of bits; architecture (DR [9], N–R [11], G–S [15], proposed); delay, ns; power, mW; EDP, 10⁻²¹ J s; PDP, 10⁻¹² J; improvement in delay, %; improvement in power, %
Rows: (4 ÷ 4), (4 ÷ 8), (4 ÷ 16), (8 ÷ 4), (8 ÷ 8), (8 ÷ 16), (16 ÷ 4), (16 ÷ 8), (16 ÷ 16), (32 ÷ 4), (32 ÷ 8) and (32 ÷ 16) bit dividers
5 Conclusions
A new division approach based on Vedic mathematics has been proposed for ultra-high-speed and low-power very large scale integration applications. The division circuitry replaces division by the actual divisor with a small division, a subtraction and a few multiplications. This reduces the number of iterations and hence substantially reduces the propagation delay. The architecture has been implemented through the spice spectre (T-Spice V13) simulator. The proposed approach was applied to (32 ÷ 16) bit division and was found to require minimal processor memory space compared with the conventional method. The algorithm needs only a minimal number of stages for the division, which significantly reduces the operational time. In 90 nm CMOS technology, the propagation delay of the (32 ÷ 16) bit division was only ∼300 ns and its power consumption was 32.53 mW. The layout of the proposed (32 ÷ 16) bit divider, shown in Fig. 6, was implemented using L-Edit (T-Spice V-13) and occupies ∼17.39 mm². Compared with the DR-, N–R- and G–S-based implementations, the (32 ÷ 16) bit division circuitry was ∼47, ∼38 and ∼34% faster and reduced power consumption by ∼34.18, ∼21 and ∼18.06%, respectively.

Fig. 6 Layout of the proposed (32 ÷ 16) bit Vedic divider. The layout was implemented through the L-Edit (T-Spice V-13) simulator; the corresponding area was found to be 17.39 mm²
6 References
[1] Juang T.-B., Chen S.-H.: ‘A novel VLSI iterative divider architecture for fast quotient generation’. Proc. IEEE Int. Symp. Circuits and Systems, Seattle, WA, USA, May 2008, pp. 3358–3361
[2] Oberman S.F., Flynn M.J.: ‘Division algorithms and implementations’, IEEE Trans. Comput., 1997, 46, (8), pp. 833–854
[3] Deschamps J.-P., Bioul G., Sutter G.: ‘Synthesis of arithmetic circuits: FPGA, ASIC and embedded system’ (John Wiley & Sons, 2006)
[4] Hagglund R., Lowenborg P., Vesterbacka M.: ‘A polynomial-based division algorithm’. Proc. IEEE Int. Symp. Circuits and Systems, 2002, pp. 571–574
[5] Aggarwal N., Asooja K., Verma S.S., Negi S.: ‘An improvement in the restoring division algorithm (needy restoring division algorithm)’. Proc. IEEE Int. Conf. Computer Science and Information Technology, Beijing, August 2009, pp. 246–249
[6] Sutter G., Deschamps J.-P.: ‘High speed fixed point divider for FPGAs’. Proc. IEEE Int. Conf. Field Programmable Logic and Applications, Prague, August 2009, pp. 448–452
[7] Sutter G., Deschamps J.-P.: Proc. Southern Conf. Programmable Logic, Sao Carlos, April 2009, pp. 115–122
[8] Jun K., Swartzlander E.E.Jr.: ‘Modified non-restoring division algorithm with improved delay profile and error correction’. Proc. Asilomar Conf. Signals, Systems and Computers, Pacific Grove, CA, USA, 2012, pp. 1460–1464
[9] Liu W., Nannarelli A.: ‘Power efficient division and square root unit’, IEEE Trans. Comput., 2012, 61, (8), pp. 1059–1070
[10] Louvet N., Muller J.-M., Panhaleux A.: ‘Newton–Raphson algorithms for floating-point division using an FMA’. Proc. IEEE Int. Conf. Application Specific Systems Architectures and Processors, Rennes, France, July 2010, pp. 200–207
[11] Piso D., Bruguera J.D.: ‘Simplifying the rounding for Newton–Raphson algorithm with parallel remainder’. Proc. Asilomar Conf. Signals, Systems and Computers, November 2009, pp. 921–925
[12] Nenadic N., Mladenovic S.: ‘Fast division on fixed-point DSP processors using Newton–Raphson method’. Proc. Int. Conf. Computer as a Tool, Belgrade, November 2005, pp. 705–708
[13] Guy E., Seidel P.-M., Warren E.F.: ‘A parametric error analysis of Goldschmidt’s division algorithm’. Proc. IEEE Symp. Computer Arithmetic, June 2003, pp. 165–171
[14] Ercegovac M.D., Imbert L., Matula D.W., Muller J.-M., Wei G.: ‘Improving Goldschmidt division, square root, and square root reciprocal’, IEEE Trans. Comput., 2000, 49, (7), pp. 759–763
[15] Kong I., Swartzlander E.E.Jr.: ‘A rounding method to reduce the required multiplier precision for Goldschmidt division’, IEEE Trans. Comput., 2010, 59, (12), pp. 1703–1708
[16] Maharaja J.S.B.K.T.: ‘Vedic mathematics’ (Motilal Banarsidass Publishers Pvt Ltd., Delhi, 2001)
[17] Saha P., Banerjee A., Bhattacharyya P., Dandapat A.: ‘Vedic divider: novel architecture (ASIC) for high speed VLSI applications’. Proc. IEEE Int. Symp. Electronic System Design, Kochi, India, December 2011, pp. 67–71
[18] Saha P., Banerjee A., Dandapat A., Bhattacharyya P.: ‘ASIC design of a high speed low power circuit for calculation of factorial of 4-bit numbers based on ancient vedic mathematics’, Microelectron. J. (Elsevier), 2011, 42, (12), pp. 1343–1352
[19] Uyemura J.P.: ‘CMOS logic circuit design’ (Kluwer Academic Publishers)
[20] Chang C.-H., Gu J., Zhang M.: ‘Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits’, IEEE Trans. Circuits Syst. I, 2004, 51, (10), pp. 1985–1997