Final Report PDF
Abstract: This report presents an efficient realization of GGR (Generalized Givens Rotation) for QR factorization that achieves 3-100x better performance in terms of Gflops/watt on multicore and GPGPUs as compared to classical Givens Rotation. GGR is an advanced form of Givens Rotation in which multiple elements of a matrix in different rows and columns are annihilated simultaneously. In the case of GGR, the number of multiplications is 33% lower than in GR. With an asynchronous circuit, the latency is expected to be lower than with a synchronous circuit.

I. INTRODUCTION
Today, speed is one of the dominant factors in the engineering field, but there is a trade-off between speed, area and power consumption. We can therefore optimize to find a point where the device is efficient at high speed. The main question that arises is why we need to do factorization at all. Factorization plays a vital role because calculations become easier for upper and lower triangular matrices. "Easier" means that the calculation is faster for a triangular matrix than for a general matrix.
QR factorization is a process that decomposes any matrix into two matrices.

A = QR

where Q is an orthogonal matrix and R is an upper triangular matrix. QR factorization is the core step of the QR algorithm, which applies the factorization iteratively to compute the eigenvalues and eigenvectors of matrix A. The product of the diagonal elements of matrix R gives the determinant of the matrix up to sign. Eigenvalues are among the most important properties of a matrix and are used in different fields of engineering such as communication systems, bridge design, car stereo system design, decoupling of three-phase systems, and the solution of linear equations. Basically, there are three types of QR factorization technique: 1) Givens Rotation, 2) Householder Transform, 3) Modified Gram-Schmidt [1]. MGS is used where the accuracy of the final solution is not so important. HT is used in the high-performance computing (HPC) field and is a numerically stable process. Givens Rotation is used in the embedded field where the end solution is critical, but since the speed of Givens Rotation is lower, there is a chance to improve the speed of the Givens Rotation technique [1].

A = QR

where, for a 3 x 3 example (the entries of Q and R are written generically and are not the same values),

$$Q = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}, \quad R = \begin{bmatrix} a & b & c \\ 0 & d & e \\ 0 & 0 & f \end{bmatrix}$$

$$|\det A| = |a \cdot d \cdot f|$$

with a, d and f taken from the diagonal of R.

For the implementation of these factorizations on FPGAs and GPGPUs, library-based approaches are used, such as BLAS (Basic Linear Algebra Subprograms), on which highly tuned packages designed specifically for such problems are built to enhance the speed. Such tuned packages include MAGMA and PLASMA [2].

In this project, floating point has been used for matrix simplification. The main reasons for using floating point are that large computations become easier and that its accuracy is high. The representation of a floating-point number is as follows:

Floating Point Representation
Sign | Exponent | Mantissa

Example: 85.125
85 = 1010101
0.125 = 0.001
85.125 = 1010101.001
= 1.010101001 × 2^6
sign = 0
For single precision:
biased exponent = 127 + 6 = 133
133 = 10000101
Normalised mantissa = 010101001
The IEEE 754 single-precision representation is:
= 01000010101010100100000000000000
Double precision:
biased exponent = 1023 + 6 = 1029
1029 = 10000000101
Normalised mantissa = 010101001
The IEEE 754 double-precision representation is:
= 0100000001010101010010000000000000000...

... architecture.
By using an RDP (Reconfigurable Data Path) we can improve the data processing, as explained in the next section. Moving towards highly tuned software, the time taken and the number of multiplications are reduced, which raises performance in terms of Gflops/watt.
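The 85.125 encoding worked through in the floating-point example can be checked mechanically. The following is a minimal sketch using Python's standard `struct` module; the helper names `float_bits` and `split_fields` are illustrative and not part of the report.

```python
import struct


def float_bits(value, double=False):
    """Return the IEEE 754 bit pattern of value as a string of 0s and 1s."""
    fmt, width = (">d", 64) if double else (">f", 32)
    raw = struct.pack(fmt, value)  # big-endian bytes of the encoded number
    return format(int.from_bytes(raw, "big"), "0{}b".format(width))


def split_fields(bits):
    """Split a 32-bit or 64-bit pattern into (sign, exponent, mantissa)."""
    exp_len = 8 if len(bits) == 32 else 11
    return bits[0], bits[1:1 + exp_len], bits[1 + exp_len:]


single = float_bits(85.125)
double = float_bits(85.125, double=True)

print(split_fields(single))
print(split_fields(double))
```

The exponent field printed for each format should match the biased exponents 133 and 1029 derived by hand, and the mantissa field should start with 010101001 followed by zeros.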
Where
$r = \sqrt{a^2 + b^2}$
$c = \cos\theta$, $s = \sin\theta$
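To make the roles of r, c and s concrete, here is a minimal sketch of the classical Givens Rotation (not GGR) applied column by column to obtain a QR factorization. It assumes the common convention c = a/r and s = b/r, so that the rotation maps the pair (a, b) to (r, 0); the function names `givens` and `qr_givens` are illustrative and not taken from the report.

```python
import math


def givens(a, b):
    """Return (c, s, r) with r = sqrt(a^2 + b^2), c = a / r, s = b / r."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0  # both entries are zero: identity rotation
    return a / r, b / r, r


def qr_givens(A):
    """Classical Givens QR of a square matrix (list of rows); returns (Q, R)."""
    n = len(A)
    R = [[float(x) for x in row] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):                  # work on one column at a time
        for i in range(n - 1, j, -1):   # annihilate R[i][j] from the bottom up
            c, s, _ = givens(R[i - 1][j], R[i][j])
            for k in range(n):
                # Rotate rows i-1 and i of R: R <- G R
                u, v = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * u + s * v
                R[i][k] = -s * u + c * v
                # Accumulate the transpose into Q: Q <- Q G^T
                u, v = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * u + s * v
                Q[k][i] = -s * u + c * v
    return Q, R


A = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
Q, R = qr_givens(A)
for row in R:
    print(["{:.4f}".format(x) for x in row])
```

Each inner step zeroes exactly one element, which is the cost that GGR aims to reduce by annihilating several elements at once.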
Fig. 2. One Iteration of Column-wise Givens Rotation

TABLE I
MULLER C-ELEMENT
A   B   OUTPUT (F_n)
0   0   0
0   1   F_{n-1}
1   0   F_{n-1}
1   1   1
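The behaviour in Table I can be expressed as a small behavioural model. This is only a software sketch of the truth table, not a circuit description; the names `c_element` and `run` are illustrative.

```python
def c_element(a, b, prev):
    """Muller C-element: follow the inputs when they agree, else hold prev."""
    if a == b:
        return a
    return prev


def run(inputs, state=0):
    """Apply a sequence of (A, B) input pairs and collect the outputs F_n."""
    outputs = []
    for a, b in inputs:
        state = c_element(a, b, state)  # state plays the role of F_{n-1}
        outputs.append(state)
    return outputs


print(run([(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]))
```

The output only changes once both inputs have reached the same value, which is why the element is useful for synchronizing request and acknowledge signals.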
B. ASYNCHRONOUS CIRCUIT
We choose the asynchronous circuit because a synchronous clock limits the achievable operating frequency. Reducing the clock period raises the operating frequency, while using an asynchronous circuit also reduces power consumption and interference in the circuit. In a synchronous circuit, clock gating is introduced to reduce dynamic power, but clock gating needs extra circuitry, so the lower dynamic power is paid for in area. For these reasons, we are looking at the asynchronous circuit [4].

1) Asynchronous Architecture: An asynchronous architecture does not need a clock. Instead, it has a sender block and a receiver block, and data is transferred with the help of request and acknowledge signals as shown below:

2) Difficulties in Asynchronous Architecture: One of the main difficulties in asynchronous architecture is synchronization across a large design. Because this is a new area of work, there are few automated tools to check functionality, and as a result synthesis is a big challenge.

From Table I, it is clear that the output will be 1 when both inputs are 1 and 0 when both inputs are 0; otherwise the output holds its previous value. This element plays a vital role in the asynchronous circuit by providing synchronization to the circuit [4].

4) Data Communication in Asynchronous Digital Circuits:

• 2-Phase Protocol:
In the 2-phase handshaking protocol, data is transferred from the sender to the receiver in two cycles. In the first cycle the sender drives the request signal high (1), which shows that the sender is ready to send the data. In the next cycle the receiver drives the acknowledge signal high, which means the receiver is ready to receive the data, and the data is then transferred from sender to receiver [4].
The main problem with this protocol is that it is not reliable, because there is no information on whether the data was transferred successfully.
• 4-Phase Protocol:
In the 4-phase handshaking protocol, data is transferred from the sender to the receiver in four cycles. In the first cycle the sender confirms that it has information available to transmit and drives the request signal high. In the second cycle
TABLE II
Fig. 7. Performance of GGR in Different Packages and Platforms

From the above graph, it is clear that different operations with different software packages give different results. The main limitation is that this software is based on synchronous circuits. Implementing these designs with asynchronous circuits is quite complicated because the software used here is written on the basis of synchronous circuits.

V. RESULTS
Implementation of an asynchronous architecture is difficult because of its complexity and the unavailability of automated software to run and calculate the performance of the different

VI. CONCLUSION
GGR, the modified form presented in this paper, needs fewer multiplications than the classical GR. In this paper, we use floating-point numbers, which are efficient for operations on very large and very small numbers. By using the asynchronous architecture we can reduce the time required in a particular section by a factor of 51 compared with the previous one. To implement the asynchronous architecture we are going to use a modified 4-phase protocol, which needs less time than the 4-phase protocol and is more reliable than the 2-phase protocol. The future scope of this project is to replace floating-point numbers with posit numbers, because floating-point numbers have some limitations.

REFERENCES
[1] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, R. Narayan, and R. Leupers, "Efficient realization of Givens rotation through algorithm-architecture codesign for acceleration of QR factorization."
[2] F. A. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, "Efficient realization of Householder transform through algorithm-architecture co-design for acceleration of QR factorization," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-1, 2018.
[3] Z. E. Rakossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, "Efficient and scalable CGRA-based implementation of column-wise Givens rotation," in ASAP, 2014, pp. 188-189.
[4] N. Bhandari, R. Shrestha, and S. R. Chowdhury, "FPGA based high performance asynchronous arithmetic logic unit and asynchronous finite state machine controller using modified 4-phase handshaking protocol."
[5] M. A. Erle, B. J. Hickmann, and M. J. Schulte, "Decimal floating-point multiplication," IEEE Transactions on Computers, vol. 58, no. 7, July 2009.
[6] S. Das, K. T. Madhu, M. Krishna, N. Sivanandan, F. Merchant, S. Natarajan, I. Biswas, A. Pulli, S. K. Nandy, and R. Narayan, "A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths," Journal of Systems Architecture - Embedded Systems Design, vol. 60, no. 7, pp. 592-614, 2014.
[7] G. Ansaloni, P. Bonzini, and L. Pozzi, "EGRA: A coarse grained reconfigurable architectural template," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 6, pp. 1062-1074, June 2011.
[8] F. Merchant, A. Chattopadhyay, G. Garga, S. K. Nandy, R. Narayan, and N. Gopalan, "Efficient QR decomposition using low complexity column-wise Givens rotation (CGR)," in 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, Mumbai, India, January 5-9, 2014, pp. 258-263.