
PROJECT REPORT

Title: Parallelizing Singular Value Decomposition

Description: Singular value decomposition (SVD) is a fundamental operation in linear algebra.


SVD is regularly used in several applications, especially for low rank approximation and
dimensionality reduction.
SVD computation is expensive in general. However, efficient randomized approximation
algorithms exist, especially when the target dimension is much smaller than the input
dimension; this setting is called truncated SVD. The goal is to build a highly efficient,
scalable, and parallel implementation of truncated SVD that scales well to millions of vectors
with input dimensions close to 100,000 or more. Dataset volumes of this size are regularly
analyzed in social networks, genomics, and related domains.
There are some publicly available libraries for efficient computation of truncated SVD;
however, they have limited scalability on typical desktop configurations. These could be
candidates for benchmarking our implementation.

Introduction: The Singular Value Decomposition (SVD) of a matrix is a factorization of that


matrix into three matrices. It reveals intriguing algebraic characteristics and imparts significant
geometric and theoretical insights into linear transformations.

Mathematics behind SVD: The SVD of an m×n matrix A is given by the formula A = UΣVᵀ,
where
● U: m×m matrix whose columns are the orthonormal eigenvectors of AAᵀ.
● Vᵀ: transpose of an n×n matrix V whose columns are the orthonormal eigenvectors of AᵀA.
● Σ: diagonal matrix whose r non-zero entries are the square roots of the positive eigenvalues
of AAᵀ or AᵀA (both matrices have the same positive eigenvalues).
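
To make the formula concrete, here is a small worked example (illustrative only, not taken from any reference): for a 2×2 matrix with one negative entry, the sign is absorbed into U while Σ keeps the non-negative singular values.

```latex
\[
A = \begin{pmatrix} 3 & 0 \\ 0 & -2 \end{pmatrix}, \qquad
AA^{T} = A^{T}A = \begin{pmatrix} 9 & 0 \\ 0 & 4 \end{pmatrix}
\quad\Rightarrow\quad \text{singular values } \sqrt{9}=3,\ \sqrt{4}=2 .
\]
\[
U = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}, \quad
V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
U \Sigma V^{T} = \begin{pmatrix} 3 & 0 \\ 0 & -2 \end{pmatrix} = A .
\]
```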

Applications of SVD: SVD provides an exact representation of any matrix and makes it easy to
eliminate the less important data in the matrix to produce a low-rank approximation. This
makes SVD a powerful technique with a wide range of applications in various fields.
1. Data Compression: SVD is used for data compression by approximating
high-dimensional data with a lower-rank approximation. This is particularly useful in
image and signal compression.
2. Image and Signal Processing: SVD is applied in image and signal processing for
denoising, feature extraction, and compression. It helps identify dominant patterns or
features in the data.
3. Collaborative Filtering in Recommender Systems: In recommendation systems, SVD is
used for collaborative filtering to predict user preferences based on the preferences of
similar users. Matrix factorization techniques, including SVD, are common in this
context.
4. Natural Language Processing (NLP): SVD is used in NLP tasks such as latent semantic
analysis (LSA) to extract underlying relationships and patterns in large textual datasets. It
helps in identifying the semantic meaning of words and documents.
5. Principal Component Analysis (PCA): PCA, a dimensionality reduction technique, is
closely related to SVD. SVD is used to compute the principal components of a dataset,
which are then used for feature reduction and data visualization.
6. Quantum Mechanics and Quantum Computing: SVD is employed in quantum mechanics
for the analysis of quantum states and operations. In quantum computing, SVD is used in
quantum algorithms for tasks like quantum singular value transformation.
7. Statistics and Machine Learning: SVD is used in various statistical and machine learning
applications. For example, it is applied in regression analysis, feature selection, and
clustering. It can be used for handling multicollinearity in linear regression models.
8. Biomedical Imaging: In medical imaging, such as MRI and CT scans, SVD is used for
image reconstruction, noise reduction, and feature extraction. It helps in enhancing the
quality of medical images.
9. Control Systems and System Identification: SVD is applied in control systems and
system identification to analyze and model dynamic systems. It helps in identifying
dominant modes of a system and designing controllers.
10. Internet of Things (IoT): SVD is used in IoT applications for data analysis and anomaly
detection. It can help identify patterns and trends in large datasets generated by IoT
devices.

Need for Parallelization:


For an m×n matrix, the full SVD has a computational complexity of approximately
O(min(m²n, mn²)), which makes it computationally expensive for large matrices.
Therefore, parallelizing SVD is imperative in various contexts due to practical challenges and
considerations:
1. Handling Large Datasets: In real-world applications, datasets can be enormous and may
not fit into the memory of a single machine. Parallelizing SVD facilitates the distribution
of computations across multiple processors or nodes, enabling the processing of large
datasets that would be impractical to handle sequentially.
2. Managing Computational Complexity: SVD involves computationally intensive tasks like
matrix factorizations, singular value computations, and matrix multiplications.
Parallelizing these operations substantially reduces overall computation time, making it
viable to perform SVD on large matrices within a reasonable timeframe.
3. Scalability Requirements: As dataset or matrix sizes increase, scalability becomes
paramount. Parallelization allows for efficient scaling of SVD computations across
multiple computing resources, ensuring that the algorithm can handle growing data sizes
without compromising performance.
4. Optimal Resource Utilization: Parallel computing optimally uses multiple processors or
cores, maximizing hardware resources. This is crucial in modern computing
environments where multi-core processors and distributed computing clusters are
prevalent.
5. Real-time Processing Needs: Applications requiring real-time or near-real-time
processing necessitate parallelized SVD. Parallel execution assists in meeting tight
deadlines for tasks such as data analysis, signal processing, and machine learning in
dynamic environments.
6. Parallel Matrix Multiplication Benefits: Matrix multiplication, a fundamental SVD
operation, can become a bottleneck. Parallelizing matrix multiplication, a highly
parallelizable task, significantly contributes to accelerating the overall SVD process.
7. Distributed Computing Environments: In distributed computing setups where data is
spread across multiple nodes, parallelizing SVD enables collaborative processing. Each
node can independently compute its local data, and the results can be combined to obtain
the final SVD.
8. Performance Optimization Objectives: Parallelization serves as a means to optimize the
performance of SVD algorithms. This is crucial in scientific computing, data analysis,
and other domains where efficient algorithms are essential for timely results.

Existing Parallel Approaches:


Parallelizing SVD is challenging due to the inherently sequential nature of the algorithm.
However, several methods and techniques exist to parallelize certain aspects of the SVD
computation. Here are a few approaches:
1. Block Algorithms: Divide the large matrices into smaller blocks and perform SVD on
each block independently. This can be done in parallel as the computations within each
block are independent of the others.
2. Parallelizing Matrix Multiplication: Matrix multiplication is a fundamental operation in
SVD. Parallelizing the matrix multiplication step can significantly speed up the overall
SVD computation. Use parallel linear algebra libraries or frameworks that are optimized
for parallel execution (a minimal sketch of this idea appears after this list).
3. Parallelizing Power Iteration: Power iteration is often used in the computation of singular
values and vectors. Distributing multiple power iterations across different processors can
be an effective way to parallelize this part of the SVD.
4. Data Parallelism: While dealing with a large dataset, we can consider using data
parallelism. Distribute different parts of the data across multiple processors, compute
partial SVDs independently, and then combine the results.
5. Parallelizing Lanczos Algorithm: The Lanczos algorithm is often used in iterative
methods for SVD. Parallelizing the Lanczos iteration steps can lead to improved
performance.
6. GPU Acceleration: Utilize Graphics Processing Units (GPUs) for parallel processing.
Many linear algebra operations, including those involved in SVD, can be significantly
accelerated on GPUs.
7. Distributed Computing: In a distributed computing environment, distribute the
computation across multiple nodes. Each node can handle a portion of the data, and the
results can be combined at the end.
8. Hybrid Approaches: Combine multiple parallelization techniques. For example, one can
use data parallelism along with parallelized matrix multiplication to achieve better
performance.
The effectiveness of these parallelization methods depends on the specific characteristics of
the problem, the size of the matrices involved, and the architecture of the computing system.
Additionally, not all steps of the SVD algorithm are easily parallelizable, and achieving
optimal performance may require a careful balance between different parallelization strategies.
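
As an illustration of approach 2 above, the following is a minimal OpenMP sketch of parallel dense matrix multiplication. It is only a sketch under simple assumptions (row-major storage in a flat std::vector, illustrative names), not part of the project's implementation.

```cpp
#include <vector>
#include <omp.h>

// Minimal sketch: C = A * B with A (m x k), B (k x n), C (m x n),
// all stored in row-major order. The outer loop over rows is
// embarrassingly parallel, so each thread writes a disjoint set of rows.
void matmul_parallel(const std::vector<double>& A,
                     const std::vector<double>& B,
                     std::vector<double>& C,
                     int m, int k, int n) {
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

Because each thread writes a disjoint set of rows of C, no synchronization is needed inside the loop.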

Parallel Algorithms: We have explored two parallel algorithms.


1. Dense Matrix SVD Algorithms - Jacobi Singular Value Decomposition and its extensions,
such as the two-sided Jacobi algorithm and the Block-Jacobi modification.
2. Sparse Extreme Singular Values - Lanczos algorithm

Dense Matrix SVD Algorithms: Dense matrix Singular Value Decomposition (SVD)
algorithms are numerical techniques designed to decompose a dense (fully populated) matrix into
three other matrices, revealing the singular values and singular vectors of the original matrix.
Dense matrices are those where most of the entries are non-zero.

Jacobi Singular Value Decomposition (SVD): This is an iterative method for computing the
SVD of a matrix through a series of orthogonal transformations, known as Jacobi rotations. The
algorithm diagonalizes the matrix by iteratively applying these rotations until convergence is
achieved. The resulting diagonalized matrix reveals the singular values and singular vectors of
the original matrix.

● We start with the original matrix A and initialize the orthogonal matrices U and V as
identity matrices. Set the matrix B = A.
● The main loop of the algorithm consists of iteratively applying Jacobi rotations to the
matrix B until convergence. The goal is to introduce zeros in off-diagonal elements.
● A Jacobi rotation is a 2x2 orthogonal matrix that zeros out one off-diagonal element. The
rotation is applied to B as B ← JᵀBJ, where J is the Jacobi rotation matrix.
● After each rotation, convergence is checked by examining the magnitude of the
off-diagonal elements. If the off-diagonal elements are sufficiently close to zero, the
algorithm is considered converged.
● After each successful rotation, the orthogonal matrices U and V are updated by
accumulating the rotation matrices.
● Repeat the process until convergence is achieved. The diagonal elements of the matrix B
become the singular values, and the columns of U and V contain the left and right
singular vectors, respectively.
● The final result is the decomposition A = UΣVᵀ, where U and V are orthogonal matrices,
and Σ is a diagonal matrix with the singular values on the diagonal.
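
To make the rotation step concrete, here is a minimal sketch of a single Jacobi rotation applied to a symmetric matrix B (for example B = AᵀA), with the rotation accumulated into V. This is an illustrative sketch, not the project's code; the 2D-array storage and the function signature are assumptions.

```cpp
#include <cmath>

// Minimal sketch of one Jacobi rotation that zeros the off-diagonal
// pair (p, q) of a symmetric n x n matrix B. The same rotation is
// accumulated into V so that the eigenvectors (right singular vectors
// when B = A^T A) are recovered at the end.
void jacobi_rotate(double** B, double** V, int n, int p, int q) {
    if (B[p][q] == 0.0) return;               // already zero, nothing to do
    double theta = (B[q][q] - B[p][p]) / (2.0 * B[p][q]);
    double t = (theta >= 0 ? 1.0 : -1.0) /
               (std::fabs(theta) + std::sqrt(theta * theta + 1.0));
    double c = 1.0 / std::sqrt(t * t + 1.0);  // cosine of rotation angle
    double s = t * c;                         // sine of rotation angle

    for (int i = 0; i < n; ++i) {             // B <- B J (update columns p, q)
        double bip = B[i][p], biq = B[i][q];
        B[i][p] = c * bip - s * biq;
        B[i][q] = s * bip + c * biq;
    }
    for (int i = 0; i < n; ++i) {             // B <- J^T B (update rows p, q)
        double bpi = B[p][i], bqi = B[q][i];
        B[p][i] = c * bpi - s * bqi;
        B[q][i] = s * bpi + c * bqi;
    }
    for (int i = 0; i < n; ++i) {             // accumulate rotation into V
        double vip = V[i][p], viq = V[i][q];
        V[i][p] = c * vip - s * viq;
        V[i][q] = s * vip + c * viq;
    }
}
```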

Two-sided Jacobi algorithm: The two-sided Jacobi scheme is an extension of the Jacobi SVD
algorithm that simultaneously applies Jacobi
rotations to both the left and right singular vector matrices. In the standard Jacobi SVD
algorithm, the rotations are applied separately to the left singular vector matrix U and the right
singular vector matrix V. The two-sided Jacobi scheme, however, performs joint rotations on
both U and V in each iteration. This simultaneous rotation of both matrices leads to faster
convergence.
The two-sided Jacobi scheme offers improved numerical stability and convergence properties
compared to the one-sided Jacobi algorithm. However, it comes with increased computational
cost due to the simultaneous rotations.

Block-Jacobi modification: It is an enhancement of the traditional Jacobi SVD algorithm. It


involves partitioning the original matrix into smaller blocks and applying the Jacobi rotations to
these blocks simultaneously. This modification is aimed at improving the efficiency and
parallelizability of the SVD computation, especially for large matrices.
● Instead of applying Jacobi rotations to individual elements or 2x2 submatrices, the matrix
is divided into blocks. These blocks can be square or rectangular and are treated as
separate entities for the purpose of rotations.
● Jacobi rotations are applied simultaneously to the blocks, allowing for parallel processing
of different portions of the matrix. This can lead to increased parallelizability and
improved computational efficiency.
● Similar to the standard Jacobi algorithm, convergence criteria are used to determine when
the SVD computation has reached a satisfactory result. Typically, these criteria involve
examining the magnitude of off-diagonal elements within the blocks.
● After each iteration, the updates to the singular value decomposition are performed based
on the rotations applied to the blocks. This includes updating the singular values and the
left and right singular vectors.
● The iterative process of simultaneous rotations and updates is repeated until the
convergence criteria are met.
Improvements observed:
● Parallel Processing: The block-wise decomposition allows for parallel computation of
rotations on different blocks, which leads to improved efficiency, especially on parallel
computing architectures.
● Reduced Communication Overhead: In distributed computing environments, the use of
block-wise decomposition can reduce the need for communication between processors,
potentially improving scalability.
● Increased Numerical Stability: The use of larger blocks may lead to increased numerical
stability, particularly for matrices with specific structures or characteristics.

Explanation of the implemented code:


Step 1- Matrix Generation: matrix-generation.py
The Python script generates a random dense matrix based on the number of rows and columns
provided as command line arguments, and then writes the matrix data to a file. Additionally, the
script can print the transposed matrix if a specific flag is provided as a command line argument.
Step 2- SVD algorithm implementation: svd.cpp
● Checks if the command line arguments contain the number of rows and columns for the
matrix generation.
● Initializes variables to store the number of rows and columns for later use.
● It opens a file called 'matrix' for writing the matrix data.
● Creates two matrices filled with zeros: one with dimensions M x N and the other with
dimensions N x M.
● Iterates through the matrices, generating random numbers and assigning them to the
respective positions.
● Writes the matrix data to the file, separating each number with a tab and each row with a
newline.
● Prints the transposed matrix if an additional command line argument '-py' is provided.
● Prompts the user to input two numbers if the command line arguments do not contain the
required input.
Step 3- Parallel SVD implementation: parallel-svd.cpp
● The program begins by including various standard C++ and external libraries for
functionalities like I/O operations, time tracking, and OpenMP directives. It also defines
constants and declares a helper function for determining the sign of a number.
● The main function takes command line arguments for the size of the matrix and potential
options for parallel processing or debugging.
● It checks the existence and correctness of the command line arguments, ensuring that the
matrix size is specified and that the matrix is square.
● The program creates and initializes various arrays and variables for performing matrix
computations, including matrices matrix_U, matrix_V, matrix_Ut, matrix_Vt, and
matrix_A, as well as arrays matrix_S, index1, and index2.
● It allocates memory for these arrays using dynamic memory allocation and initializes
certain variables used for computation.
● The program proceeds to read from a file named "matrix" to populate the transposed
matrices matrix_Ut and matrix_Vt.
● It then performs the main computation, which may involve matrix operations and
potentially utilizes OpenMP parallelization directives for parallel processing.
● The results are displayed, showcasing the matrices matrix_U, matrix_V, and the diagonal
matrix matrix_S.
● Additionally, the program may generate Octave files for debugging purposes if the
corresponding command line options are provided, allowing for further analysis and
visualization of the matrices.
● After these operations, the program deallocates the memory that was previously allocated
for the arrays and matrices, ensuring clean-up of resources.
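
The report does not reproduce parallel-svd.cpp, so the following is only a hedged sketch of how an OpenMP Jacobi sweep can be parallelized, using the one-sided (Hestenes) variant in which each rotation touches exactly two columns; the round-robin pairing, the assumption of an even number of columns, and all names are illustrative.

```cpp
#include <cmath>
#include <vector>
#include <omp.h>

// Rotate columns p and q of U (m x n) so that they become orthogonal,
// and accumulate the same rotation into V (n x n).
void rotate_columns(double** U, double** V, int m, int n, int p, int q) {
    double alpha = 0.0, beta = 0.0, gamma = 0.0;
    for (int i = 0; i < m; ++i) {
        alpha += U[i][p] * U[i][p];
        beta  += U[i][q] * U[i][q];
        gamma += U[i][p] * U[i][q];
    }
    if (std::fabs(gamma) < 1e-15) return;             // already orthogonal
    double zeta = (beta - alpha) / (2.0 * gamma);
    double t = (zeta >= 0 ? 1.0 : -1.0) /
               (std::fabs(zeta) + std::sqrt(1.0 + zeta * zeta));
    double c = 1.0 / std::sqrt(1.0 + t * t), s = c * t;
    for (int i = 0; i < m; ++i) {                     // rotate columns of U
        double up = U[i][p], uq = U[i][q];
        U[i][p] = c * up - s * uq;
        U[i][q] = s * up + c * uq;
    }
    for (int i = 0; i < n; ++i) {                     // accumulate into V
        double vp = V[i][p], vq = V[i][q];
        V[i][p] = c * vp - s * vq;
        V[i][q] = s * vp + c * vq;
    }
}

// One sweep over all column pairs using a round-robin schedule: in each
// of the n-1 rounds the n columns form n/2 disjoint pairs, so all
// rotations of a round are independent and can run in parallel.
// U starts as a copy of A; after the sweeps converge, the column norms
// of U are the singular values, the normalized columns of U are the left
// singular vectors, and V holds the right singular vectors.
void parallel_jacobi_sweep(double** U, double** V, int m, int n) {
    std::vector<int> pos(n);                          // n is assumed even
    for (int i = 0; i < n; ++i) pos[i] = i;
    for (int round = 0; round < n - 1; ++round) {
        #pragma omp parallel for
        for (int k = 0; k < n / 2; ++k)
            rotate_columns(U, V, m, n, pos[k], pos[n - 1 - k]);
        int last = pos[n - 1];                        // rotate pos[1..n-1]
        for (int i = n - 1; i > 1; --i) pos[i] = pos[i - 1];
        pos[1] = last;
    }
}
```

Because the pairs within a round share no columns, the rotations of a round can run concurrently without data races; sweeps are repeated until every column pair is orthogonal.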
Step 4- Comparison of matrices to check the validity of SVD: comparison.cpp
● The program starts by including necessary C++ libraries and declares the main function
that takes in command-line arguments.
● It initializes variables and arrays to store the results of the matrices and computes the sum
of absolute values and differences to compare the two sets of matrices.
● It reads matrix data from files and validates if the files are open before reading. If the
files are not open, it displays an error message and exits the program.
● The program then allocates memory for the matrices and reads the matrix data into the
arrays from the corresponding files.
● It computes the sum of absolute values for each matrix and then computes the differences
between the sums as part of the comparison process.
● Based on the differences computed, it validates the matrices and prints whether they are
"VALID" or "NOT VALID".
● Additionally, if a specific command-line flag (-p) is provided, it prints the differences in
the matrices.
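
As a hedged sketch of the comparison idea in Step 4 (not the actual comparison.cpp), the check below sums the absolute values of two matrices and compares the totals against a tolerance; the tolerance value and the flat-array layout are assumptions.

```cpp
#include <cmath>
#include <cstdio>

// Sum the absolute values of each matrix and compare the totals;
// the matrices are declared "VALID" if the totals agree within tol.
bool matrices_match(const double* A, const double* B, int rows, int cols,
                    double tol = 1e-6) {
    double sumA = 0.0, sumB = 0.0;
    for (int i = 0; i < rows * cols; ++i) {
        sumA += std::fabs(A[i]);
        sumB += std::fabs(B[i]);
    }
    double diff = std::fabs(sumA - sumB);
    std::printf("difference of absolute sums: %g\n", diff);
    return diff < tol;
}
```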

Sparse Extreme Singular Values: Sparse matrices warrant more specialized SVD computation
techniques, as full decompositions are infeasible. Subspace iterations, Lanczos/block-Lanczos
methods, trace minimization and Davidson-based procedures mainly find a few extreme singular
values/vectors accurately.
Working with the normal matrix AᵀA or an augmented matrix yields the desired singular values
as interior eigenvalues, which causes slow convergence for eigenvalue-based methods. While
subspace iteration can suppress unwanted eigenvalues via polynomial transformations, Lanczos
cannot.
Alternatively, the smallest eigenvalues of AᵀA equal the squares of the matrix's smallest
singular values.
This allows trace minimization methods to converge correctly. A quadratic minimization
formulation leads to independent linear systems that refine search subspaces. Chebyshev
acceleration, Ritz shifting and left singular vector refinement further aid convergence.
We have explored the Lanczos algorithm in our project.

Lanczos algorithm: The Lanczos algorithm is an iterative method used for finding a few
eigenvalues and corresponding eigenvectors of a large, sparse symmetric matrix. It is particularly
well-suited for problems in numerical linear algebra and is often applied in the context of
eigenvalue problems, such as those encountered in quantum mechanics, structural mechanics,
and other scientific simulations. The Lanczos algorithm was originally developed by Cornelius
Lanczos in the 1950s.
● We start with a vector v1 of random or predetermined initial values.
● Iteratively apply the Lanczos procedure to generate a tridiagonal matrix T that is similar
to the original symmetric matrix A.
● Orthogonalize the vectors generated during the iteration process to maintain
orthogonality and numerical stability.
● The Lanczos iteration produces a tridiagonal matrix T that approximates the original
symmetric matrix A. The eigenvalues of T are approximations of the eigenvalues of A
and the Lanczos vectors provide corresponding approximations of the eigenvectors.
● Use standard methods, such as the QR algorithm, to compute the eigenvalues and
eigenvectors of the tridiagonal matrix T. These eigenvalues are then considered as
approximations of the eigenvalues of the original matrix A.
● The Lanczos algorithm can be iteratively restarted to improve accuracy or to find
additional eigenvalues. The algorithm is typically terminated when the desired number of
eigenvalues or a specified convergence criterion is reached.
The Lanczos algorithm is particularly useful for large, sparse matrices because it avoids explicit
diagonalization of the entire matrix. Instead, it works with a small, tridiagonal matrix,
significantly reducing the computational cost. The method is efficient in situations where only a
subset of eigenvalues is needed.
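
For illustration, here is a minimal sketch of the basic Lanczos iteration for a symmetric matrix; for SVD it would be applied to AᵀA or to the augmented matrix mentioned earlier. This is not the project's code: the dense storage, the fixed starting vector, and the omission of reorthogonalization are simplifying assumptions.

```cpp
#include <cmath>
#include <vector>

// Basic Lanczos iteration for a symmetric n x n matrix A (dense here for
// simplicity; a sparse matrix-vector product would be used in practice).
// After k steps, alpha (diagonal) and beta (off-diagonal) define the
// tridiagonal matrix T whose eigenvalues approximate those of A.
void lanczos(double** A, int n, int k,
             std::vector<double>& alpha, std::vector<double>& beta) {
    std::vector<double> v(n), v_prev(n, 0.0), w(n);
    for (int i = 0; i < n; ++i)                      // normalized start vector
        v[i] = 1.0 / std::sqrt((double)n);
    double b = 0.0;                                  // beta from previous step
    for (int j = 0; j < k; ++j) {
        for (int i = 0; i < n; ++i) {                // w = A * v
            double s = 0.0;
            for (int l = 0; l < n; ++l) s += A[i][l] * v[l];
            w[i] = s;
        }
        double a = 0.0;                              // alpha_j = v . w
        for (int i = 0; i < n; ++i) a += v[i] * w[i];
        alpha.push_back(a);
        for (int i = 0; i < n; ++i)                  // w -= alpha*v + beta*v_prev
            w[i] -= a * v[i] + b * v_prev[i];
        b = 0.0;                                     // beta_{j+1} = ||w||
        for (int i = 0; i < n; ++i) b += w[i] * w[i];
        b = std::sqrt(b);
        if (j + 1 < k) beta.push_back(b);
        if (b < 1e-12) break;                        // invariant subspace found
        for (int i = 0; i < n; ++i) {                // next Lanczos vector
            v_prev[i] = v[i];
            v[i] = w[i] / b;
        }
    }
}
```

The eigenvalues of the small tridiagonal matrix T are then computed with a standard method such as the QR algorithm, as described above.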

Parallelization Approach: A root process reads the entire sparse matrix and assigns matrix
columns to processes for load balancing, minimizing communication. Column partitions are
picked to balance non-zero elements across processes while avoiding fractions of columns.
Processes then independently perform Lanczos computations on their sub-matrices and gather
intermediate results for reorthogonalization. MPI derived data types and collective routines are
used for efficient data transfers.
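
The referenced implementation is not reproduced here; the sketch below only illustrates the kind of distributed operation such a column-partitioned Lanczos relies on, namely a local matrix-vector product followed by an MPI reduction. The dense column-block layout and all names are assumptions (the actual implementation works with sparse sub-matrices, derived datatypes, and explicit reorthogonalization steps).

```cpp
#include <mpi.h>
#include <vector>

// Each rank owns a block of columns of A and the matching entries of v.
// It computes its partial product y_partial = A_local * v_local, and the
// partial results are combined so every rank holds the full vector y,
// which is then used in the next Lanczos step.
void distributed_matvec(const std::vector<double>& A_local, // m x n_local, row-major
                        const std::vector<double>& v_local, // n_local entries
                        std::vector<double>& y,             // m entries on output
                        int m, int n_local, MPI_Comm comm) {
    std::vector<double> y_partial(m, 0.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n_local; ++j)
            y_partial[i] += A_local[i * n_local + j] * v_local[j];
    y.assign(m, 0.0);
    // Sum the partial products across all ranks.
    MPI_Allreduce(y_partial.data(), y.data(), m, MPI_DOUBLE, MPI_SUM, comm);
}
```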
Experiments: Test matrices with varying sizes and densities (fraction of non-zero entries) are
generated to simulate real document corpora. Two cases are tested on a nine-node Linux
cluster: constant size with growing density, and constant density with growing size.

Results: The experiments reveal that speedups are only observed beyond a density threshold,
below which parallel overheads dominate. This manifests as relative speedups for higher density
matrices, and occasional slowdowns for sparse cases.
Analysis suggests that data distribution and the messaging costs of reorthogonalization
determine efficiency; these scale with the matrix dimensions. The computational load, however,
is also proportional to matrix density, so sufficient density absorbs the overheads and allows
scalability.
The tests also demonstrate that, keeping density fixed, increasing the matrix size consistently
improves speedups. This aligns with Amdahl's law, as larger problems make the parallelizable
Lanczos portion more dominant.

Performance Comparison of Jacobi SVD and Lanczos Algorithms:

● Matrix Type: Lanczos is well-suited for large, sparse matrices, especially when only a few
extreme eigenvalues and eigenvectors are of interest. Jacobi SVD is better suited for dense
matrices, and it directly provides the SVD decomposition.
● Numerical Stability: Jacobi SVD is generally more numerically stable than Lanczos, making
it suitable for a broader range of applications.
● Computation Time: The comparison in terms of computation time depends on the size and
sparsity of the matrices. Lanczos tends to be more efficient for large sparse matrices, while
Jacobi SVD tends to be more efficient for dense matrices.
● Parallelization: Both algorithms benefit from parallelization. Lanczos, when used in
eigenvalue problems, exploits parallelism during the iterations. Jacobi SVD is parallelized
by utilizing parallel linear algebra operations.
● Memory Requirements: Lanczos often requires less memory, as it operates iteratively on the
matrix without explicitly forming the entire matrix. Jacobi SVD may require more memory,
especially for large dense matrices.

Challenges faced in parallelizing Singular Value Decomposition:


Parallelizing Singular Value Decomposition (SVD) poses several challenges due to the
inherently sequential nature of the algorithm and the intricate dependencies between its steps.
Here are some major challenges faced while implementing parallel SVD:
1. Sequential Dependencies: SVD has strong sequential dependencies, where the
computation of singular values and vectors depends on the previous steps. This limits the
opportunities for parallelization, as many operations must be performed in a specific
order.
2. Matrix Factorizations: SVD involves matrix factorizations, such as bidiagonalization or
tridiagonalization, which may not parallelize efficiently. These factorizations often
require sequential processing and can become bottlenecks in the parallelization process.
3. Communication Overhead: In distributed computing environments, communication
between processors or nodes can introduce overhead. SVD algorithms often involve data
exchanges and updates across different parts of the matrix, leading to increased
communication costs.
4. Load Imbalance: Load balancing is a challenge in parallelizing SVD, particularly for
algorithms that involve dividing the matrix into blocks or submatrices. Unevenly
distributed workload across processors can lead to inefficient utilization of resources.
5. Numerical Stability: Maintaining numerical stability is crucial in SVD. Parallelization
may introduce additional rounding errors or synchronization issues, impacting the
accuracy of the results.
6. Memory Bandwidth: SVD computations involve frequent access to memory. In parallel
environments, issues related to memory bandwidth and contention can arise, affecting
overall performance.
7. Parallel Matrix Multiplication: Matrix multiplication is a fundamental operation in SVD,
and parallelizing this operation efficiently is challenging. It requires careful consideration
of data distribution and communication patterns to avoid performance bottlenecks.
8. Dynamic Load Balancing: In dynamic environments where the size of the matrices or the
number of available processors may change, adapting the parallelization strategy
dynamically becomes a challenge.
9. Algorithmic Complexity: Many parallel SVD algorithms are complex, with multiple
iterations and interdependencies. Coordinating these aspects in a parallel environment can
be intricate.
10. Scalability: Achieving good scalability, i.e., maintaining performance as the size of the
problem or the number of processors increases, is a common challenge in parallel SVD
implementations.

Despite these challenges, hybrid approaches that combine multiple parallelization techniques
and leverage specialized hardware such as GPUs are often employed to address some of these
issues and enhance the performance of parallel SVD computations.

Future scope of work:


The future scope of work in the field of Singular Value Decomposition (SVD) and related areas
is vast and dynamic, with ongoing research and advancements.
1. Efficient Parallelization Techniques: Continued research into more efficient
parallelization techniques for SVD algorithms, especially for large-scale and sparse
matrices. Enhancing scalability and minimizing communication overhead in distributed
computing environments.
2. Robust Numerical Methods: Developing more robust numerical methods for SVD that
can handle ill-conditioned matrices and numerical stability issues, ensuring accurate
results in a wider range of applications.
3. Parallel Algorithms for Special Matrix Structures: Designing parallel algorithms tailored
for specific matrix structures, such as block-circulant matrices or structured sparse
matrices. Customizing parallelization strategies can optimize performance in specialized
applications.
4. Low-Rank Approximations: Further exploration of low-rank approximations and
randomized algorithms for SVD. Investigating their efficiency in different scenarios and
developing hybrid methods that combine the strengths of traditional and randomized
approaches.
5. Applications in Machine Learning and Data Science: Extending the application of SVD
in machine learning and data science, exploring new ways to leverage SVD for tasks such
as dimensionality reduction, collaborative filtering, and feature extraction.
6. Distributed and Cloud Computing: Researching SVD algorithms suitable for distributed
and cloud computing environments. Addressing challenges related to data distribution,
fault tolerance, and dynamic scaling to accommodate the demands of modern computing
infrastructures.
7. Tensor Decompositions: Expanding research into higher-order tensor decompositions,
which generalize matrix decompositions like SVD to multi-dimensional data. Developing
efficient parallel algorithms for tensor decompositions in various scientific and data
analysis applications.
8. Real-Time Processing: Enhancing SVD algorithms for real-time and streaming data
processing. Investigating methods to adapt SVD computations to evolving data streams
and dynamic environments.

References:
1. https://www.engr.colostate.edu/~hj/conferences/164.pdf: Parallel Algorithms for Singular
Value Decomposition - Renard R. Ulrey, Anthony A. Maciejewski, and Howard Jay Siegel
2. https://core.ac.uk/download/pdf/81925994.pdf: An Overview of Parallel Algorithms for
the Singular Value and Symmetric Eigenvalue Problems - Michael Berry and Ahmed Sameh
3. https://www.math.pku.edu.cn/teachers/xusf/private/dump/parallel%20lanczos.pdf: A
Parallel Implementation of Lanczos Algorithm in SVD Computation - Srikanth Kallurkar

Team Details:
1. Madhuparna Ghosh(CS21MDS14016)
2. Leela Pavani Dulla(CS22MDS15027)
3. Anurag Roy(CS20MDS14010) - Did not participate due to health issues.
