| Know
Your Problem
Is
it amenable to a parallel solution? Example:
Lattice QCD
![]() Blue lines with arrows represent 3 x 3 complex matrices, i.e., "links" Green dots are 3-component complex vectors, i.e., "matter fields"
M × V
M = 18 reals = 72 bytes
1.45 bytes input
0.36 bytes output
The
heart of conjugate gradient is (sparse matrix) x vector. To parallelize,
use domain decomposition. The system has four dimensions: three space and time.
However, for simplicity, the diagrams will only show neighbors in two dimensions.
Assume there are L4 sites/node. The data is stored so that at each
site there are link matrices to connect the current site and neighboring sites in
positive directions, and there is the vector matter field. In other words, if
a link is pointing in a positive direction, it is stored at its tail end site.
If a link is drawn with
its arrow in a negative direction, it is the adjoint of the matrix with the link
pointing the opposite way. That means it is actually stored at its head end.
Thus, in the diagram below, the red links and the black vectors are stored
at different sites and the vectors must be gathered to the central site. On the
other hand, each blue link and the corresponding green vector are stored
together, but the temporary result vector stored at that site will have
to be moved to
the central site to accumulate all the contributions to the final result at the
central site.
Because of the domain decomposition, whenever a result has to be gathered or
scattered the values for sites at the edges of the domain that must be moved to
another domain will put into a message and sent to the node that owns the other
domain.
There is an even-odd, or red-black decomposition of the problem, so that
if the central site is even, then the neighboring sites at the ends of the
arrows are odd.
![]() Let's assume that the central site is even. The strategy to overlap communication and computation is to start to gather the vectors from the odd sites, i.e, the black dots at the end of the red arrows. While those messages are passed, we start the local computation of multiplying the blue links by the green vectors for all odd sites. (This computation must be done for all four directions of links.) Once these local computations are done, we start sending the results which are at odd sites to the even sites at which they are needed. Before we can start the second stage of the computation, we must wait until the first set of messages have arrived (the black dots). We then start the second stage of the computation multiplying the red links at even sites by the vectors that have just arrived. Then, we wait until the matrix-vector products from the first stage of the computations have arrived and accumulate all the results. We summarize the tasks during the first stage in the three steps below, and calculate the time for each step assuming the bandwidth for passing messages is MB and the rate for matrix times vector is MF. Note that MB and MF are achieved rates, not maximum rates.
a) For every odd site, M × V in each negative direction
To completely overlap computation and communication, require that ta = tb MB = 48 MF/ (132 L) = 0.364 MF / L
If the
communications is not fast enough you must wait for message to arrive.
We are assuming full duplex communication, so that steps b and c can proceed
at the same time.
This is a simplification!
|