Lattice QCD is Easy to Parallelize

Regular grid of lattice points in spacetime

Sparse matrix inversion

Blue lines with arrows represent 3 × 3 complex matrices, i.e., "links"

Green dots are 3-component complex vectors, i.e., "matter fields"

Regular communication pattern

• Boundary values to neighboring nodes (domain decomposition)

• Global sums for dot products

Spacetime has four dimensions; however, for simplicity we show neighbors in only two dimensions.

Assume L^4 sites/node.

Data is stored so that each site holds the link matrices connecting it to its neighbors in the positive directions, together with the vector matter field. (A link pointing in a positive direction is stored at its tail-end site.) A link pointing in a negative direction is the adjoint of the link pointing the opposite way, which means it is actually stored at its head-end site.
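As a minimal sketch of this storage convention (using NumPy; the array layout and the helper name `negative_link` are illustrative, not taken from any actual lattice code):

```python
import numpy as np

rng = np.random.default_rng(0)

L = 4                 # linear lattice size per node
n_sites = L ** 4      # L^4 sites per node
# One 3 x 3 complex link per site per positive direction: shape (site, dir, 3, 3).
links = (rng.standard_normal((n_sites, 4, 3, 3))
         + 1j * rng.standard_normal((n_sites, 4, 3, 3)))

def negative_link(neighbor_site, direction):
    """Link pointing in the negative `direction` out of some site: the adjoint
    (conjugate transpose) of the positive-direction link stored at the
    neighboring site, i.e., at the head end of that negative link."""
    return links[neighbor_site, direction].conj().T
```

The adjoint is never stored explicitly; it is recomputed from the positive-direction link, which is why gathers from neighboring sites are needed.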

Thus, in the diagram below, the red links and the black vectors are stored at different sites, so the vectors must be gathered to the central site. On the other hand, each blue link and the corresponding green vector are stored together, but the temporary result vector stored at that site must then be moved to the central site, where all the contributions to the final result are accumulated.

Because of the domain decomposition, whenever a result must be gathered or scattered, the values for sites at the edges of a domain that must move to another domain are packed into a message and sent to the node that owns that domain.

There is an even-odd, or red-black decomposition of the problem, so that if the central site is even, then the neighboring sites at the ends of the arrows are odd.
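The parity of a site is just the sum of its coordinates mod 2; flipping one coordinate flips the parity, so every nearest neighbor of an even site is odd. A one-line sketch in Python, for illustration:

```python
def parity(x, y, z, t):
    """Even-odd (red-black) parity of a lattice site: 0 = even, 1 = odd."""
    return (x + y + z + t) % 2

# Every nearest neighbor of the even site (0, 0, 0, 0) is odd:
# moving one step in any of the four directions changes the sum by 1.
assert parity(0, 0, 0, 0) == 0
assert all(parity(*[1 if i == d else 0 for i in range(4)]) == 1 for d in range(4))
```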

Let's assume that the central site is even. The strategy for overlapping communication and computation is to start gathering the vectors from the odd sites, i.e., the black dots at the ends of the red arrows. While those messages are in flight, we start the local computation of multiplying the blue links by the green vectors at all odd sites. (This computation must be done for all four directions of links.) Once these local computations are done, we start sending the results, which are at odd sites, to the even sites where they are needed. Before we can start the second stage of the computation, we must wait until the first set of messages (the black dots) has arrived. We then start the second stage of the computation, multiplying the red links at even sites by the vectors that have just arrived. Finally, we wait until the matrix-vector products from the first stage have arrived and accumulate all the results.
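The strategy above can be sketched as a schedule of non-blocking operations. The sketch below only records the ordering; the step strings are placeholders, not real MPI calls:

```python
schedule = []

def log(step):
    """Record one step of the overlap schedule, in order."""
    schedule.append(step)

# Stage 1: overlap the gather with the purely local odd-site work.
log("start gather: vectors from odd sites (black dots)")          # non-blocking receives
log("compute: blue links x green vectors at odd sites, 4 dirs")   # local work hides the gather
log("start send: odd-site products toward even sites")            # non-blocking sends

# Stage 2: consume the gathered vectors, then the forwarded products.
log("wait: gathered odd-site vectors have arrived")
log("compute: red links x gathered vectors at even sites")
log("wait: odd-site matrix-vector products have arrived")
log("accumulate: sum all contributions at the central even site")
```

The key point is that each "wait" is deferred until there is no more local work that can proceed without the corresponding data.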

We summarize the tasks during the first stage in the three steps below and calculate the time for each step, assuming the bandwidth for passing messages is MB (Mbyte/s) and the rate for matrix-times-vector multiplication is MF (Mflop/s). Note that MB and MF are achieved rates, not maximum rates.

a) For every odd site, M × V  in each negative direction
b) For every even site on (+) boundary, receive a vector
c) For every odd site on (-) boundary, send a vector

t_a = (L^4 / 2 sites) × 66 flops × 4 / MF = 132 L^4 / MF  (microsec)

t_b = (L^3 / 2 sites) × 24 bytes × 4 / MB = 48 L^3 / MB  (microsec)

t_c = t_b

To completely overlap computation and communication, we require that t_a = t_b:

132 L^4 / MF = 48 L^3 / MB     or

MB = 48 MF / (132 L) ≈ 0.364 MF / L
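As a quick numeric check of the balance condition (the values of L and MF below are illustrative, not measurements):

```python
L = 4          # so L**4 = 256 sites per node, L**3 = 64 boundary sites per face pair
MF = 1000.0    # achieved compute rate, Mflop/s

t_a = 132 * L ** 4 / MF       # compute time for stage one, microseconds
t_b = t_a                     # full overlap requires the message time to fit inside t_a
MB = 48 * L ** 3 / t_b        # required message bandwidth, Mbyte/s

# Matches MB = 48 * MF / (132 * L), i.e., about 0.364 * MF / L.
print(round(MB, 1))  # 90.9
```

So at 1 Gflop/s achieved on a 4^4 local volume, roughly 91 Mbyte/s of achieved message bandwidth per direction is enough to hide the communication.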

If the communication is not fast enough, you must wait for the messages to arrive. We are assuming full-duplex communication, so that steps b and c can proceed at the same time.

This is a simplification!