How I Adjudicate Tangled Terminal States Using D-Wave Hardware
D-Wave’s hardware is not easy to understand or use properly. I have trouble understanding how to get it to do what I want, and I have a PhD in physics and was the CTO of the company for more than a decade!
I feel for normal people trying to figure out how to use these things. But man they are so cool. Using them must be like using the early computers in the 1950s when no one knew what was going on. Exciting!
In this post, I want to show you what I did to get the system to do what I want.
Some Background
D-Wave processors can be thought of as physical graphs, where the vertices are qubits and the edges are couplers between pairs of qubits.
They are a type of neuromorphic processor, physically structured like a neural network, where the qubits play the role of neurons and the couplers play the role of connections between neurons.
They are also thermodynamic processors, in that they are designed to return samples from a probability distribution.
A typical D-Wave processor is a moderately large graph, with thousands of vertices and tens of thousands of edges. I’ll call the hardware graph the target graph.
Problems we want to solve using this type of system are usually also naturally defined on graphs. For example, in Tangled, the problems are defined on the game graph. We call the graph defining the problem we want to solve the source graph.
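To make the two graphs concrete, here's a minimal networkx sketch for the three-vertex Tangled case. The coupler values, the 'j' edge-attribute name, and the Zephyr size parameter are all placeholders I chose for illustration:

```python
import networkx as nx
import dwave_networkx as dnx

# Source graph: the three-vertex complete game graph. The edge
# attribute 'j' holds the Ising coupler for that edge (placeholder
# values; in Tangled these encode the players' moves).
source = nx.complete_graph(3)
nx.set_edge_attributes(source, {(0, 1): -1, (0, 2): -1, (1, 2): +1}, name="j")

# Target graph: the hardware qubit/coupler graph. dwave_networkx can
# generate Zephyr-topology graphs like those on Advantage2-class chips
# (the size parameter here is an assumption, not the real chip size).
target = dnx.zephyr_graph(6)
```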
The Algorithm I Settled On For Using D-Wave Hardware
After a lot of poking and prodding, I settled on a procedure. I'll start with the punchline and show you the algorithm itself; after that, I'll explain each of the steps, including what they are and how they might affect solution quality.
Algorithm 1
Given: a source game graph E, a hardware target graph G, and integer parameters {M, N, S}:
Offline: find a set of P non-overlapping embeddings of E into G; find a set of automorphisms (vertex permutation symmetries) A of E
Online:
Store all fully disconnected vertices (if any) in E in a list L
Repeat M times:
randomly select an automorphism a from A; embed multiple non-overlapping copies of a into hardware using the embeddings P
(optional) randomly select a gauge across all qubits in the chip
(optional) iteratively shim the linear terms S times to get close to <sⱼ>=0 for all j across the chip
collect N samples
For each disconnected vertex in L, for all M×P×N samples associated with that vertex, replace the hardware samples with +1 or -1 with 50% probability
Analyze all M×P×N samples, determine score
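Putting the whole loop in one place, here's a minimal Python sketch of Algorithm 1. It's a skeleton under assumptions, not working code as-is: the helpers (chip_scale_couplers, random_gauge, undo_gauge, shim_flux_biases, unembed_samples) are placeholder names for the per-step sketches later in this post, and `sampler` stands for an Ocean QPU sampler.

```python
import random
import numpy as np

def algorithm_1(source, embeddings, automorphisms, sampler, M, N, S,
                use_gauge=True, use_shim=True):
    """Skeleton of Algorithm 1. The helper functions are placeholders
    for the per-step sketches later in this post."""
    disconnected = [v for v in source.nodes if source.degree(v) == 0]
    batches = []
    for _ in range(M):
        a = random.choice(automorphisms)               # online step 1
        J = chip_scale_couplers(source, a, embeddings)
        h = {}                                         # Tangled: no linear biases
        gauge = None
        if use_gauge:                                  # online step 2 (optional)
            J, gauge = random_gauge(J)
        kwargs = {"num_reads": N}
        if use_shim:                                   # online step 3 (optional)
            kwargs["flux_biases"] = shim_flux_biases(sampler, h, J, S)
        result = sampler.sample_ising(h, J, **kwargs)  # online step 4
        batches.append(unembed_samples(result, a, gauge, embeddings))
    samples = np.vstack(batches)     # M*P*N rows, one column per source vertex
    for v in disconnected:           # disconnected vertices are fair coins
        samples[:, v] = np.random.choice((-1, +1), size=samples.shape[0])
    return samples
```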
Offline Step 1: Find a Set of P Non-Overlapping Embeddings of E into G
If the source graph E is much smaller than the target graph G, there will be many ways to simultaneously ‘fit’ the source into the target. I went over how to do this (and why!) in the post on multiple parallel embeddings.
The tl;dr on ‘why’ is (a) you get more samples per hardware call by parallelizing in this way and (b) by using more physical qubits you average over certain sources of noise.
My process found P=343 simultaneous parallel embeddings of the three-vertex source graph into the D-Wave Advantage2.5 target graph, using a total of 343*3=1,029 physical qubits and the same number of physical couplers (all the other physical couplers on the chip are set to zero so that the different embeddings don’t ‘talk’ to each other).
This step can be done offline once per source graph, with the results stored.
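A rough way to do the packing is to use minorminer's find_embedding and greedily remove qubits as they get used. My actual process is described in the parallel-embeddings post; this sketch won't find an optimal packing, but it shows the idea:

```python
import minorminer

def find_parallel_embeddings(source, target):
    """Greedily pack non-overlapping embeddings of `source` into `target`
    by repeatedly embedding into whatever qubits remain. A sketch; the
    real packing can be done much more cleverly."""
    remaining = target.copy()
    embeddings = []
    while True:
        emb = minorminer.find_embedding(source, remaining)
        if not emb:
            break  # no further embedding fits in the leftover qubits
        embeddings.append(emb)
        used = {q for chain in emb.values() for q in chain}
        remaining.remove_nodes_from(used)
    return embeddings
```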
Offline Step 2: Find a Set of Automorphisms (Vertex Permutation Symmetries) A of E
Source graphs often have symmetries, like rotation or reflection. We can find these symmetries using the Python networkx graph package.
For the three-vertex complete source graph E, there are six of these: A = [{0: 0, 1: 1, 2: 2}, {0: 0, 2: 1, 1: 2}, {1: 0, 0: 1, 2: 2}, {1: 0, 2: 1, 0: 2}, {2: 0, 0: 1, 1: 2}, {2: 0, 1: 1, 0: 2}]. The first in the list is just the original graph; the second swaps vertices 1 and 2; the third swaps vertices 0 and 1; the fourth rotates the graph clockwise; the fifth rotates the graph counterclockwise; and the sixth swaps vertices 0 and 2.
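Here's how to get that list with networkx; an automorphism is just an isomorphism from a graph to itself, so we can use the GraphMatcher machinery (the iteration order of the six mappings may differ from the listing above):

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

E = nx.complete_graph(3)

# Every isomorphism from E to itself is an automorphism of E.
A = list(GraphMatcher(E, E).isomorphisms_iter())
print(len(A), A[0])  # 6 mappings for K3; list order may vary
```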
From the perspective of pure math, each of these graphs represents the same object. However, since the target graph is an actual physical thing, it will not return exactly the same answers for each of these six variants.
For this small graph, the number of these symmetries is small (just six!). In the general case, there could be an enormous number of them depending on the source graph. For example, the D-Wave Zephyr (1,4) graph (a subgraph of the hardware graph) has so many automorphisms that the process I use to find them ran out of RAM (128GB) without completing!
So in this step, we generate a set A of some (if possible all) of them, and store this set for later use.
The Online Algorithm
Once we have pre-computed P and A, we are ready to run the online solver algorithm.
The first thing I do is make a list of any fully disconnected vertices (vertices that have no non-zero couplers attached to them). The reason for this is that we already know exactly what the sampling statistics of these things are — 50% for each of the two states — so we don’t need to ask a quantum computer for the answer! Note that I still send the full graph (including the disconnected vertices) to hardware, but in a post-processing step I replace all the hardware samples with a probability 50% sample from +1/-1 for all the samples associated with these disconnected vertices.
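In code, the replacement step is a couple of lines, assuming the samples are stacked into a NumPy array with one ±1 column per source vertex:

```python
import numpy as np

def randomize_disconnected(samples, disconnected):
    """Overwrite the columns belonging to fully disconnected vertices
    with fair coin flips, since we know their statistics exactly."""
    rng = np.random.default_rng()
    for v in disconnected:
        samples[:, v] = rng.choice((-1, +1), size=samples.shape[0])
    return samples
```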
After this, we loop over four steps (two of them optional) M times.
Let’s look at these four steps.
Online Step 1: Randomly Select an Automorphism a From A; Embed Multiple Non-Overlapping Copies of a Into Hardware Using The Embeddings P
We pre-computed a set A of automorphisms; the first step inside the solver loop is to randomly select one, and then embed that choice P times into the hardware graph using our pre-computed embedding. In the image below I’ve tried to graphically depict this step, using a randomly chosen automorphism of E (the three-vertex graph).
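In code, embedding "a copy of a" is just a relabeling of which pre-computed chain each source vertex lands on. Here's a sketch (chip_scale_couplers is my placeholder name, reusing the 'j' attribute convention from earlier and assuming the 1:1 embeddings found above, where each chain is a single physical qubit):

```python
def chip_scale_couplers(source, a, embeddings):
    """Build the full-chip J dict for one automorphism `a`: one permuted
    copy of the source problem per pre-computed embedding. Assumes
    single-qubit chains and the 'j' edge attribute from earlier."""
    J = {}
    for emb in embeddings:
        qubit = {v: emb[a[v]][0] for v in a}  # source vertex -> physical qubit
        for u, v, data in source.edges(data=True):
            J[(qubit[u], qubit[v])] = data["j"]
    return J
```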
(Optional) Online Step 2: Randomly Select a Gauge Across All Qubits in the Chip
The Tangled Hamiltonian is the Ising Hamiltonian with all single-variable bias terms set to zero and the specific settings of the quadratic terms encoding the problem to solve. We can write the Ising Hamiltonian like

H(s) = Σⱼ hⱼ sⱼ + Σⱼ<ₖ Jⱼₖ sⱼ sₖ

where for Tangled all the hⱼ are zero.
A trick we can use here is to notice that if I relabel one of the spins (say the mᵗʰ spin) so that what I used to call up (sₘ=+1) becomes down (s′ₘ=-1) and what I used to call down (sₘ=-1) I now call up (s′ₘ=+1), and I flip the sign of all the J values connected to that spin (J′ₘₖ = -Jₘₖ and J′ₖₘ = -Jₖₘ for all k), then the Hamiltonian is unchanged. This kind of symmetry in the Ising Hamiltonian is called a gauge symmetry. We can use any gauge we like and nothing (at least from the perspective of the math) changes.
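If it's not obvious that the energy is unchanged, here's the one-line check for a single bond between spins m and k, using the primed notation above:

```latex
J'_{mk}\, s'_m\, s_k = (-J_{mk})(-s_m)\, s_k = J_{mk}\, s_m\, s_k
```

Every term touching spin m picks up two sign flips, which cancel.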
If we do a gauge transform, the values of J we send to hardware to evaluate will change, but we interpret the resultant answer by flipping the semantics of the relabeled spin.
Why would we do this? It's the same reason we did the automorphisms — changing the semantics of the spin labels averages over certain kinds of noise. There's a good write-up of this here if you'd like to dive deeper.
The reason I listed this step as optional is that it's effectively the same strategy as the automorphism step. The reason you might want gauge transforms instead of automorphisms is that a highly asymmetric source graph may have very few automorphisms (possibly only the identity), and if the source graph is large enough you also lose the averaging effect of multiple embeddings, whereas there are always guaranteed to be many gauge transforms. For example, if I had a highly asymmetric graph that was so big I could only fit it into hardware once, I would use multiple gauge transforms to average over noise.
For a V-vertex graph, there are 2ⱽ gauges. For our V=3 graph, that's only 8, but the exponential means that in general we won't be able to use them all and will have to sample some subset of them. The way we are going to apply the gauge transformations is to follow the prescription the D-Wave people use as their default — flip the qubit direction semantics with 50% probability for each qubit in a processor.
In our case, since we embed the same source graph into the target graph many times, each embedded copy ends up with each of the 2ⱽ possible gauges with probability 1/2ⱽ. For our tiny V=3 problem there are only 8 gauges to choose from, and each hardware call embeds the source into the target 343 times, so even a single gauge-transform step will usually sample all possible gauges. But as V gets bigger, the coverage of all gauges goes to zero. For example, a problem with V=48 has 2⁴⁸ ~ 10¹⁴ equivalent gauges, so we have to sample just some subset of these.
Here’s a depiction of the situation after we’ve ‘flipped’ each qubit in the processor with 50% probability, or equivalently randomly selected one of the above gauges for each of the 343 copies of the problem on the chip (following the cyan arrow to the cyan box).
(Optional) Online Step 3: Iteratively Shim The Linear Terms S Times to Get Close to <sⱼ>=0 For All j Across The Chip
OK so now we have a problem specification we’re ready to send to the hardware (the thing in the cyan box in the above image)! But hold the phone, there’s another type of noise we can try to mitigate. Let me describe the issue, and how to compensate for it.
In the cases of interest in Tangled, the Hamiltonian is symmetric under flipping the sign of all the spins. This means that for any state, there is another state with the same energy but with all the spins reversed. This in turn means that the average value of each qubit over many measurements <sⱼ> should be zero for all qubits. This is a powerful fact that doesn’t depend on the problem instance.
The idea of shimming (see also here) is to modify terms that D-Wave calls flux bias offsets on a qubit-by-qubit basis to try to compensate for systematic offsets. Flux bias offsets are related to but not the same as the hⱼ terms in the Hamiltonian (see the above link for an explanation of the difference). They provide a way to add or subtract biases to each of the qubits independently.
The bias shimming idea is simple. We run the problem in the cyan box in hardware, collect some samples, and compute <sⱼ> for all qubits j. We then update each flux bias offset as fⱼ ← fⱼ - α<sⱼ>, where α is typically around 1e-5. We apply these offsets to the problem specification and run the problem again, collecting new samples. We then compute <sⱼ> again, update the flux bias offsets again, and repeat the whole process for S iterations. By doing this we should iteratively shrink the absolute value of <sⱼ> and get closer and closer to <sⱼ>=0.
Note that this shimming process has to be redone each time a problem instance (cyan box) is ready to be sent to hardware, as the linear noise terms have complicated origins and change based on things like crosstalk.
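Here's a rough sketch of one shim loop. It assumes an Ocean DWaveSampler-like object whose flux_biases parameter takes a list with one entry per qubit on the chip; α and the per-iteration read count are knobs you'd tune:

```python
import numpy as np

def shim_flux_biases(sampler, h, J, S, alpha=1e-5, reads_per_iter=100):
    """Iteratively nudge per-qubit flux bias offsets toward <s_j> = 0.
    A sketch: assumes `sampler.properties["num_qubits"]` exists and that
    `flux_biases` is a list indexed by physical qubit number."""
    fb = [0.0] * sampler.properties["num_qubits"]
    for _ in range(S):
        result = sampler.sample_ising(h, J, num_reads=reads_per_iter,
                                      flux_biases=fb)
        rec = result.record
        # Weight by num_occurrences in case the answer mode aggregates reads.
        w = rec.num_occurrences / rec.num_occurrences.sum()
        means = (rec.sample * w[:, None]).sum(axis=0)  # <s_j> per variable
        for q, m in zip(result.variables, means):
            fb[q] -= alpha * m                         # f_j <- f_j - alpha*<s_j>
    return fb
```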
Online Step 4: Collect N Samples
We now collect N samples from hardware and store them.
Finally, Post-Process
We run through the above four steps M times, giving a total of M×N×P samples over the original problem (as an example, we could use M=2, N=4, P=343, giving 2,744 samples per problem instance run on hardware). From these samples, we can calculate the correlation matrix, influence vector, and the Tangled score associated with the problem instance.
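The correlation matrix estimate is one line; the influence vector and score are Tangled-specific and computed from it as described in earlier posts:

```python
import numpy as np

def correlation_matrix(samples):
    """Estimate C[j, k] = <s_j s_k> over all M*N*P samples; `samples`
    has one row per sample and one +/-1 column per source vertex."""
    return (samples.T @ samples) / samples.shape[0]
```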
Results From a Set of Test Runs
To test this algorithm, I’ll focus on two problem instances. Instance 1 is when all the qubits are ferromagnetically locked together, and instance 2 is where there are two ferromagnetic couplers and one antiferromagnetic coupler (see image below). I chose these because instance 1 is really easy to score (score is 0 as all vertices have the same influence) and instance 2 is one of the orneriest ones for this graph (score is -2/3 and is sensitive to sampling rates of all six ground states).
As an aside, these are also instances of the type of problem D-Wave uses in the shimming tutorial above — ferromagnetic loops, and ferromagnetic loops with one antiferromagnetic kink. The latter are kind of interesting quantum mechanically, because they have a lot of degenerate ground states.
Exact Correlation Matrices for Both
We can write down the correlation matrices for both of these by inspection. From them we can compute all the other stuff we need for Tangled (influence vectors, score, and therefore who won the game). Here they are:
Computing Score Statistics For Both Instances
My intention here is to run a bunch of experiments to test how different parts of the above algorithm affect the score returned by the quantum computer.
I fix the parameters {M, N, S} such that M×N is some fixed number (so the number of samples per run is the same — here I chose M×N=8, with the specific pairs (M=8, N=1), (M=4, N=2), (M=2, N=4), (M=1, N=8)), and run Algorithm 1 20 times for each choice of parameters, both with and without online step 2 (i.e., with or without the gauge transformation step). For the shim (online step 3), I chose S=1 and S=10. This gives a total of 16 runs with different parameters, with 20 scores per run.
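For bookkeeping, the sweep is just the Cartesian product of those choices:

```python
from itertools import product

# The 16 parameter settings: four (M, N) splits of M*N = 8, gauge
# transform on/off, and S in {1, 10}; each setting gets 20 runs.
settings = [{"M": M, "N": N, "gauge": g, "S": S}
            for (M, N), g, S in product([(8, 1), (4, 2), (2, 4), (1, 8)],
                                        (False, True), (1, 10))]
assert len(settings) == 16
```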
For each run, I compute the score and then plot the result as a histogram for each choice of parameters.
Note that the choice S=1 means no shimming, and M=1 means using only one automorphism. The case with no gauge transformation (G: False in the images; G: True means use the gauge transform), S=1, and M=1 is the 'naive' way to use the hardware (although still with parallel embeddings): just send one variant of the problem, embedded P times in the hardware graph, and YOLO it.
Conclusions For Using The Hardware For The Three-Vertex Graph
It appears, at least from the analysis of these two instances, that the best way to use the hardware here is M=1, no gauge transform, and no shim, making N as large as you can afford.
My best guess is that for this tiny graph, the massive number of parallel embeddings (P=343) is doing most of the noise averaging, and adding the rest of these mechanisms doesn't help much. I do think that taking more samples (larger N) should help, as the variance in scores between runs in the images above should go down as N goes up.