How I Adjudicate Tangled Terminal States Using D-Wave Hardware

D-Wave’s hardware is not easy to understand or use properly. I have trouble getting it to do what I want, and I have a PhD in physics and was the CTO of the company for more than a decade!

I feel for normal people trying to figure out how to use these things. But man they are so cool. Using them must be like using the early computers in the 1950s when no one knew what was going on. Exciting!

In this post, I want to show you what I did to get the system to do what I want.

Some Background

D-Wave processors can be thought of as physical graphs, where the vertices are qubits and the edges are couplers between pairs of qubits.

They are a type of neuromorphic processor, physically structured like a neural network, where the qubits play the role of neurons and the couplers play the role of connections between neurons.

They are also thermodynamic processors, in that they are designed to return samples from a probability distribution.

A typical D-Wave processor is a moderately large graph, with thousands of vertices and tens of thousands of edges. I’ll call the hardware graph the target graph.

Problems we want to solve using this type of system are usually also naturally defined on graphs. For example in Tangled, the problems are defined on the game graph. We call the graph defining the problem we want to solve the source graph.

In what follows here, I’ll use the very simple three-vertex source graph shown here. But the process I’m describing applies to any source graph in the same way.

In Tangled, this graph represents the game state [0,0,0,1,2,3] (no vertices chosen by either player; for the couplers, grey=1 is zero coupling, green=2 is ferromagnetic, and purple=3 is antiferromagnetic).

The Algorithm I Settled On For Using D-Wave Hardware

After a lot of poking and prodding, I settled on a recipe. I’ll start with the punchline and show you the algorithm itself; after that, I’ll explain each of the steps, including what they are and how they might affect solution quality.

Algorithm 1

Given: a source game graph E, a hardware target graph G, and integer parameters {M, N, S}:

Offline: find a set of P non-overlapping embeddings of E into G; find a set of automorphisms (vertex permutation symmetries) A of E

Online:

Store all fully disconnected vertices (if any) in E in a list L

Repeat M times:

  1. randomly select an automorphism a from A; embed multiple non-overlapping copies of a into hardware using the embeddings P

  2. (optional) randomly select a gauge across all qubits in the chip

  3. (optional) iteratively shim the linear terms S times to get close to <sⱼ>=0 for all j across the chip

  4. collect N samples

For each disconnected vertex in L, for all MxPxN samples associated with that vertex, replace the hardware samples with +1 or -1, each with 50% probability

Analyze all MxPxN samples, determine score
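The online loop above can be sketched in Python. This is a minimal sketch, not the Ocean SDK: `sample_hardware` is a stand-in for the real D-Wave sampler call, the embeddings are just placeholders, and the gauge and shim steps are omitted.

```python
import random

def sample_hardware(num_qubits, num_reads):
    """Stand-in for a hardware call: returns num_reads spin vectors of +/-1."""
    return [[random.choice((-1, 1)) for _ in range(num_qubits)]
            for _ in range(num_reads)]

def run_online(num_vertices, embeddings, automorphisms, M, N, disconnected):
    """Sketch of Algorithm 1's online loop (gauge and shim steps omitted)."""
    P = len(embeddings)
    all_samples = []
    for _ in range(M):
        a = random.choice(automorphisms)            # step 1: pick automorphism
        raw = sample_hardware(num_vertices * P, N)  # step 4: collect N reads
        for read in raw:
            for emb in range(P):
                # un-embed copy `emb`, undoing the automorphism relabeling
                chunk = read[emb * num_vertices:(emb + 1) * num_vertices]
                sample = [chunk[a[v]] for v in range(num_vertices)]
                # disconnected vertices: overwrite with a fair coin flip
                for v in disconnected:
                    sample[v] = random.choice((-1, 1))
                all_samples.append(sample)
    return all_samples  # M * P * N samples in total
```

Each call returns MxPxN post-processed samples, ready for the scoring step.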

Offline Step 1: Find a Set of P Non-Overlapping Embeddings of E into G

If the source graph E is much smaller than the target graph G, there will be many ways to simultaneously ‘fit’ the source into the target. I went over how to do this (and why!) in the post on multiple parallel embeddings.

The tl;dr on ‘why’ is (a) you get more samples per hardware call by parallelizing in this way and (b) by using more physical qubits you average over certain sources of noise.

My process found P=343 simultaneous parallel embeddings of the three-vertex source graph into the D-Wave Advantage2.5 target graph, using a total of 343*3=1,029 physical qubits and the same number of physical couplers (all the other physical couplers on the chip are set to zero so that the different embeddings don’t ‘talk’ to each other).

This step can be done offline once per source graph, with the results stored.

Offline Step 2: Find a Set of Automorphisms (Vertex Permutation Symmetries) A of E

Source graphs often have symmetries, like rotation or reflection. We can find these symmetries using the Python networkx graph package.

For the three-vertex complete source graph E, there are six of these: A = [{0: 0, 1: 1, 2: 2}, {0: 0, 2: 1, 1: 2}, {1: 0, 0: 1, 2: 2}, {1: 0, 2: 1, 0: 2}, {2: 0, 0: 1, 1: 2}, {2: 0, 1: 1, 0: 2}]. The first in the list is just the original graph; the second swaps vertices 1 and 2; the third swaps vertices 0 and 1; the fourth rotates the graph clockwise; the fifth rotates the graph counterclockwise; and the sixth swaps vertices 0 and 2.
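For a graph this small you can enumerate the automorphisms by brute force: a vertex permutation is an automorphism exactly when it maps every edge to an edge. A minimal stdlib sketch (for real source graphs you'd use something like networkx's isomorphism machinery instead, since brute force is factorial in the vertex count):

```python
from itertools import permutations

def automorphisms(vertices, edges):
    """Return all vertex permutations (as dicts) that map edges to edges."""
    edge_set = {frozenset(e) for e in edges}
    autos = []
    for perm in permutations(vertices):
        mapping = dict(zip(vertices, perm))
        if all(frozenset((mapping[u], mapping[v])) in edge_set
               for u, v in edges):
            autos.append(mapping)
    return autos

# The three-vertex complete graph: every permutation preserves the edges
A = automorphisms([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
```

For the complete graph this returns all six permutations; for a three-vertex path it would return only two (the identity and the end-swap).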

From the perspective of pure math, each of these graphs represents the same object. However since the target graph is an actual physical thing, it will not return the exact same answers for each of these six variants.

For this small graph, the number of these symmetries is small (just six!). In the general case, there could be an enormous number of them depending on the source graph. For example, the D-Wave Zephyr (1,4) graph (a subgraph of the hardware graph) has so many automorphisms that the process I use to find them ran out of RAM (128GB) without completing!

So in this step, we generate a set A of some (if possible all) of them, and store this set for later use.

The Online Algorithm

Once we have pre-computed P and A, we are ready to run the online solver algorithm.

The first thing I do is make a list of any fully disconnected vertices (vertices that have no non-zero couplers attached to them). The reason for this is that we already know exactly what the sampling statistics of these things are — 50% for each of the two states — so we don’t need to ask a quantum computer for the answer! Note that I still send the full graph (including the disconnected vertices) to hardware, but in a post-processing step I replace all the hardware samples with a probability 50% sample from +1/-1 for all the samples associated with these disconnected vertices.
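The disconnected-vertex post-processing step can be sketched in a few lines of numpy; the function name here is illustrative, not from any library:

```python
import numpy as np

def resample_disconnected(samples, disconnected, rng=None):
    """Overwrite columns for fully disconnected vertices with fair +/-1 flips,
    since their exact statistics (50/50) are known without asking hardware."""
    rng = rng or np.random.default_rng()
    out = np.array(samples, copy=True)
    for v in disconnected:
        out[:, v] = rng.choice([-1, 1], size=out.shape[0])
    return out
```

The samples for connected vertices pass through untouched; only the known-trivial columns are replaced.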

After this, we loop over four steps (two of them optional) M times.

Let’s look at these four steps.

Online Step 1: Randomly Select an Automorphism a From A; Embed Multiple Non-Overlapping Copies of a Into Hardware Using The Embeddings P

We pre-computed a set A of automorphisms; the first step inside the solver loop is to randomly select one, and then embed that choice P times into the hardware graph using our pre-computed embedding. In the image below I’ve tried to graphically depict this step, using a randomly chosen automorphism of E (the three-vertex graph).

This depicts a randomly chosen automorphism of the problem instance we are solving (in this case the vertex permutation {1:0, 0:1, 2:2}) embedded P times into hardware.

(Optional) Online Step 2: Randomly Select a Gauge Across All Qubits in the Chip

The Tangled Hamiltonian is the Ising Hamiltonian with all single-variable bias terms set to zero and the specific settings of the quadratic terms encoding the problem to solve. We can write the Ising Hamiltonian as

E(s) = Σⱼ hⱼsⱼ + Σⱼ<ₖ Jⱼₖsⱼsₖ

where for Tangled all the hⱼ are zero.

A trick we can use here is to notice the following: if I relabel one of the spins (say the mᵗʰ spin) so that what I used to call up (sₘ=+1) becomes down (s’ₘ=-1) and what I used to call down (sₘ=-1) becomes up (s’ₘ=+1), and I flip the sign of all the J values connected to that spin (J’ₘₖ = -Jₘₖ and J’ₖₘ = -Jₖₘ for all k), then the new Hamiltonian is unchanged. This kind of symmetry in the Ising Hamiltonian is called a gauge symmetry. We can use any gauge we like and nothing (at least from the perspective of the math) changes.

If we do a gauge transform, the values of J we send to hardware to evaluate will change, but we interpret the resultant answer by flipping the semantics of the relabeled spin.
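It's easy to check this invariance numerically. In this sketch, the couplings J and spin state s are arbitrary example values; flipping a random subset of spins and negating every coupling that touches a flipped spin leaves the Ising energy of the relabeled state unchanged:

```python
import random

def energy(J, s):
    """Ising energy with zero linear terms: sum of J_jk * s_j * s_k."""
    return sum(Jjk * s[j] * s[k] for (j, k), Jjk in J.items())

J = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): -1.0}    # example couplings
s = [1, -1, 1]                                   # an arbitrary spin state

g = [random.choice((-1, 1)) for _ in range(3)]   # random gauge, g_j = +/-1
J_gauged = {(j, k): g[j] * g[k] * Jjk for (j, k), Jjk in J.items()}
s_gauged = [g[j] * s[j] for j in range(3)]
# energy(J, s) == energy(J_gauged, s_gauged) for every gauge g
```

Since each J picks up a factor g_j·g_k and each spin a factor g_j, every term in the energy picks up g_j²·g_k² = 1.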

Why would we do this? It’s the same reason we did the automorphisms — changing the semantics of the spin labels averages over certain kinds of noise. There’s a good write-up of this here if you’d like to dive deeper into this.

The reason I listed this step as optional is that it’s effectively the same strategy as the automorphism step. Why might you use gauge transforms instead of automorphisms? A highly asymmetric source graph may have very few automorphisms (possibly only the identity), and if the source graph is large enough, you might also lose the averaging effect from multiple embeddings, whereas there are always guaranteed to be many gauge transforms. For example, if I had a highly asymmetric graph so big that I could only fit it into hardware once, I would use multiple gauge transforms to average over noise.

For a V vertex graph, there are 2ⱽ gauges. For our V=3 graph, that’s only 8, but the exponential here means that in general we won’t be able to use them all, and we’ll have to only use some subset of them. The way we are going to apply the gauge transformations is to follow the prescription the D-Wave people use as their default — flip the qubit direction semantics with 50% probability for each qubit in a processor.

In our case, since we use multiple embeddings of the same source graph into the target graph, this means that each embedded source graph gets one of the possible gauges with probability 1/2ⱽ. For our tiny V=3 problem, there are only 8 gauges to choose from and each hardware call embeds the source into the target 343 times so even one gauge transform will usually sample over all possible gauges. But as V gets bigger, the ‘coverage’ over all gauges will go to zero. For example, a problem with V=48 has 2⁴⁸ ~ 10¹⁴ equivalent gauges, so we have to just sample over some subset of these.

For our V=3 source graph, there are 2³=8 spin reversal gauges, shown here for the automorphism {1:0, 0:1, 2:2}. Black circles mean we ‘flipped’ what +1 and -1 mean for that vertex, and to preserve the symmetry we have to flip the coupler colors for all couplers connected to that vertex. You can see that for this problem instance there are only four different settings of couplers that are sent to hardware, but in general this count grows exponentially with coupler count. All of these are mathematically the same problem instance.

Here’s a depiction of the situation after we’ve ‘flipped’ each qubit in the processor with 50% probability, or equivalently randomly selected one of the above gauges for each of the 343 copies of the problem on the chip (following the cyan arrow to the cyan box).

After we’ve randomly ‘flipped’ each qubit with 50% probability and changed the J values appropriately, we have the situation in the cyan box where P copies of the original automorphism of the problem instance are each turned into one of the possible gauge transformed equivalent versions. Remember that each of these copies is the same problem instance mathematically, but physically they are different because of noise and analog stuff going on in the chip.

(Optional) Online Step 3: Iteratively Shim The Linear Terms S Times to Get Close to <sⱼ>=0 For All j Across The Chip

OK so now we have a problem specification we’re ready to send to the hardware (the thing in the cyan box in the above image)! But hold the phone, there’s another type of noise we can try to mitigate. Let me describe the issue, and how to compensate for it.

In the cases of interest in Tangled, the Hamiltonian is symmetric under flipping the sign of all the spins. This means that for any state, there is another state with the same energy but with all the spins reversed. This in turn means that the average value of each qubit over many measurements <sⱼ> should be zero for all qubits. This is a powerful fact that doesn’t depend on the problem instance.

The idea of shimming (see also here) is to modify terms that D-Wave calls flux bias offsets on a qubit-by-qubit basis to try to compensate for systematic offsets. Flux bias offsets are related to but not the same as the hⱼ terms in the Hamiltonian (see the above link for an explanation of the difference). They provide a way to add or subtract biases to each of the qubits independently.

The bias shimming idea is simple. We run the problem in the cyan box in hardware, collect some samples, and compute <sⱼ> for all qubits j. We then compute flux bias offsets fⱼ = fⱼ - α<sⱼ>, where α is typically around 1e-5. We then apply these offsets to the problem specification and run the problem again, collecting new samples. We then compute <sⱼ> again, recompute the flux bias offsets, and repeat this whole process for S iterations. By doing this we should iteratively shrink the absolute value of <sⱼ> and get closer and closer to <sⱼ>=0.
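Here's a toy model of the shim loop. I'm pretending each qubit j has a hidden systematic magnetization offset bⱼ and that the measured magnetization responds linearly to the applied flux bias offset fⱼ; on real hardware the response and the step size (α ~ 1e-5, in flux units) are very different, so the numbers here are purely illustrative:

```python
b = [0.08, -0.05, 0.02]      # hidden systematic offsets per qubit (toy values)
f = [0.0, 0.0, 0.0]          # flux bias offsets, updated by the shim
alpha = 0.5                  # step size for this toy model only
S = 40                       # number of shim iterations

def magnetizations(b, f):
    """Toy response model: measured <s_j> = b_j + f_j."""
    return [bj + fj for bj, fj in zip(b, f)]

history = []
for _ in range(S):
    m = magnetizations(b, f)                        # "run and measure <s_j>"
    history.append(sum(abs(mj) for mj in m))
    f = [fj - alpha * mj for fj, mj in zip(f, m)]   # f_j <- f_j - alpha*<s_j>
```

In this linear model each magnetization shrinks by a factor (1-α) per iteration, so the summed |⟨sⱼ⟩| decays geometrically toward zero; the real hardware behavior is noisier, but the feedback structure is the same.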

Note that this shimming process has to be redone each time a problem instance (cyan box) is ready to be sent to hardware, as the linear noise terms have complicated origins and change based on things like crosstalk.

This is an example of the effect of shimming on the magnetizations of the qubits, obtained during a run on Instance 2 (defined below). Orange shows the sum of the absolute values of all the qubit magnetizations (normalized to 1) for ten runs without shimming, and blue shows the same quantity for the last 10 runs after shimming for S=40 iterations. There’s a clear positive effect in balancing the qubits from doing the shim. Unfortunately, shimming is computationally costly, increasing the wallclock time by a factor roughly proportional to S.

Online Step 4: Collect N Samples

We now collect N samples from hardware and store them.

Finally, Post-Process

We run through the above four steps M times, giving a total of MxNxP samples over the original problem (as an example, we could use M=2, N=4, P=343 giving 2,744 samples per problem instance run on hardware). From these samples, we can calculate the correlation matrix, influence vector, and the Tangled score associated with the problem instance.

Results From a Set of Test Runs

To test this algorithm, I’ll focus on two problem instances. Instance 1 is when all the qubits are ferromagnetically locked together, and Instance 2 is where there are two ferromagnetic couplers and one antiferromagnetic coupler (see image below). I chose these because Instance 1 is really easy to score (the score is 0, as all vertices have the same influence) and Instance 2 is one of the orneriest ones for this graph (the score is -2/3 and is sensitive to the sampling rates of all six ground states).

As an aside, these instances are also instances of the type of problem D-Wave exposes in the shimming tutorial above — ferromagnetic loops, and ferromagnetic loops with one anti-ferromagnetic kink. The latter are kind of interesting quantum mechanically because they have a lot of degenerate ground states.

Instance 1 has two ground states ((1,1,1) and (-1,-1,-1)). Instance 2 has six ground states (all possible states except (1,1,-1) and (-1,-1,1)). For both, player 1 (red) selects vertex 1 and player 2 (blue) selects vertex 2.

Exact Correlation Matrices for Both

We can just write down by inspection what the correlation matrices are for both of these. From these we can compute all the other stuff we need for Tangled (influence vectors, score, and therefore who won the game). These are:

For instance 1, the correlation between any two spins is always 1 because they are all locked in the same direction. The influence vector is (2,2,2), and the score is 2-2 = 0.

For instance 2, it’s a little harder to see, but if you assume equal occupation of all six ground states this is what the correlation matrix looks like. The influence vector is (0,0,2/3) and the score is 0 - 2/3 = -2/3.
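These exact numbers can be checked by averaging over the degenerate ground states. In this sketch I'm reading the definitions off the two worked examples in the text: the influence of vertex j is the sum of its correlations with the other vertices, and the score is influence(red) minus influence(blue).

```python
import numpy as np

def stats(states, red, blue):
    """Correlation matrix, influence vector, and score from equally
    weighted states (definitions inferred from the worked examples)."""
    S = np.array(states, dtype=float)
    C = (S.T @ S) / len(states)              # correlation matrix <s_j s_k>
    influence = C.sum(axis=1) - np.diag(C)   # off-diagonal row sums
    return C, influence, influence[red] - influence[blue]

# Instance 1: two ferromagnetic ground states
_, inf1, score1 = stats([(1, 1, 1), (-1, -1, -1)], red=1, blue=2)

# Instance 2: six ground states (all states except (1,1,-1) and (-1,-1,1))
gs2 = [(1, 1, 1), (1, -1, 1), (1, -1, -1),
       (-1, 1, 1), (-1, 1, -1), (-1, -1, -1)]
_, inf2, score2 = stats(gs2, red=1, blue=2)
```

This reproduces influence (2,2,2) and score 0 for Instance 1, and influence (0,0,2/3) and score -2/3 for Instance 2, matching the values in the text.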

Computing Score Statistics For Both Instances

My intention here is to run a bunch of experiments to test how different parts of the above algorithm affect the score returned by the quantum computer.

I fix values of the parameters {M, N, S} such that MxN is some fixed number (so the number of samples per run is the same — here I chose MxN=8, with specific pairs (M=8, N=1), (M=4, N=2), (M=2, N=4), (M=1, N=8)), and run Algorithm 1 20 times for each choice of parameters, both with and without online step 2 (i.e. with or without the gauge transformation step). I chose S=1 and S=10. This gives a total of 16 runs with different parameters, with 20 scores per run.

For each run, I compute the score and then plot the result as a histogram for each choice of parameters.

Note that the choice S=1 means no shimming. M=1 means only using one automorphism. The case with no gauge transformation (G: False in the images, G: True means use the gauge transform), S=1, and M=1 is the ‘naive’ way to use the hardware (although still using parallel embeddings) by just sending one variant of the problem embedded P times in the hardware graph and YOLO’ing it.

Scores for 20 runs for each image / choice of parameters for Instance 1. The green dashed line shows the exact score (0). The obvious weird thing here is that with no shimming and no gauge transforms the scores are all perfect, whereas for all the others there are small deviations (note the score x-axis is fairly blown up here). I’m not sure why this is… but it seems to be a real effect. One potential cause is that the instances with gauge transforms introduce anti-ferromagnetic couplers (recall Instance 1 has all ferromagnetic couplers), which could be weaker in practice than the ferromagnetic ones by a tiny bit. Also, the shimming process could potentially bias qubits away from balancing a small number of times (I’m only doing 10 shim steps so this could happen). But it is a bit disturbing, might have to keep this in mind later on if something weird happens. It looks like all the score distributions are balanced around the exact score.

Same as the previous image, but for Instance 2 where the exact score is -2/3. The width of the score distributions is about an order of magnitude larger, and the weird perfection for no shim no gauge transform goes away (this is consistent with the anti-ferromagnetic couplers potentially being worse than the ferromagnetic ones; in this Instance there are anti-ferromagnetic couplers around all the time). While the sample size here isn’t huge, eyeballing this it looks like none of my fancy stuff helped much in this case. The naive use of the hardware performs the same as all of the others. For these very small instances that get embedded in many places in the chip, the noise cancellation from just the embedding is probably doing most of the work, and the other stuff doesn’t add much. This could change for larger instances! Also again just eyeballing it, but each of these distributions looks balanced around the exact score (averaging over all 20 runs looks very close to the exact score for all of these parameter choices). It could be that the spread in runs is mostly due to the extremely low values of N I used here.

Conclusions For Use For The Three-Vertex Graph

It appears, at least from the analysis of these two instances, that the best way to use the hardware here is to use M=1, no gauge transform, and no shim, and make N as large as you can afford.

My best guess is that for this tiny graph, the massive number of parallel embeddings (P=343) is doing most of the averaging over noise part, and adding the rest of these mechanisms doesn’t help much. I think though that taking more samples (N larger) should help as the variance in scores between runs in the images above should go down as N goes up.
