## Solving Quadratic Equations with XL on Parallel Architectures

### Introduction

Multivariate quadratic systems can be solved using the XL algorithm: the system is extended by multiplying each equation with all monomials up to a certain degree $$D$$; then the resulting system is linearized by treating each monomial as an individual variable. This gives a large, sparse linear system with coefficient matrix $$B$$, which is then solved; if $$D$$ is chosen large enough, the kernel of $$B$$ becomes small enough that a solution of the original quadratic system can be read off.
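As a toy illustration (not an example from the paper), take a single equation $$f(x, y) = x^2 + xy + y + 1 = 0$$ over $$\mathbb{F}_{31}$$ and $$D = 3$$: XL adds the equations $$x \cdot f = x^3 + x^2 y + xy + x = 0$$ and $$y \cdot f = x^2 y + x y^2 + y^2 + y = 0$$, and then treats each of the monomials $$x^3, x^2 y, x y^2, x^2, xy, y^2, x, y$$ (and the constant) as an independent variable, so the three equations become one sparse linear system in nine unknowns.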

We are using Coppersmith's block variant of the Wiedemann algorithm (block Wiedemann: BW) for solving the sparse linear system. In the first step of BW, a sequence of matrices $$\{a^{(i)}\}$$ with $$a^{(i)} = x B^{i+1} z$$ is computed for some matrices $$x$$ and $$z$$. In the second step, the block Berlekamp-Massey algorithm is used to compute the minimal polynomial $$\lambda$$ of the sequence. Finally, the polynomial is evaluated at $$B$$, giving a solution $$s$$ such that $$Bs = 0$$.
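To sketch the underlying reasoning (in scalar terms; the block-matrix bookkeeping in the paper is more involved): Berlekamp-Massey yields coefficients $$\lambda_j$$ such that $$\sum_j \lambda_j a^{(i+j)} = 0$$ for all $$i$$, i.e., $$x B^i \left(\sum_j \lambda_j B^{j+1} z\right) = 0$$. Heuristically, for suitably chosen $$x$$ this forces $$\sum_j \lambda_j B^{j+1} z = 0$$, so $$s = \sum_j \lambda_j B^{j} z$$ satisfies $$Bs = 0$$ and is a kernel vector whenever it is nonzero.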

We describe this project and the implementation in detail in our paper "Solving Quadratic Equations with XL on Parallel Architectures" and in Chapter 4 of my thesis.

We used this code to solve Type III systems of the Fukuoka MQ Challenge. Here is more information on solving the challenges.

### Source Code

The field and the size of the system must be defined at compile time via the defines Q, M, and N in the Makefile. Currently, there are implementations for $$\mathbb{F}_2$$, $$\mathbb{F}_{16}$$, and $$\mathbb{F}_{31}$$; the field-arithmetic code is in the gf/ directory. After compiling, you can run the program using:

./xl --all
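The defines presumably correspond to the field size (Q), the number of equations (M), and the number of variables (N), and must match the system you want to solve. As a purely illustrative sketch (assuming the defines can be overridden on the make command line; otherwise edit them in the Makefile directly), building for 32 equations in 30 variables over $$\mathbb{F}_{16}$$ might look like:

make Q=16 M=32 N=30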

The code generates a random system, prints the expected solution, and then runs XL to solve it. A Fukuoka challenge can be loaded using

./xl --challenge FILE --all

The top (comment) lines of the challenge file must be removed before loading it.
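For example, assuming the header of a downloaded challenge is separated from the coefficient rows by a line of asterisks, it could be stripped with something like (file names are placeholders):

sed '1,/^\*\*\*/d' challenge.txt > challenge.in
./xl --challenge challenge.in --all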

By default, the code compiles with OpenMP; in addition, there are several MPI implementations that you can choose between in the Makefile:

• The most useful one is probably "BW1_two_blocks", which splits the workload of the first step of BW into two blocks in order to communicate while computing (see the sketch after this list).
• Background communication works best if you have InfiniBand. With "BW1_two_blocks_ibv", the InfiniBand cards are programmed directly through the IB Verbs API, reducing overhead and providing 'real' background data transfer without CPU involvement.
• "BW1_one_block" uses only one block and works well if there is no DMA-capable network hardware (i.e., there is no real background send anyway).
• "BW1_mpi_size_blocks" scales very well for the first step of BW, but because the block size grows with the number of MPI nodes, the runtime of Berlekamp-Massey (the second step of BW) goes up.

### Tweaking BW3

The code implements an unpublished idea, not described in the paper, that improves the runtime of the last step of BW: usually, when solving a large linear system, one is interested in the values of all variables. For XL, however, we only need a solution for the original system (before it was extended and linearized). Therefore, the third step does not have to repeat the computations of the first BW step; instead, the relevant part of the solution can be computed directly from the sequence obtained in the first step (using a tweak in how the sequence is computed). This reduces the cost of the last step to a marginal amount (e.g., 3 minutes out of 50 days in the Type III n=35 case).
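One way to see why this can work (a sketch of the generic reasoning; the actual tweak in the implementation may differ in detail): the third BW step normally computes $$s = \sum_j \lambda_j B^{j} z$$, which requires re-applying $$B$$ much like in the first step. If $$x$$ is chosen to project onto exactly the rows corresponding to the monomials of the original system, then the entries of $$s$$ that we need are $$x s = \sum_j \lambda_j \, x B^{j} z$$, and the vectors $$x B^{j} z$$ are (apart from the index shift and the cheap extra term $$xz$$) exactly the sequence elements $$a^{(j-1)}$$ that were already stored during the first step; assembling them is only a short linear combination instead of another pass over $$B$$.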