The Quantum Exact Simulation Toolkit v4.0.0
🚀  Launching

Launching your compiled QuEST application can be as straightforward as running any other executable, though some additional steps are needed to make use of hardware acceleration. This page explains how to launch your own QuEST applications on different platforms, how to run the examples and unit tests, how to make use of multithreading, GPU-acceleration, distribution and supercomputer job schedulers, and how to monitor hardware utilisation.


Note
This page assumes you are working in a build directory into which all executables have been compiled.

Examples

See compile.md for instructions on compiling the examples.

The example source codes are located in examples/ and are divided into subdirectories, e.g.

examples/
krausmaps/
initialisation.c
initialisation.cpp
reporters/
env.c
env.cpp
matrices.c
matrices.cpp
...

where file.c and file.cpp respectively demonstrate QuEST's C11 and C++14 interfaces. These files are compiled into executables of the same name, respectively prefixed with c_ or cpp_, and saved in subdirectories of build which mimic the structure of examples/. E.g.

build/
examples/
krausmaps/
c_initialisation
cpp_initialisation
reporters/
c_env
cpp_env
c_matrices
cpp_matrices
...

Most of these executables can be run directly from within build, e.g.

./examples/reporters/cpp_paulis

while others require command-line arguments:

./examples/reporters/c_env
# output
Must pass single cmd-line argument:
1 = serial
2 = multithreaded
3 = GPU-accelerated
4 = distributed
5 = all
6 = auto

Tests

See compile.md for instructions on compiling the v4 and v3 unit tests.

v4

QuEST's unit and integration tests are compiled into an executable called tests within the tests/ subdirectory, and can be run directly from within the build folder via

./tests/tests

which should, after some time, output something like

QuEST execution environment:
precision: 2
multithreaded: 1
distributed: 1
GPU-accelerated: 1
cuQuantum: 1
num nodes: 16
num qubits: 6
num qubit perms: 10
Tested Qureg deployments:
GPU + MPI
Randomness seeded to: 144665856
===============================================================================
All tests passed (74214 assertions in 240 test cases)

The tests binary accepts all of the Catch2 CLI arguments, for example to run specific tests

./tests/tests applyHadamard

or all tests within certain groups

./tests/tests "[qureg],[matrices]"

or specific test sections and subsections:

./tests/tests -c "validation" -c "matrix uninitialised"

If the tests were compiled with distribution enabled, they can be distributed via

mpirun -np 8 ./tests/tests

Alternatively, the tests can be run through CTest within the build directory via either

ctest
make test

which will log each passing test live, outputting something like

Test project /build
Start 1: calcExpecPauliStr
1/240 Test #1: calcExpecPauliStr ............................. Passed 14.03 sec
Start 2: calcExpecPauliStrSum
2/240 Test #2: calcExpecPauliStrSum .......................... Passed 10.06 sec
Start 3: calcExpecNonHermitianPauliStrSum
3/240 Test #3: calcExpecNonHermitianPauliStrSum .............. Passed 10.34 sec
Start 4: calcProbOfBasisState
4/240 Test #4: calcProbOfBasisState .......................... Passed 0.33 sec
Start 5: calcProbOfQubitOutcome
5/240 Test #5: calcProbOfQubitOutcome ........................ Passed 0.12 sec
Start 6: calcProbOfMultiQubitOutcome
6/240 Test #6: calcProbOfMultiQubitOutcome ................... Passed 15.07 sec
Start 7: calcProbsOfAllMultiQubitOutcomes
...

Alas, tests launched in this way cannot be deployed with distribution.

v3

The deprecated tests, when compiled, can be run from the build directory via

./tests/deprecated/dep_tests

which accepts the same Catch2 CLI arguments as the v4 tests above, and can be distributed the same way.

To launch the tests with CTest, run

cd tests/deprecated
ctest
Attention
The deprecated unit tests are non-comprehensive and the deprecated API should not be relied upon, for it may introduce undetected corner-case bugs. Please use the deprecated API and tests only to assist in porting your application from QuEST v3 to v4.

Multithreading

Note
Parallelising QuEST over multiple cores and CPUs requires first compiling with multithreading enabled, as detailed in compile.md.

Choosing threads

The number of threads to use is chosen before launching the compiled executable, via the OMP_NUM_THREADS environment variable, set either inline or exported beforehand:

OMP_NUM_THREADS=32 ./myexec
export OMP_NUM_THREADS=32
./myexec

It is prudent to choose as many threads as your CPU(s) have total hardware threads or cores, which need not be a power of 2. One can view this, and verify the number of available threads at runtime, by calling reportQuESTEnv(), which outputs a subsection such as

[cpu]
numCpuCores.......10 per machine
numOmpProcs.......10 per machine
numOmpThrds.......32 per node
Note
When running distributed, variable OMP_NUM_THREADS specifies the number of threads per node and so should ordinarily be the number of hardware threads (or cores) per machine.

Monitoring utilisation

The availability of multithreaded deployment can also be checked at runtime using reportQuESTEnv(), which outputs something like:

[compilation]
isOmpCompiled...........1
[deployment]
isOmpEnabled............1

where Omp signifies OpenMP, and the two 1s respectively indicate that multithreading has been compiled and is enabled at runtime.
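
For reference, such a report can be produced from within your own application before any Qureg is created. Below is a minimal sketch which assumes QuEST v4's quest.h header and its initQuESTEnv()/finalizeQuESTEnv() setup routines; adapt the names to your installed API.

#include "quest.h"

int main() {
    // prepare the QuEST environment (threads, GPUs, MPI) before any Qureg is created
    initQuESTEnv();

    // print compilation, deployment and hardware details, including the [cpu] subsection above
    reportQuESTEnv();

    finalizeQuESTEnv();
    return 0;
}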

Like that of any program, the CPU utilisation of a running QuEST application can be viewed using:

OS      | Program          | Method
Linux   | htop             | Run htop in a terminal
macOS   | Activity Monitor | Place on dock > right click icon > Monitors > Show CPU usage (see here)
Windows | Task Manager     | Performance > CPU > right click graph > Change graph to > Logical processors (see here)

Note however that QuEST will not always leverage multithreading at runtime; for example, createQureg() automatically disables multithreading for Qureg which are too small to benefit from parallelisation.

Usage of multithreading can be (inadvisably) forced using createForcedQureg() or createCustomQureg().
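
Forced deployment might look like the sketch below, which assumes createForcedQureg() accepts only the qubit count and destroyQureg() accepts only the Qureg; consult your installed API documentation before relying on these signatures.

#include "quest.h"

int main() {
    initQuESTEnv();

    // create a 10-qubit statevector which forcibly uses every deployment enabled
    // at compile time, even when automatic deployment would deem it unhelpful
    Qureg qureg = createForcedQureg(10);

    // ... operate upon qureg as normal ...

    destroyQureg(qureg);
    finalizeQuESTEnv();
    return 0;
}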

Improving performance

Performance may be improved by setting other OpenMP variables. Keep in mind that for large Qureg, QuEST's runtime is dominated by the costs of modifying large memory structures during long, uninterrupted loops: namely the updating of statevector amplitudes. Some sensible settings include

  • OMP_DYNAMIC=false to disable the costly runtime migration of threads between cores.
  • OMP_PROC_BIND=spread to attempt to give threads their own caches (see here).

    Replace this with KMP_AFFINITY on Intel compilers.

  • OMP_PLACES=threads to pin each spawned thread to a CPU hardware thread. Alternatively, set OMP_PLACES=cores to assign one thread per core, helpful when hardware threads interfere (e.g. due to caching conflicts).

OpenMP experts may further benefit from knowing that QuEST's multithreaded source code, confined to cpu_subroutines.cpp, is almost exclusively code similar to

#pragma omp parallel for if(qureg.isMultithreaded)
for (qindex n=0; n<numIts; n++)
    // ... modify amplitude n ...

#pragma omp parallel for reduction(+:val)
for (qindex n=0; n<numIts; n++)
    val += // ... contribution from amplitude n

and never specifies a schedule nor invokes the setters of the OpenMP runtime library. As such, all threading behaviour can be controlled externally through the standard OpenMP environment variables.

Remarks
Sometimes the memory bandwidth between different sockets of a machine is poor, and it is substantially better to exchange memory in bulk between their NUMA nodes rather than through repeated random access. In such settings, it can be worthwhile to hybridise multithreading and distribution even upon a single machine, partitioning same-socket threads into their own MPI node. This forces inter-socket communication to happen in batch, via message-passing, at the expense of using double the total memory (to store communication buffers). See the distributed section.

GPU-acceleration

Note
Using GPU-acceleration requires first compiling QuEST with CUDA or HIP enabled (to utilise NVIDIA and AMD GPUs respectively) as detailed in compile.md.

Launching

The compiled executable is launched like any other, via

./myexec

Using multiple GPUs, whether they are local to a single machine or spread across several, is done by additionally enabling distribution.

Monitoring

To check at runtime whether GPU-acceleration was compiled and is being actively utilised, call reportQuESTEnv(). This will display a subsection like

[compilation]
isGpuCompiled...........1
[deployment]
isGpuEnabled............1

where the 1s respectively indicate that GPU-acceleration was compiled and is available at runtime (i.e. QuEST has found suitable GPUs). When this is the case, another section will be displayed detailing the discovered hardware properties, e.g.

[gpu]
numGpus...........2
gpuDirect.........0
gpuMemPools.......1
gpuMemory.........15.9 GiB per gpu
gpuMemoryFree.....15.6 GiB per gpu
gpuCache..........0 bytes per gpu

Utilisation can also be externally monitored using third-party tools:

GPU    | Type     | Name
NVIDIA | CLI      | nvidia-smi
NVIDIA | GUI      | Nsight
AMD    | CLI, GUI | amdgpu_top

Note however that GPU-acceleration might not be leveraged at runtime; for example, createQureg() automatically disables GPU-acceleration for Qureg which are too small to outweigh the overheads of GPU dispatch.

Usage of GPU-acceleration can be (inadvisably) forced using createForcedQureg() or createCustomQureg().

Configuring

There are a plethora of environment variables which can be used to control execution on NVIDIA and AMD GPUs. We highlight only some below.

  • Choose which of the available GPUs QuEST is permitted to utilise via CUDA_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES.
  • Alternatively, set the order of selected GPUs (CUDA_DEVICE_ORDER) to FASTEST_FIRST or PCI_BUS_ID.
    • In single-GPU mode, this informs which GPU QuEST will use (i.e. the first).
    • In multi-GPU mode, this informs which local GPUs are used.

Benchmarking

Beware that the CPU dispatches tasks to the GPU asynchronously. Control flow returns immediately to the CPU, which proceeds to other duties (like dispatching the next several quantum operations' worth of instructions to the GPU) while the GPU undergoes independent computation (goes brrrrr). This has no consequence for users of the QuEST API, which automatically synchronises the CPU and GPU when necessary (such as inside functions like calcTotalProb()).

However, it does mean that code which seeks to benchmark QuEST must take care to wait for the GPU to be ready before starting the stopwatch, and to wait for the GPU to finish before stopping it. This can be done with syncQuESTEnv(), which incidentally also ensures nodes are synchronised when distributed.
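
An illustrative benchmarking sketch is below, assuming the v4 quest.h header, the initQuESTEnv()/finalizeQuESTEnv() setup routines, a single-target applyHadamard(), and C11's timespec_get(); substitute your own circuit and preferred timer.

#include "quest.h"
#include <stdio.h>
#include <time.h>

int main() {
    initQuESTEnv();

    int numQubits = 20;
    Qureg qureg = createQureg(numQubits);

    // wait for the GPU (and any distributed nodes) to finish prior work
    syncQuESTEnv();
    struct timespec start, stop;
    timespec_get(&start, TIME_UTC);

    // the circuit being benchmarked; these calls may return before the GPU finishes
    for (int q = 0; q < numQubits; q++)
        applyHadamard(qureg, q);

    // wait for all asynchronously dispatched GPU work to complete
    syncQuESTEnv();
    timespec_get(&stop, TIME_UTC);

    double secs = (stop.tv_sec - start.tv_sec) + (stop.tv_nsec - start.tv_nsec) / 1e9;
    printf("circuit took %g s\n", secs);

    destroyQureg(qureg);
    finalizeQuESTEnv();
    return 0;
}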


Distribution

Note
Distributing QuEST over multiple machines requires first compiling with distribution enabled, as detailed in compile.md.
Important
Simultaneously using distribution and GPU-acceleration introduces additional considerations, detailed in the subsequent Multi-GPU section.

Launching

A distributed QuEST executable called myexec can be launched and distributed over (e.g.) 32 nodes using mpirun:

mpirun -np 32 ./myexec

or on some platforms (such as with Intel and Microsoft MPI):

mpiexec -n 32 myexec.exe

Some supercomputing facilities however may require custom or additional commands, such as SLURM's srun. See an excellent guide here, and the job submission examples below.

srun --nodes=8 --ntasks-per-node=4 --distribution=block:block ./myexec
Important
QuEST can only be distributed with a power of 2 number of nodes, i.e. 1, 2, 4, 8, 16, ...
Note
When multithreading is also enabled, the environment variable OMP_NUM_THREADS will determine how many threads are used by each node (i.e. each MPI process). Ergo optimally deploying to 8 machines, each with 64 CPUs (a total of 512 CPUs), might resemble:
OMP_NUM_THREADS=64 mpirun -np 8 ./myexec

It is sometimes convenient (mostly for testing) to deploy QuEST across more nodes than there are available machines and sockets, inducing a gratuitous slowdown. Some MPI implementations, like OpenMPI, forbid this by default and require an additional flag to permit oversubscription.

mpirun -np 1024 --oversubscribe ./mytests

Configuring

TODO:

  • detail environment variables

Benchmarking

QuEST strives to reduce inter-node communication when performing distributed simulation, which can otherwise dominate runtime. Between these rare communications, nodes work in complete independence and are likely to desynchronise, especially when performing operations with non-uniform loads. In fact, many-controlled quantum gates are skipped by non-participating nodes which would otherwise wait idly!

Nodes will only synchronise when forced by the user (with syncQuESTEnv()), or when awaiting necessary communication (due to functions like calcTotalProb()). Furthermore, Qureg created with createQureg() will automatically disable distribution (and be harmlessly cloned upon every node) when they are too small to outweigh the performance overheads.

This can make monitoring difficult; CPU loads on different nodes can correspond to different stages of execution, and memory loads may fail to distinguish whether a large Qureg is distributed or a small Qureg is duplicated! Further, a node reaching the end of the program and terminating does not indicate the simulation has finished - other desynchronised nodes may still be working.

It is ergo always prudent to explicitly call syncQuESTEnv() immediately before starting and ending a performance timer. This way, the recorded runtime should reflect that of the slowest node (and ergo, the full calculation) rather than that of the node which happened to have its timer output logged.
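
The same syncQuESTEnv() pattern shown in the GPU benchmarking example above can be packaged into a small helper, so that every node records (approximately) the duration of the slowest node and any node's log is representative of the full calculation. A sketch, again assuming C11's timespec_get():

#include "quest.h"
#include <time.h>

// times routine(qureg), synchronising all nodes (and any GPUs) at both ends
// so that every node measures the duration of the slowest participant
double timeRoutine(void (*routine)(Qureg), Qureg qureg) {
    syncQuESTEnv();
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);

    routine(qureg);

    syncQuESTEnv();
    timespec_get(&t1, TIME_UTC);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}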


Multi-GPU

TODO:

  • explain usecases (multi local GPU, multi remote GPU, hybrid)
  • explain GPUDirect
  • explain CUDA-aware MPI
  • explain UCX
  • detail environment variables
  • detail controlling local vs distributed gpus with device visibility

A helpful ARCHER2 snippet:

# Compute the raw process ID for binding to GPU and NIC
lrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))
# Bind the process to the correct GPU and NIC
export CUDA_VISIBLE_DEVICES=${lrank}
export UCX_NET_DEVICES=mlx5_${lrank}:1

Supercomputers

A QuEST executable is launched like any other in supercomputing settings, including when distributed. For convenience however, we offer some example SLURM and PBS job submission scripts to deploy QuEST in various configurations. These examples assume QuEST and the user source have already been compiled, as guided in compile.md.

Note
These submission scripts are only illustrative. It is likely that the necessary configuration and commands on your own supercomputing facility differ!

SLURM

4 machines each with 8 CPUs:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
OMP_NUM_THREADS=8 mpirun ./myexec

1 machine with 4 local GPUs:

#SBATCH --nodes=1
#SBATCH --tasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --distribution=block:block
#SBATCH --hint=nomultithread
srun ./myexec

1024 machines, each with 16 local GPUs (dividing each Qureg between 16384 partitions):

#SBATCH --nodes=1024
#SBATCH --tasks-per-node=16
#SBATCH --gres=gpu:16
#SBATCH --distribution=block:block
#SBATCH --hint=nomultithread
srun ./myexec

PBS

4 machines each with 8 CPUs:

#PBS -l select=4:ncpus=8
OMP_NUM_THREADS=8 aprun -n 4 -d 8 -cc numa_node ./myexec