The Quantum Exact Simulation Toolkit v4.0.0
🚀  Launching

Launching your compiled QuEST application can be as straightforward as running any other executable, though some additional steps are needed to make use of hardware acceleration. This page explains how to launch your own QuEST applications on different platforms, how to run the examples and unit tests, how to make use of multithreading, GPU-acceleration, distribution and supercomputer job schedulers, and how to monitor hardware utilisation.


Note
This page assumes you are working in a build directory into which all executables have been compiled.

Examples

See compile.md for instructions on compiling the examples.

The example source codes are located in examples/ and are divided into subdirectories, e.g.

examples/
krausmaps/
initialisation.c
initialisation.cpp
reporters/
env.c
env.cpp
matrices.c
matrices.cpp
...

where file.c and file.cpp respectively demonstrate QuEST's C11 and C++14 interfaces. These files are compiled into executables of the same name, respectively prefixed with c_ or cpp_, and saved in subdirectories of build which mimic the structure of examples/. E.g.

build/
examples/
krausmaps/
c_initialisation
cpp_initialisation
reporters/
c_env
cpp_env
c_matrices
cpp_matrices
...

Most of these executables can be run directly from within build, e.g.

./examples/reporters/cpp_paulis

while others require command-line arguments:

./examples/reporters/c_env
# output
Must pass single cmd-line argument:
1 = serial
2 = multithreaded
3 = GPU-accelerated
4 = distributed
5 = all
6 = auto

Tests

See compile.md for instructions on compiling the v4 and v3 unit tests.

v4

QuEST's unit and integration tests are compiled into an executable called tests within the tests/ subdirectory, and can be run directly from within the build folder via

./tests/tests

which should, after some time, output something like

QuEST execution environment:
precision: 2
multithreaded: 1
distributed: 1
GPU-accelerated: 1
cuQuantum: 1
num nodes: 16
num qubits: 6
num qubit perms: 10
Tested Qureg deployments:
GPU + MPI
Randomness seeded to: 144665856
===============================================================================
All tests passed (74214 assertions in 240 test cases)

The tests binary accepts all of the Catch2 CLI arguments, for example to run specific tests

./tests/tests applyHadamard

or all tests within certain groups

./tests/tests "[qureg],[matrices]"

or specific test sections and subsections:

./tests/tests -c "validation" -c "matrix uninitialised"

If the tests were compiled with distribution enabled, they can be distributed via

mpirun -np 8 ./tests/tests

Alternatively, the tests can be run through CTest within the build directory via either

ctest
make test

which will log each passing test live, outputting something like

Test project /build
Start 1: calcExpecPauliStr
1/240 Test #1: calcExpecPauliStr ............................. Passed 14.03 sec
Start 2: calcExpecPauliStrSum
2/240 Test #2: calcExpecPauliStrSum .......................... Passed 10.06 sec
Start 3: calcExpecNonHermitianPauliStrSum
3/240 Test #3: calcExpecNonHermitianPauliStrSum .............. Passed 10.34 sec
Start 4: calcProbOfBasisState
4/240 Test #4: calcProbOfBasisState .......................... Passed 0.33 sec
Start 5: calcProbOfQubitOutcome
5/240 Test #5: calcProbOfQubitOutcome ........................ Passed 0.12 sec
Start 6: calcProbOfMultiQubitOutcome
6/240 Test #6: calcProbOfMultiQubitOutcome ................... Passed 15.07 sec
Start 7: calcProbsOfAllMultiQubitOutcomes
...

Alas, tests launched in this way cannot be deployed with distribution.

v3

The deprecated tests, when compiled, can be run from the build directory via

./tests/deprecated/dep_tests

which accepts the same Catch2 CLI arguments as the v4 tests above, and can be distributed the same way.

To launch the tests with CTest, run

cd tests/deprecated
ctest
Attention
The deprecated unit tests are non-comprehensive and the deprecated API should not be relied upon, for it may introduce undetected corner-case bugs. Please use the deprecated API and tests only to assist in porting your application from QuEST v3 to v4.

Multithreading

Note
Parallelising QuEST over multiple cores and CPUs requires first compiling with multithreading enabled, as detailed in compile.md.

Choosing threads

The number of threads to use is chosen before launching the compiled executable, via the OMP_NUM_THREADS environment variable, set either inline or exported beforehand:

OMP_NUM_THREADS=32 ./myexec
export OMP_NUM_THREADS=32
./myexec

It is prudent to choose as many threads as your CPU(s) have total hardware threads or cores, which need not be a power of 2. One can view this, and verify the number of available threads at runtime, by calling reportQuESTEnv(), which outputs a subsection such as

[cpu]
numCpuCores.......10 per machine
numOmpProcs.......10 per machine
numOmpThrds.......32 per node
Note
When running distributed, variable OMP_NUM_THREADS specifies the number of threads per node and so should ordinarily be the number of hardware threads (or cores) per machine.

Monitoring utilisation

The availability of multithreaded deployment can also be checked at runtime using reportQuESTEnv(), which outputs something like:

[compilation]
isOmpCompiled...........1
[deployment]
isOmpEnabled............1

where Omp signifies OpenMP, and the two 1s respectively indicate that multithreading has been compiled and is enabled at runtime.
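
For reference, such a report can be produced from within your own application before any Qureg is created. Below is a minimal sketch which assumes QuEST v4's quest.h header and its initQuESTEnv()/finalizeQuESTEnv() setup routines; adapt the names to your installed API.

#include "quest.h"

int main() {
    // prepare the QuEST environment (threads, GPUs, MPI) before any Qureg is created
    initQuESTEnv();

    // print compilation, deployment and hardware details, including the [cpu] subsection above
    reportQuESTEnv();

    finalizeQuESTEnv();
    return 0;
}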

Like that of any program, the CPU utilisation of a running QuEST application can be viewed using:

OS      | Program          | Method
Linux   | htop             | Run htop in a terminal
macOS   | Activity Monitor | Place on dock > right click icon > Monitors > Show CPU usage (see here)
Windows | Task Manager     | Performance > CPU > right click graph > Change graph to > Logical processors (see here)

Note however that QuEST will not always leverage multithreading at runtime; for example, createQureg() automatically disables multithreading for Qureg which are too small to benefit from parallelisation.

Usage of multithreading can be (inadvisably) forced using createForcedQureg() or createCustomQureg().
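
Forced deployment might look like the sketch below, which assumes createForcedQureg() accepts only the qubit count and destroyQureg() accepts only the Qureg; consult your installed API documentation before relying on these signatures.

#include "quest.h"

int main() {
    initQuESTEnv();

    // create a 10-qubit statevector which forcibly uses every deployment enabled
    // at compile time, even when automatic deployment would deem it unhelpful
    Qureg qureg = createForcedQureg(10);

    // ... operate upon qureg as normal ...

    destroyQureg(qureg);
    finalizeQuESTEnv();
    return 0;
}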

Improving performance

Performance may be improved by setting other OpenMP variables. Keep in mind that for large Qureg, QuEST's runtime is dominated by the costs of modifying large memory structures during long, uninterrupted loops: namely the updating of statevector amplitudes. Some sensible settings include

  • OMP_DYNAMIC=false to disable the costly runtime migration of threads between cores.
  • OMP_PROC_BIND=spread to attempt to give threads their own caches (see here).

    Replace this with KMP_AFFINITY on Intel compilers.

  • OMP_PLACES=threads to pin each spawned thread to a CPU hardware thread. Alternatively, set OMP_PLACES=cores to assign one thread per core, helpful when hardware threads interfere (e.g. due to caching conflicts).

OpenMP experts may further benefit from knowing that QuEST's multithreaded source code, confined to cpu_subroutines.cpp, is almost exclusively code similar to

#pragma omp parallel for if(qureg.isMultithreaded)
for (qindex n=0; n<numIts; n++)
    // ... modify amplitude n ...

#pragma omp parallel for reduction(+:val)
for (qindex n=0; n<numIts; n++)
    val += // ... contribution from amplitude n

and never specifies a schedule nor invokes the setters of the OpenMP runtime library. As such, all threading behaviour can be controlled externally through the standard OpenMP environment variables.

Remarks
Sometimes the memory bandwidth between different sockets of a machine is poor, and it is substantially better to exchange memory in bulk between their NUMA nodes rather than through repeated random access. In such settings, it can be worthwhile to hybridise multithreading and distribution even upon a single machine, partitioning same-socket threads into their own MPI node. This forces inter-socket communication to happen in batch, via message-passing, at the expense of using double the total memory (to store communication buffers). See the distributed section.

GPU-acceleration

Note
Using GPU-acceleration requires first compiling QuEST with CUDA or HIP enabled (to utilise NVIDIA and AMD GPUs respectively) as detailed in compile.md.

Launching

The compiled executable is launched like any other, via

./myexec

Using multiple GPUs, whether they are local to a single machine or spread across several, is done by additionally enabling distribution.

Monitoring

To check at runtime whether GPU-acceleration was compiled and is being actively utilised, call reportQuESTEnv(). This will display a subsection like

[compilation]
isGpuCompiled...........1
[deployment]
isGpuEnabled............1

where the 1s respectively indicate that GPU-acceleration was compiled and is available at runtime (i.e. QuEST has found suitable GPUs). When this is the case, another section will be displayed detailing the discovered hardware properties, e.g.

[gpu]
numGpus...........2
gpuDirect.........0
gpuMemPools.......1
gpuMemory.........15.9 GiB per gpu
gpuMemoryFree.....15.6 GiB per gpu
gpuCache..........0 bytes per gpu

Utilisation can also be externally monitored using third-party tools:

GPU    | Type     | Name
NVIDIA | CLI      | nvidia-smi
NVIDIA | GUI      | Nsight
AMD    | CLI, GUI | amdgpu_top

Note however that GPU-acceleration might not be leveraged at runtime; for example, createQureg() automatically disables GPU-acceleration for Qureg which are too small to outweigh the overheads of GPU dispatch.

Usage of GPU-acceleration can be (inadvisably) forced using createForcedQureg() or createCustomQureg().

Configuring

There are a plethora of environment variables which can be used to control execution on NVIDIA and AMD GPUs. We highlight only some below.

  • Choose which of the available GPUs QuEST is permitted to utilise via CUDA_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES.
  • Alternatively, set the order of selected GPUs (CUDA_DEVICE_ORDER) to FASTEST_FIRST or PCI_BUS_ID.
    • In single-GPU mode, this informs which GPU QuEST will use (i.e. the first).
    • In multi-GPU mode, this informs which local GPUs are used.

Benchmarking

Beware that the CPU dispatches tasks to the GPU asynchronously. Control flow returns immediately to the CPU, which proceeds to other duties (like dispatching the next several quantum operations' worth of instructions to the GPU) while the GPU undergoes independent computation (goes brrrrr). This has no consequence for users of the QuEST API, which automatically synchronises the CPU and GPU when necessary (such as inside functions like calcTotalProb()).

However, it does mean that code which seeks to benchmark QuEST must take care to wait for the GPU to be ready before starting the stopwatch, and to wait for the GPU to finish before stopping it. This can be done with syncQuESTEnv(), which incidentally also ensures nodes are synchronised when distributed.
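
An illustrative benchmarking sketch is below, assuming the v4 quest.h header, the initQuESTEnv()/finalizeQuESTEnv() setup routines, a single-target applyHadamard(), and C11's timespec_get(); substitute your own circuit and preferred timer.

#include "quest.h"
#include <stdio.h>
#include <time.h>

int main() {
    initQuESTEnv();

    int numQubits = 20;
    Qureg qureg = createQureg(numQubits);

    // wait for the GPU (and any distributed nodes) to finish prior work
    syncQuESTEnv();
    struct timespec start, stop;
    timespec_get(&start, TIME_UTC);

    // the circuit being benchmarked; these calls may return before the GPU finishes
    for (int q = 0; q < numQubits; q++)
        applyHadamard(qureg, q);

    // wait for all asynchronously dispatched GPU work to complete
    syncQuESTEnv();
    timespec_get(&stop, TIME_UTC);

    double secs = (stop.tv_sec - start.tv_sec) + (stop.tv_nsec - start.tv_nsec) / 1e9;
    printf("circuit took %g s\n", secs);

    destroyQureg(qureg);
    finalizeQuESTEnv();
    return 0;
}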


Distribution

Note
Distributing QuEST over multiple machines requires first compiling with distribution enabled, as detailed in compile.md.
Important
Simultaneously using distribution and GPU-acceleration introduces additional considerations, detailed in the subsequent Multi-GPU section.

Launching

A distributed QuEST executable called myexec can be launched and distributed over (e.g.) 32 nodes using mpirun:

mpirun -np 32 ./myexec

or on some platforms (such as with Intel and Microsoft MPI):

mpiexec -n 32 myexec.exe

Some supercomputing facilities however may require custom or additional commands, such as SLURM's srun. See an excellent guide here, and the job submission examples below.

srun --nodes=8 --ntasks-per-node=4 --distribution=block:block ./myexec
Important
QuEST can only be distributed with a power of 2 number of nodes, i.e. 1, 2, 4, 8, 16, ...
Note
When multithreading is also enabled, the environment variable OMP_NUM_THREADS will determine how many threads are used by each node (i.e. each MPI process). Ergo optimally deploying to 8 machines, each with 64 CPUs (a total of 512 CPUs), might resemble:
OMP_NUM_THREADS=64 mpirun -np 8 ./myexec

It is sometimes convenient (mostly for testing) to deploy QuEST across more nodes than there are available machines and sockets, inducing a gratuitous slowdown. Some MPI implementations, like OpenMPI, forbid this by default and require an additional flag to permit oversubscription.

mpirun -np 1024 --oversubscribe ./mytests

Configuring

TODO:

  • detail environment variables

Benchmarking

QuEST strives to reduce inter-node communication when performing distributed simulation, which can otherwise dominate runtime. Between these rare communications, nodes work in complete independence and are likely to desynchronise, especially when performing operations with non-uniform loads. In fact, many-controlled quantum gates are skipped by non-participating nodes which would otherwise wait idly!

Nodes will only synchronise when forced by the user (with syncQuESTEnv()), or when awaiting necessary communication (due to functions like calcTotalProb()). Furthermore, Qureg created with createQureg() will automatically disable distribution (and be harmlessly cloned upon every node) when they are too small to outweigh the performance overheads.

This can make monitoring difficult; CPU loads on different nodes can correspond to different stages of execution, and memory loads may fail to distinguish whether a large Qureg is distributed or a small Qureg is duplicated! Further, a node reaching the end of the program and terminating does not indicate the simulation has finished - other desynchronised nodes may still be working.

It is ergo always prudent to explicitly call syncQuESTEnv() immediately before starting and ending a performance timer. This way, the recorded runtime should reflect that of the slowest node (and ergo, the full calculation) rather than that of the node which happened to have its timer output logged.
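
The same syncQuESTEnv() pattern shown in the GPU benchmarking example above can be packaged into a small helper, so that every node records (approximately) the duration of the slowest node and any node's log is representative of the full calculation. A sketch, again assuming C11's timespec_get():

#include "quest.h"
#include <time.h>

// times routine(qureg), synchronising all nodes (and any GPUs) at both ends
// so that every node measures the duration of the slowest participant
double timeRoutine(void (*routine)(Qureg), Qureg qureg) {
    syncQuESTEnv();
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);

    routine(qureg);

    syncQuESTEnv();
    timespec_get(&t1, TIME_UTC);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}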


Multi-GPU

TODO:

  • explain usecases (multi local GPU, multi remote GPU, hybrid)
  • explain GPUDirect
  • explain CUDA-aware MPI
  • explain UCX
  • detail environment variables
  • detail controlling local vs distributed gpus with device visibility

A helpful ARCHER2 snippet:

# Compute the raw process ID for binding to GPU and NIC
lrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))
# Bind the process to the correct GPU and NIC
export CUDA_VISIBLE_DEVICES=${lrank}
export UCX_NET_DEVICES=mlx5_${lrank}:1

Supercomputers

A QuEST executable is launched like any other in supercomputing settings, including when distributed. For convenience however, we offer some example SLURM and PBS job submission scripts to deploy QuEST in various configurations. These examples assume QuEST and the user source have already been compiled, as guided in compile.md.

Note
These submission scripts are only illustrative. It is likely that the necessary configuration and commands on your own supercomputing facility differ!

SLURM

4 machines each with 8 CPUs:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
OMP_NUM_THREADS=8 mpirun ./myexec

1 machine with 4 local GPUs:

#SBATCH --nodes=1
#SBATCH --tasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --distribution=block:block
#SBATCH --hint=nomultithread
srun ./myexec

1024 machines, each with 16 local GPUs (dividing each Qureg between 16384 partitions):

#SBATCH --nodes=1024
#SBATCH --tasks-per-node=16
#SBATCH --gres=gpu:16
#SBATCH --distribution=block:block
#SBATCH --hint=nomultithread
srun ./myexec

PBS

4 machines each with 8 CPUs:

#PBS -l select=4:ncpus=8
OMP_NUM_THREADS=8 aprun -n 4 -d 8 -cc numa_node ./myexec