The Quantum Exact Simulation Toolkit v4.0.0
Launching your compiled QuEST application can be as straightforward as running any other executable, though some additional steps are needed to make use of hardware acceleration. This page explains how to launch your own QuEST applications on different platforms; how to run the examples and unit tests; how to make use of multithreading, GPU-acceleration, distribution and supercomputer job schedulers; and how to monitor hardware utilisation.
All compiled executables are saved in the `build` directory. See `compile.md` for instructions on compiling the examples.
The example source codes are located in `examples/` and are divided into subdirectories, wherein each `file.c` and `file.cpp` respectively demonstrate QuEST's C11 and C++14 interfaces. These files are compiled into executables of the same name, respectively prefixed with `c_` or `cpp_`, and saved in subdirectories of `build` which mimic the structure of `examples/`.
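For instance, a hypothetical example pair (all file and directory names here are illustrative, not taken from the repository) maps to executables as follows:

```sh
# illustrative sources
#   examples/matrices/initialisation.c
#   examples/matrices/initialisation.cpp
# are compiled into correspondingly-prefixed executables (exact layout may differ)
ls build/examples/matrices
# c_initialisation  cpp_initialisation
```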
Most of these executables can be run directly from within `build`, while others additionally require command-line arguments, as sketched below.
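A minimal sketch, reusing the illustrative names above (whether an example takes arguments, and what they mean, is specific to that example and assumed here):

```sh
# from within build/, run an argument-free example directly
./examples/matrices/c_initialisation

# others additionally expect command-line arguments, e.g. a problem size
./examples/matrices/cpp_initialisation 20
```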
See `compile.md` for instructions on compiling the v4 and v3 unit tests.
QuEST's unit and integration tests are compiled into an executable called `tests` within the `tests/` subdirectory, and can be run directly from within the `build` folder. After some time, the run should conclude with a Catch2 summary of the passed tests.
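For example, from within `build/` (the relative path follows from the layout described above):

```sh
./tests/tests
```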
This `tests` binary accepts all of the Catch2 CLI arguments, for example to run specific tests, to run all tests within certain groups, or to run specific test sections and subsections, as sketched below.
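A sketch of some standard Catch2 filters (the test, tag and section names below are hypothetical):

```sh
# run a specific test by name
./tests/tests applyHadamard

# run all tests within certain tag groups
./tests/tests "[unitaries],[operations]"

# run specific sections and subsections of a test
./tests/tests applyHadamard -c "correctness" -c "statevector"
```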
If the tests were compiled with distribution enabled, they can be distributed via `mpirun`.
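For instance, to distribute them over (e.g.) 8 nodes:

```sh
mpirun -np 8 ./tests/tests
```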
Alternatively, the tests can be run through CTest from within the `build` directory, which will log each passing test live. Alas, tests launched in this way cannot be deployed with distribution.
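Assuming the standard CMake/CTest integration, either invocation below runs them:

```sh
# from within build/, via CTest directly...
ctest

# ...or through the generated build-system target
make test
```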
The deprecated tests, when compiled, can be run from the `build` directory via their own executable, which accepts the same Catch2 CLI arguments as the v4 tests above, and can be distributed in the same way.
They can also be launched through CTest, as above.
See `compile.md` for instructions on compiling with multithreading enabled.

The number of threads to use is decided before launching the compiled executable, using the `OMP_NUM_THREADS` environment variable.
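For example (the executable name is illustrative):

```sh
# launch with 32 OpenMP threads
OMP_NUM_THREADS=32 ./myexec
```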
It is prudent to choose as many threads as your CPU(s) have total hardware threads or cores, which need not be a power of `2`. One can view this, and verify the number of available threads at runtime, by calling `reportQuESTEnv()`, which outputs a subsection detailing the available CPUs and their hardware threads.
`OMP_NUM_THREADS` specifies the number of threads per node, and so should ordinarily be the number of hardware threads (or cores) per machine.

The availability of multithreaded deployment can also be checked at runtime using `reportQuESTEnv()`, whose output includes an `Omp` line (signifying OpenMP) with two `1` flags which respectively indicate that multithreading has been compiled and is runtime enabled.
Like all programs, the CPU utilisation of a running QuEST program can be viewed using external monitoring tools:
OS | Program | Method |
---|---|---|
Linux | HTOP | Run htop in terminal |
MacOS | Activity Monitor | Place on dock > right click icon > Monitors > Show CPU usage (see here) |
Windows | Task Manager | Performance > CPU > right click graph > Change graph to > Logical processors (see here) |
Note however that QuEST will not leverage multithreading at runtime when either:
- the `Qureg` created with `createQureg()` was too small to invoke automatic multithreading.
- the user called `initCustomQuESTEnv()` and disabled multithreading for all subsequently-created `Qureg`.

Usage of multithreading can be (inadvisably) forced using `createForcedQureg()` or `createCustomQureg()`.
Performance may be improved by setting other OpenMP variables. Keep in mind that for large `Qureg`, QuEST's runtime is dominated by the costs of modifying large memory structures during long, uninterrupted loops: namely the updating of statevector amplitudes. Some sensible settings include:
- `OMP_DYNAMIC=false` to disable the costly runtime migration of threads between cores.
- `OMP_PROC_BIND=spread` to (attemptedly) give threads their own caches (see here). Replace this with `KMP_AFFINITY` on Intel compilers.
- `OMP_PLACES=threads` to allocate each spawned thread to a CPU hardware thread. Alternatively set `=cores` to assign one thread per core, helpful when the hardware threads interfere (e.g. due to caching conflicts).

OpenMP experts may further benefit from knowing that QuEST's multithreaded source code, confined to `cpu_subroutines.cpp`, is almost exclusively composed of uniform parallelised loops which never specify a `schedule` nor invoke setters in the runtime library routines. As such, all behaviour can be strongly controlled using environment variables, as sketched below.
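A minimal sketch, setting the suggested variables before launching (the executable name and values are illustrative):

```sh
export OMP_NUM_THREADS=64
export OMP_DYNAMIC=false
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
./myexec
```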
See `compile.md` for compiling with `CUDA` or `HIP` enabled (to utilise NVIDIA and AMD GPUs respectively).

The compiled executable is launched like any other.
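For instance (assuming an executable named `myexec`, consistent with the distribution examples later on this page):

```sh
./myexec
```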
Using multiple available GPUs, regardless of whether they are local or distributed, is done through additionally enabling distribution.
To check at runtime whether GPU-acceleration was compiled and is being actively utilised, call `reportQuESTEnv()`. This will display a subsection wherein two `1` flags indicate GPU-acceleration was respectively compiled and is runtime available (i.e. QuEST has found suitable GPUs). When this is the case, another section will be displayed detailing the discovered hardware properties.
Utilisation can also be externally monitored using third-party tools:
GPU | Type | Name |
---|---|---|
NVIDIA | CLI | nvidia-smi |
NVIDIA | GUI | Nsight |
AMD | CLI, GUI | amdgpu_top |
Note however that GPU-acceleration might not be leveraged at runtime when either:
- the `Qureg` created with `createQureg()` was too small to invoke automatic GPU-acceleration.
- the user called `initCustomQuESTEnv()` and disabled GPU-acceleration for all subsequently-created `Qureg`.

Usage of GPU-acceleration can be (inadvisably) forced using `createForcedQureg()` or `createCustomQureg()`.
There are a plethora of environment variables which can be used to control execution on NVIDIA and AMD GPUs. We highlight only some below.
- Restricting which devices are visible, with `CUDA_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES`.
- Setting the device ordering (`CUDA_DEVICE_ORDER`) to `FASTEST_FIRST` or `PCI_BUS_ID`.

Beware that the CPU dispatches tasks to the GPU asynchronously. Control flow returns immediately to the CPU, which will proceed to other duties (like dispatching the next several quantum operations' worth of instructions to the GPU) while the GPU undergoes independent computation (goes brrrrr). This has no consequence to the user who uses only the QuEST API, which will automatically synchronise the CPU and GPU when necessary (like inside functions such as `calcTotalProb()`).
However, it does mean that code which seeks to benchmark QuEST must be careful to wait for the GPU to be ready before beginning the stopwatch, and to wait for the GPU to finish before stopping the stopwatch. This can be done with `syncQuESTEnv()`, which incidentally also ensures nodes are synchronised when distributed.
See `compile.md` for compiling with distribution enabled.

A distributed QuEST executable called `myexec` can be launched and distributed over (e.g.) `32` nodes using `mpirun`, or an equivalent launcher on some platforms (such as with Intel and Microsoft MPI), as sketched below.
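A sketch of both forms (the `mpiexec` variant is an assumption of the typical Intel/Microsoft MPI invocation):

```sh
# OpenMPI / MPICH style
mpirun -np 32 ./myexec

# Intel MPI / Microsoft MPI style
mpiexec -n 32 myexec
```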
Some supercomputing facilities however may require custom or additional commands, like SLURM's srun
command. See an excellent guide here, and the job submission guide below.
Note that:
- QuEST can be distributed over only a power of `2` number of nodes, i.e. `1`, `2`, `4`, `8`, `16`, ...
- `OMP_NUM_THREADS` will determine how many threads are used by each node (i.e. each MPI process). Ergo optimally deploying to `8` machines, each with `64` CPUs (a total of `512` CPUs), might resemble the sketch shown after this list.
- It is sometimes convenient (mostly for testing) to deploy QuEST across more nodes than there are available machines and sockets, inducing a gratuitous slowdown. Some MPI compilers like OpenMPI forbid this by default, requiring additional commands to permit oversubscription.
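A minimal sketch of the hybrid deployment above, plus OpenMPI's oversubscription flag (the executable name is illustrative):

```sh
# 8 nodes, each running one MPI process with 64 OpenMP threads (512 CPUs total)
export OMP_NUM_THREADS=64
mpirun -np 8 ./myexec

# oversubscribe: run more processes than there are machines (OpenMPI)
mpirun --oversubscribe -np 16 ./myexec
```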
TODO:
- detail environment variables
QuEST strives to reduce inter-node communication when performing distributed simulation, which can otherwise dominate runtime. Between these rare communications, nodes work in complete independence and are likely to desynchronise, especially when performing operations with non-uniform loads. In fact, many-controlled quantum gates are skipped by non-participating nodes which would otherwise wait idly!
Nodes will only synchronise when forced by the user (with `syncQuESTEnv()`), or when awaiting necessary communication (due to functions like `calcTotalProb()`). Furthermore, `Qureg` created with `createQureg()` will automatically disable distribution (and be harmlessly cloned upon every node) when they are too small to outweigh the performance overheads.

This can make monitoring difficult; CPU loads on different nodes can correspond to different stages of execution, and memory loads may fail to distinguish whether a large `Qureg` is distributed or a small `Qureg` is duplicated! Further, a node reaching the end of the program and terminating does not indicate the simulation has finished - other desynchronised nodes may still be working.

It is ergo always prudent to explicitly call `syncQuESTEnv()` immediately before starting and ending a performance timer. This way, the recorded runtime should reflect that of the slowest node (and ergo, the full calculation) rather than that of the node which happened to have its timer output logged.
TODO:
- explain usecases (multi local GPU, multi remote GPU, hybrid)
- explain GPUDirect
- explain CUDA-aware MPI
- explain UCX
- detail environment variables
- detail controlling local vs distributed gpus with device visibility
helpful ARCHER2 snippet:
```sh
# Compute the raw process ID for binding to GPU and NIC
lrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))

# Bind the process to the correct GPU and NIC
export CUDA_VISIBLE_DEVICES=${lrank}
export UCX_NET_DEVICES=mlx5_${lrank}:1
```
A QuEST executable is launched like any other in supercomputing settings, including when distributed. For convenience however, we offer some example SLURM and PBS job submission scripts to deploy QuEST in various configurations. These examples assume QuEST and the user source have already been compiled, as guided in `compile.md`.
4 machines each with 8 CPUs:
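A minimal SLURM sketch (not the original example script; node, task and thread counts follow the heading above, while site-specific options such as account, partition and walltime are omitted):

```sh
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8

# one MPI process per machine, with 8 OpenMP threads each
export OMP_NUM_THREADS=8
srun ./myexec
```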
1 machine with 4 local GPUs:
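A hypothetical SLURM sketch assigning one MPI process per local GPU (the GPU request syntax and per-process device binding differ between facilities; see the ARCHER2 snippet above):

```sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

# one MPI process per GPU; threading is left to the GPU
export OMP_NUM_THREADS=1
srun ./myexec
```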
1024 machines with 16 local GPUs (divides `Qureg` between 16384 partitions):
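A hypothetical SLURM sketch at scale, again with one MPI process per GPU (directive values are assumptions following the heading above):

```sh
#!/bin/bash
#SBATCH --nodes=1024
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:16

# 1024 x 16 = 16384 MPI processes, i.e. one Qureg partition per GPU
export OMP_NUM_THREADS=1
srun ./myexec
```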
4 machines each with 8 CPUs:
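And a hypothetical PBS (Pro) sketch of the same CPU configuration (directives are assumptions and vary by site):

```sh
#!/bin/bash
#PBS -l select=4:ncpus=8:mpiprocs=1

# run from the submission directory, one MPI process per machine with 8 threads each
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
mpirun -np 4 ./myexec
```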