Domain decomposition

This section describes the Proof of Concept implementation for speeding up code execution in AeoLiS using domain decomposition and parallel programming.

Overview

When domains get larger (more gridpoints), the calculation slows down. Profiling AeoLiS reveals that solving the sparse system of equations (Ax=b) accounts for approximaely 50% of the total code execution time for running a simulation. This operation involves calling the below mentioned Scipy routine that operates on the model as one single domain.

Ct_i += scipy.sparse.linalg.spsolve(A, yCt_i.flatten())

We started to epxlore the use of “domain decomposition” to reduce the complexity (calculation time) of solving the sparse system of equations by dividing the large sparse sysyem equations into several smaller systems that can be computed in parallel.

Implementation details

  • Code changes for implementing domain decomposition and parallel programming in AeoLiS are in the branch parallel-update.

  • Issue tracker: Issue #115

  • Domain decomposition in AeoLis is implemented by dividing a large domain into several smaller overlapping sub domains and computing them in parallel. The parallel computation of the subdomains is facilitated by a pool of Python processes created using the multiprocessing module in Python. The results of the parallel computations are collected and combined to get the final result.

  • Simulation in parallel mode is enabled via a keyword parallel in the configuration file (aeolis.txt). When the keyword is set to T (True), the parallel implementation is enabled.

#create a pool of processes
pools = Pool(5)

#divide the domain into smaller subdomains

if p['parallell'] == True:
    d1 = np.int((np.floor(yCt_i.shape[0]*1/4)+1)*yCt_i.shape[1])
    d2 = np.int((np.floor(yCt_i.shape[0]*1/4))*yCt_i.shape[1])
    d3 = np.int((np.floor(yCt_i.shape[0]*2/4)+1)*yCt_i.shape[1])
    d4 = np.int((np.floor(yCt_i.shape[0]*2/4))*yCt_i.shape[1])
    d5 = np.int((np.floor(yCt_i.shape[0]*3/4)+1)*yCt_i.shape[1])
    d6 = np.int((np.floor(yCt_i.shape[0]*3/4))*yCt_i.shape[1])
# Solve the system of equations in parallel. Distribute the smaller computations among the pool of processes.

results = pools.starmap(scipy.sparse.linalg.spsolve,
                        [(A[0:d1,0:d1], yCt_i.flatten()[0:d1]),
                        (A[d2:d3,d2:d3], yCt_i.flatten()[d2:d3]),
                        (A[d4:d5,d4:d5], yCt_i.flatten()[d4:d5]),
                        (A[d6:,d6:], yCt_i.flatten()[d6:])
                        ])


# Collect the results from the pool of processes and combine them to get the final result.

Ct_i[0:d1] += results[0]
Ct_i[d2:d3] += results[1]
Ct_i[d4:d5] += results[2]
Ct_i[d6:] += results[3]

Experiments and Results

The parallel verison of AeoLiS is benchmarked for calculation time on the sandengine example with 2 different grid sizes. The simulation is run on a laptop computer with 4 cores and 8GB RAM.

The results are shown in the table below.

Table 1 Benchmark results for parallel version of AeoLiS

Configuration

Sequential version

Parallel version

Speed-up achieved

Sandengine with small grids

7 mins 2 secs

4 mins 23 secs

1.6X

Sandengine with large grids

1 hr 07 mins 30 secs

17 mins 17 secs

3.9X

Key findings

  • The speed-up achieved by using domain decomposition and parallel programming is significant for a big domain with a high resultion (many gridpoints). For smaller domains (less gridpoints), the overhead of distributing the subdomains among the available cores on the computer and collecting the results overshadows the overall speed-up achieved.

  • The parallel implementation fails the test that verifies the content of the output netCDF file generated by the parallel version against a reference output netCDF file generated by the sequential version of the code for the same configuration.

Reflections and To do

  • Further testing of the parallel implementiaion need to be reconsidered. Existing tests may not be applicable for the parallel version due to the overlap of the subdomains.

  • It is currently unclear if and under what circomstances the domain decompositions leads to additional inaccuracies. Validation of the output at places where domains connect is needed. If innacuracies emerge this could be a reason to introduce additional irerations. This would slow down the calculation.

  • The proof of concept implementation uses a hardcoded approach to divide the big domain into smaller subdomains. This approach is not generic and may not work for all use cases. Replacing the hardcoded approach with a generalized function to decompose the domains based on the domain properties could be a possible improvement, but this needs to be investigated further.

  • Running parallell calculations may be offered as a specialised (hidden) feature to prevent confusion by inexperienced users.

Try it yourself: Running AeoLiS in parallel mode

You can run AeoLiS in parallel mode by following the below steps:

  1. Clone the aeolis-python GitHub repository and checkout the parallel-update branch.

  2. Install AeoLiS in development mode by running the command pip install -e . from the root directory of the repository.

  3. Navigate to the folder examples/sandmotor_large_grids/ and run the command aeolis aeolis.txt to run the sandmotor simulation with large grids. The keyword parallel in the configuration file (aeolis.txt) is set to T (True) by default, so the simulation will run in parallel mode. This can also be confirmed by looking at the print statements in the terminal.

  4. You may not have sufficient memory available on your computer to run the sandengine large grids example. In that case, try running the sandmotor example with smaller grid size. Navigate to the folder examples/sandmotor_small_grids/ and run the command aeolis aeolis.txt to run the sandmotor simulation with small grids.

  5. If you would like to run AeoLiS in parallel mode for your own example, set the keyword parallel to T (True) in the configuration file.