# Parallel Computing with FEniCS

- Post by: gourav agrawal
- July 23, 2020
- No Comment

Parallel computing refers to breaking down a larger problem into independent smaller parts that can be executed simultaneously by multiple processors. The result is generated by combining results from all processors. It saves a lot of time and money compared to serial computing, where only one instruction is executed at any moment of time. In the field of computational mechanics, we need to solve matrices of the order million X million. If we try to solve such a problem with serial computing, we need huge RAM. A thumb rule to calculate the required RAM size is (Total degree of freedom/10000)^2 GB. So even for a million degree of freedom, we will need approximately 10000 GB RAM. This requirement can further dropdown if we use a sparse matrix that stores only non-zero elements. However, the RAM required to perform this calculation is still high. Such types of calculations can be possible with parallel computing. By increasing the core size, we can solve this problem at relatively less time.

Here we cannot increase the number of cores indefinitely. After a certain core, speed starts to decrease.

There are two kinds of costs associated with any algorithm.

- Computational cost: We can reduce this by parallelization if the logic is parallelizable
- Communication cost: With an increasing number of parallel processes, the communication cost increases.

Before performing any analysis on any workstation, we should have an idea about the number of degrees of freedom per core that the station can handle efficiently.

So, I performed a small parametric study to determine the optimum number of cores required for solving a computational problem with a certain number of elements.

This study consists of solving a simply supported beam problem using FEniCS with Python on a workstation, with 24 cores and 128 GB RAM. I used the parallelization in python with MPI. For more details about how to use MPI and parallelization, you can follow the link https://computationalmechanics.in/parallelizing-for-loop-in-python-with-mpi/

I run the code by varying the number of elements and the number of cores. The number of elements can be varied by redefining the mesh. I started with 100X100 (20000 triangular elements) and uniformly increased it to 3300 X 3300 (21.78 million triangular elements). Number of cores can be changed in the terminal by the command

```
mpirun -np 24 simplysupported.py
```

24 represents the number of cores and followed by the name of the code file. Time elapsed for solving the analysis is found out by the python command time.time() .

It was found that for element less than Half million, time taken by 6-8 cores is relatively lesser than that of other cores. When the number of elements are in the range of .5 to 20 million, 10-12 cores give efficient results. For 21 million elements, 22 cores are efficient. I am not able to perform analysis for elements of more than 21 million due to CPU constraints. The above relation has been inferred based on Figure 1

In figure 2, colorful lines show the number of elements ( in million ). It can be noticed that the impact of an increase in the number of cores for 0.08 million elements (blue line) is not significant but it is significant for the high number of the elements in terms of time. So the use of high core is actually beneficial when we use elements more than .72 million.

Figure 3 shows that as we increase the number of cores, the advantage of the high core can be taken only up to a certain core after that curve is becoming asymptotic. So there is no advantage in increasing the cores. Hence we can perform the analysis at lesser cores and can save resources and time. Point to be noted here is that for less number of elements (less than 20 million), High core numbers are not efficient as communication cost is more than computational cost.

So By doing the same study on a CPU having more processing power, we can validate these results and optimize the code by selecting the optimum number of cores to run the analysis. In real problems, where an analysis code runs for 20-30 days, this consideration can be proved significant.

**Tagged:**FEniCS, HPC, parallel processing