
The Truth about OLGA Software Speed

Recently, I witnessed a discussion on OLGA speed in the OLGA User group on LinkedIn. The discussion centered on why OLGA performs almost the same on two different CPUs: an Intel Core i7 processor at 3.4 GHz and an Intel Core i5 processor at 3.4 GHz. The result seems troubling, as the Core i5 is a significantly lower-end processor.
I have witnessed flow assurance companies buying expensive hardware to make OLGA run faster. However, these investments can have disastrous results. As a budding flow assurance specialist, I was privy to one of those misses: we discovered that OLGA ran slower on much more expensive new hardware than it did on computers that were only one year old. Since then, I’ve spent a lot of time researching OLGA speed and trying to understand how various factors affect OLGA performance.

I wanted to share my knowledge so that flow assurance companies can make better buying decisions. Along the way, I add data and analysis and examine how thread count affects OLGA speed. Torgeir Vanvik of Schlumberger provided great insight into the inner workings of OLGA. I hope this post sheds more light on the subject.
Key factors that influence OLGA simulation speed

Several key factors impact OLGA simulation speed. Some relate to the complexity of the numerical model, while others depend on the hardware OLGA runs on.

On the modeling side, the most obvious factor is the size and complexity of the network being modeled. A single-branch model runs faster than a network. Simple converging networks are more efficient than networks with diverging lines or looped connections. This is not something flow assurance engineers have much control over, so it is not worth dwelling on.

Next up are the section lengths and the numerical time step. The MINDT and MAXDT parameters in the INTEGRATION specification, together with the DTCONTROL parameters, control the simulation time step. Within those bounds, the CFL condition governs the time step to maintain numerical stability. The CFL condition limits how far the fluid can travel in one time step relative to the length of a section, so the longer a section is, the larger the time step can be. Section lengths and the INTEGRATION/DTCONTROL parameters therefore both affect model speed, and the smallest section in the model often determines it. I could write a complete treatise on this topic, but that is for another day.
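The CFL restriction described above can be illustrated with a small sketch. This is a generic CFL calculation, not OLGA's actual integration scheme, and the section lengths and velocity are made-up numbers:

```python
# Generic CFL time-step limit: dt <= C * dx / u, where dx is the
# section length, u the fluid velocity, and C the Courant number.
def max_cfl_timestep(section_lengths_m, velocity_m_s, courant=1.0):
    """Return the largest stable time step for the whole pipeline:
    the shortest section sets the limit for every section."""
    return courant * min(section_lengths_m) / velocity_m_s

# A pipeline discretized into sections of varying length; the 10 m
# section, not the 500 m ones, dictates the global time step.
sections = [500.0, 500.0, 250.0, 10.0, 250.0]
dt = max_cfl_timestep(sections, velocity_m_s=5.0)
print(dt)  # 2.0 seconds
```

Note how one short section drags the time step down for the entire model, which is exactly why the smallest section tends to control overall speed.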

The smallest section of the network is often the one that controls the model speed.

Hardware-wise, the main factors that influence simulation speed are I/O speed and CPU speed.
The processor

The clock speed and the number of cores are the two most important specifications of a modern CPU. The clock speed indicates how many instructions are processed per second, while the number of cores indicates how many instructions can be processed in parallel. Modern versions of OLGA (6 and higher) can exploit the power of multiple cores, while older versions (OLGA 5 and lower) cannot take advantage of multi-core processors.

No matter which version of OLGA you have, the clock speed is critical: it ultimately determines how many instructions can be processed per second, so the higher the processor's GHz, the better.

No matter which version of OLGA you have, the clock speed is vital

For OLGA 6 and later versions, speed is also affected by the number of cores. But it’s easy to fall for the false belief that more cores always means faster simulations. Some tasks can be processed in parallel, while others cannot. Splitting a task into smaller parallel pieces only pays off if the overhead of splitting and coordinating is smaller than the time saved by running in parallel. This means there are theoretical limits to the benefits of parallelization, depending on the problem. This is true for OLGA as well.

Answering the practical question, “Is a 3.4 GHz quad-core CPU better than a 2.4 GHz sixteen-core CPU?” takes some research into how parallelizable OLGA is. That topic is explored later in this article.

The theoretical limit on parallelization gains depends on the problem.
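This theoretical limit is commonly expressed through Amdahl's law: the serial fraction of a program caps the achievable speedup regardless of core count. A quick sketch, where the 95% parallel fraction is an assumed figure for illustration, not a measured OLGA value:

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: the serial fraction caps the achievable speedup
    no matter how many threads are thrown at the problem."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)

# Even if 95% of the work parallelizes, 16 threads give well under 16x.
for n in (1, 4, 16):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

With these assumed numbers, 4 threads yield roughly 3.5x but 16 threads only about 9x, which is why a higher-clocked CPU with fewer cores can beat a slower CPU with many cores.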

I/O

OLGA writes simulation results as it runs, so the speed at which results are written can limit the run-time speed. The hard drive speed when OLGA saves locally, and the network bandwidth when OLGA writes to a network drive, are both common hardware bottlenecks.

Most laptops and desktop computers are commercial grade and come with mechanical hard drives spinning at 5400 or 7200 rpm. Server-grade machines usually have 10k or 15k rpm drives. The drive’s spin speed directly affects read/write speed: the higher the spin speed, the better the drive performs for OLGA. Solid state drives (SSDs) are now affordable enough for commercial use. SSDs have no spinning platters and can be extremely fast, depending on the manufacturer and model; that said, not all SSDs are as fast as they claim to be. The computer's bus interface also determines internal data transfer speeds, but these days that interface is rarely the bottleneck. In the end, hard drive performance may be as important for simulation speed as the CPU.

In the end, hard drive performance may be as important for simulation speed as the CPU

When OLGA saves results to a network share, the network can limit OLGA’s ability to write simulation results. It is therefore important that companies ensure the bandwidth between the computer running OLGA and the network storage server is as large as possible. This helps avoid drops in OLGA speed.

When OLGA saves results to a network share, the network can also restrict OLGA’s ability to write simulation results

These bottlenecks can be mitigated by carefully considering the frequency and amount of output written from simulations.

Methodology

All models were run with profile outputs and no trend outputs to minimize the I/O effect on parallel speedup. Each model was run several times at each thread count to guarantee repeatable results: fast models ran as many as 20 times in 10 minutes, and even the slowest ran at least twice. Each model and thread combination was then assigned an average runtime. Notably, repeated iterations of a simulation had almost identical run times. This study used OLGA 2014.2 (see the acknowledgments at the end). The OMP_NUM_THREADS environment variable was used to control the number of threads OLGA uses. A thread is a component of a computer program that can be managed independently by the operating system; each core of a modern CPU can handle up to two threads.
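The averaging step can be sketched as follows. The runtimes below are made-up numbers purely to show the bookkeeping; the real values came from timed OLGA runs driven externally by setting OMP_NUM_THREADS before each run:

```python
# Hypothetical wall-clock runtimes (seconds) for one model, with
# several repeats per thread count.
runtimes = {
    1: [120.4, 119.8, 120.1],
    2: [65.0, 64.7, 65.3],
    4: [40.2, 39.9, 40.1],
}

# Average runtime per thread count.
avg = {n: sum(t) / len(t) for n, t in runtimes.items()}

# Parallel speedup is measured against the single-thread average.
speedup = {n: avg[1] / avg[n] for n in avg}
for n in sorted(speedup):
    print(n, round(speedup[n], 2))
```

Because the repeated iterations had almost identical run times, averaging over a handful of repeats is enough to get stable speedup numbers.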

All simulations were done on a machine with 4 physical cores, capable of running eight threads at a time.
Results

The first plot shows the speedup achieved by the various models. The ideal speedup line corresponds to a model with n threads running n times faster than with one thread, up to the number of cores OLGA runs on (in this case 4).

Analysis

The parallel speedup and efficiency plots show that parallelization efficiency varies between model types. The next question is what makes a model more parallelizable.

OLGA’s main calculation loop is the time loop, which advances time from the start of the simulation to the end. The initial sequential steps include reading input files, tab files, and so on. The final post-processing steps include closing file handles, releasing memory, and so on.

With this background in mind, we calculated parallel efficiency curves and fitted an exponential function of the following form:

\mu_p = e^{c(n_p - 1)}

Where

\mu_p is the parallel efficiency,

c is the parallel efficiency degradation factor, and

n_p is the number of threads.

I call the fitted factor c the parallel efficiency decay factor. It can then be plotted as a function of various aspects of the model. We find that the decay factor depends on the runtime of the model and the number of sections.
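Given the efficiency model above, c can be recovered from measured efficiencies with a log-linear least-squares fit, since ln(mu_p) = c(n_p - 1) is a line through the origin. A minimal pure-Python sketch, where the sample efficiencies are invented for illustration:

```python
import math

def fit_decay_factor(threads, efficiencies):
    """Least-squares fit of c in mu_p = exp(c * (n_p - 1)).
    Taking logs gives ln(mu_p) = c * (n_p - 1), a line through the
    origin, so c = sum(x*y) / sum(x*x) with x = n_p - 1."""
    x = [n - 1 for n in threads]
    y = [math.log(mu) for mu in efficiencies]
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Invented efficiencies that decay exactly with c = -0.1; the fit
# should recover that value.
threads = [1, 2, 4, 8]
effs = [math.exp(-0.1 * (n - 1)) for n in threads]
c = fit_decay_factor(threads, effs)
print(round(c, 3))  # -0.1
```

With noisy measured efficiencies the fit would not be exact, but the single parameter c still summarizes how quickly a given model's parallel efficiency decays as threads are added.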

Putting it all together

Let us return to hardware selection and how it impacts OLGA speed. The number of cores, the clock speed, and the I/O speed are all important factors. OLGA’s latest versions are multi-threaded and can run faster by using multiple processor cores. We conducted a detailed analysis to determine how much impact cores have on OLGA’s speed and whether spending money on more cores is a wise decision.

OLGA defaults to using as many threads as there are cores. We found that the best speedup with 4 threads was about 3, which is 75% parallel efficiency. In general, speedups are greater for more complex simulations; multi-threading didn’t help for short ones. Even in a lengthy simulation with 7000 sections, moving from 4 threads up to 8 threads didn’t increase the speedup from 3 to 4. As the number of threads increases, parallel efficiency drops. Based on our analysis, we concluded that 4 threads is the optimal number for running flow assurance models in OLGA. While you could tweak this for specific models, I don’t recommend spending too much time on it.
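The efficiency model makes this drop concrete. Using an assumed decay factor of c = -0.1 (chosen for illustration because it is close to the roughly 75% efficiency observed at 4 threads, not a fitted value from the study):

```python
import math

# Parallel efficiency model from the analysis: mu_p = exp(c*(n_p-1)).
# c = -0.1 is an assumed decay factor for illustration.
c = -0.1
for n in (1, 2, 4, 8):
    mu = math.exp(c * (n - 1))
    print(n, round(mu, 2))
```

Under this assumption, efficiency falls from about 0.74 at 4 threads to about 0.50 at 8, so each extra thread buys progressively less.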

For running flow assurance models in OLGA, four threads is a sweet spot.

The OLGA manual recommends using the cores for simultaneous simulations rather than for speeding up a single simulation. This is in keeping with our findings, but the advice is somewhat naive: most professional laptops and desktops have 4 cores but lack the hard drive speed to support 4 simulations writing data simultaneously. The right choice lies somewhere in the middle.

When buying hardware, 4 cores and a fast hard drive are the minimum you should consider. If you already have enough OLGA licenses and would like to centralize all your simulations on one machine, then the storage choice is just as important. The OMP_NUM_THREADS environment variable should be set to 4 to run OLGA at maximum parallel efficiency.