cobaltowl

We'll cross that bridge when we find it

Achieving real-time adjacent behaviour in Beaglebones

29-08-2023


Contrary to a common assumption, real time and high performance are not synonymous. A real time operation can take a long time to run; what we’re aiming for is predictable processing time under very specific constraints. 

The configurations and tips described in this article may result in significant performance improvements, but this is not its objective. You’ll notice that the standard deviation in iteration times is what we’ll be focusing on, not the average per se. 

Furthermore, a true real time system will yield much better results than what we’re doing here. No matter how hard we try, we’ll never get perfectly deterministic behavior from the main CPU (without giving up all of the amenities we expect from a fully fledged kernel and implementing our own solution). That’s what the PRU is for. 

The BeagleBone AI

The BeagleBone AI is a single-board computer sporting dual ARM processors (Cortex-A15), alongside a hefty amount of L3 on-chip cache (2.5 MB) and RAM (1 GB). However, the pièce de résistance of this little board is its 4 programmable real-time units (PRUs), running at 200 MHz. 

But what if the 4 PRUs aren’t enough for our project? Well, then we turn to the main CPU. However, it has to deal with tons of other tasks, giving us unpredictable timings. In this article, I’ll detail how to communicate efficiently, minimizing jitter and latency, while still keeping useful kernel functionality. 

Use cases 

This is aimed at cases where part of the communication will be done through the main CPU, while still maintaining a functioning network stack and keeping the system usable at all times, irrespective of the PRU being used or not. 

“Simplify, then add lightness” 

The default Beaglebone image comes with a ton of software we’re not going to use. Not only does it take up space, it also uses a fair amount of CPU time. Most of it is related to Cloud9 and Node-RED, so I’ve made a list of packages that normally go unused: 

apt remove bonescript c9-core-installer nginx nodejs nodejs-doc npm   \
    bb-node-red-installer alsa-utils pastebinit nginx-full nginx-common \
    bone101 mjpg-streamer javascript-common bluealsa

This leaves behind files that take up a lot of space, so remove them: 

rm -rf /opt/cloud9/ /var/lib/cloud9/

Even then, we’ll have a few services we don’t want, such as the WiFi tether and Bluetooth: 

systemctl disable bb-bbai-tether.service
systemctl disable bb-wl18xx-bluetooth.service

Disable the tether for good by setting TETHER_ENABLED to no in /etc/default/bb-wl18xx: 

sed -i "s/TETHER_ENABLED=.*/TETHER_ENABLED=no/g" /etc/default/bb-wl18xx

Scheduling

The Linux kernel assigns resources to tasks through scheduling. Scheduling itself can be divided into two categories: conventional and real time. 

Conventional processes can tolerate delays and follow a zen philosophy of “it’s done when it’s done”. If the system is too busy, a conventional process might take longer to run, resulting in variable execution times for the same task. 

Real time processes will get maximum “priority” when dealt with by the scheduler. Since real time processes are seen as urgent, delays are not tolerable. This means that if a real time process has to run, and it is already ready to run (in the runqueue), it will make all other conventional processes wait for it. 

Furthermore, real time processes have two policies for stipulating how the scheduler shall handle them: FIFO (first in, first out) and RR (round-robin). 

FIFO scheduling is straightforward: among processes of the same priority, the first one ready to run gets the CPU first and the last one gets it last. A running process keeps the CPU until it yields, blocks on an I/O operation, or is preempted by a higher-priority real time process; only then does another process get CPU time. 

RR scheduling is slightly more complicated: each process gets a set amount of CPU time (a time slice), and once that time is over, the scheduler moves on to the next process of the same priority. Priority still matters: a higher-priority real time process always runs before a lower-priority one, and the time slices only rotate among processes that share a priority. 

This is why we must be careful with scheduling; hogging too many resources will lag or even freeze the system until the process is finished. 

Given that, we’ll settle for real time scheduling, as we cannot tolerate delays or inconsistent timings. Even the maximum priority in conventional scheduling (through nice) isn’t enough to guarantee the behavior we want. 
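Before switching policies, it can help to see what the system offers. The following is a minimal sketch, assuming a Linux system with the standard POSIX scheduling calls; the helper names (priority_levels, policy_name, print_scheduling_info) are made up for this example.

```c
#include <sched.h>
#include <stdio.h>

/* Number of distinct priority levels a policy offers, or -1 on error. */
int priority_levels(int policy)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return -1;
    return hi - lo + 1;
}

/* Human-readable name for the three policies discussed above. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER (conventional)";
    case SCHED_FIFO:  return "SCHED_FIFO (real time)";
    case SCHED_RR:    return "SCHED_RR (real time)";
    default:          return "other";
    }
}

/* Print the calling process's policy and the real time priority ranges. */
void print_scheduling_info(void)
{
    printf("current policy: %s\n", policy_name(sched_getscheduler(0)));
    printf("SCHED_FIFO priorities: %d..%d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));
    printf("SCHED_RR priorities:   %d..%d\n",
           sched_get_priority_min(SCHED_RR),
           sched_get_priority_max(SCHED_RR));
}
```

Calling print_scheduling_info() from an ordinary program should report the conventional policy, since nothing has asked for real time scheduling yet.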

We can change scheduling for a given task using chrt, for example:  

chrt -r 99 <command>

However, the preferred way is to do it programmatically, in C: 

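#include <pthread.h>
#include <sched.h>

/* Run the calling thread under SCHED_FIFO at the highest priority.
   Needs root or CAP_SYS_NICE; returns 0 on success, an errno value otherwise. */
int use_max_fifo_priority(void)
{
    struct sched_param params = {0};
    params.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &params);
}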

This gives our thread maximum priority under the FIFO real time policy. You can benchmark to check whether RR suits your application better, or whether you can make do with a lower priority. 
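To make that kind of comparison easier, the policy and priority can be turned into parameters. This is only a sketch: clamp_priority and use_policy are hypothetical helper names, and the call is expected to fail with EPERM when the program lacks root or CAP_SYS_NICE.

```c
#include <pthread.h>
#include <sched.h>

/* Clamp a requested priority into the valid range for `policy`. */
int clamp_priority(int policy, int priority)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (priority < lo)
        return lo;
    if (priority > hi)
        return hi;
    return priority;
}

/* Apply `policy` to the calling thread at (roughly) the requested priority.
 * Returns 0 on success or an errno value such as EPERM. */
int use_policy(int policy, int priority)
{
    struct sched_param params = {0};
    params.sched_priority = clamp_priority(policy, priority);
    return pthread_setschedparam(pthread_self(), policy, &params);
}
```

For instance, use_policy(SCHED_RR, 50) tries round-robin at a mid-range priority, so a benchmark can be repeated with different arguments without touching the rest of the program.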

Real Time Kernel

No, using a kernel with the PREEMPT_RT patch will not automatically turn the Beaglebone into a real time powerhouse, but we do get massive benefits from it. Enabling PREEMPT_RT tells the kernel to make as many of its operations preemptible as possible. This means that the kernel will allow itself to be interrupted while executing code. 

In turn, kernel tasks are less likely to block our code from running when expected, which lowers the standard deviation of execution times. It does add some complexity, though: problems can arise around locks, for example, leading to complex locking (or lock-less) structures that might actually raise the average execution time. If you don’t need an RT kernel, don’t use it. 

Thankfully, TI has recently started releasing kernels with the PREEMPT_RT patch, which can be easily installed on the Beaglebone by utilizing the /opt/scripts/tools/update_kernel.sh script. 

To select an RT kernel, you must add the --ti-rt-kernel option when running the script. Just don’t forget to select a kernel version as well. For example: 

/opt/scripts/tools/update_kernel.sh --ti-rt-kernel --lts-5_10

Once it finishes, reboot and check that the RT kernel is running with uname -a. You should see PREEMPT RT (or PREEMPT_RT, depending on the kernel version) in the output of the command. 
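The same check can be done from inside a program, which is handy if your application should refuse to start on the wrong kernel. A minimal sketch, assuming the RT patch shows up in the version string as either PREEMPT_RT or PREEMPT RT (the exact spelling is an assumption that may vary between kernel releases); version_is_rt and running_rt_kernel are made-up names.

```c
#include <string.h>
#include <sys/utsname.h>

/* 1 if a kernel version string advertises the RT patch, else 0. */
int version_is_rt(const char *version)
{
    return strstr(version, "PREEMPT_RT") != NULL ||
           strstr(version, "PREEMPT RT") != NULL;
}

/* 1 if the running kernel looks like an RT kernel, 0 if it does not,
 * -1 if uname() itself fails. */
int running_rt_kernel(void)
{
    struct utsname info;
    if (uname(&info) != 0)
        return -1;
    return version_is_rt(info.version);
}
```

Note that a plain PREEMPT without the RT part is deliberately not accepted, since non-RT kernels built with ordinary preemption print it too.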

We can go further, however. Remember how the Beaglebone AI has two processors? Right now, other tasks (even a single one is too many) run on the same processor as our code, which is bad. We can reserve one processor for our code and leave the other for everything else. But how? 

Isolating CPUs and disabling IRQ balancing

When we isolate a CPU, we’re telling the scheduler that it should not even try to schedule anything on a given processor. We’re going to almost exclusively use this processor for our program, giving us improved performance and even lower jitter. 

This can be done in various ways, from shielding to cpuset (on /dev/cpuset) or isolcpus. Even though isolcpus is deprecated in favor of cpuset, it yields better results, so that’s what we’ll be using. 

In order to isolate CPUs, you must add isolcpus to the kernel command line on startup, which requires messing with our bootloader, u-boot. You can add isolcpus to the command line through the /boot/uEnv.txt file, on the line labeled cmdline. Since we’ve got two cores, pick either 0 or 1 and add isolcpus=<core> to the line (for example, isolcpus=1 to reserve the second core). 

Before rebooting, if you open htop, you’ll notice that too many tasks are running on our processor. If you don’t see the “CPU” column, you can enable it by pressing F2, going to “Columns” and enabling it. While you’re at it, disable “Hide kernel threads” on the “Display options” tab. 

After you reboot, you’ll be able to see that no non-kernel task is running on our CPU, but this can be improved further. 
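One thing to keep in mind: since the scheduler no longer places anything on the isolated CPU, our own program has to ask for it explicitly. From the shell, taskset -c <core> <command> does that. The sketch below does the same from C, assuming the kernel exposes the isolated CPUs in /sys/devices/system/cpu/isolated; the helper names are hypothetical, and the raw system call is used only to keep the example free of feature-test macros (sched_setaffinity from <sched.h> with _GNU_SOURCE is the usual route).

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* First CPU in a kernel CPU list such as "1" or "2-3,5"; -1 if empty. */
int first_cpu_in_list(const char *list)
{
    char *end;
    long cpu = strtol(list, &end, 10);
    if (end == list || cpu < 0)
        return -1;
    return (int)cpu;
}

/* First CPU the kernel reports as isolated, or -1 if there is none. */
int first_isolated_cpu(void)
{
    char buf[64] = "";
    FILE *f = fopen("/sys/devices/system/cpu/isolated", "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL)
        buf[0] = '\0';
    fclose(f);
    return first_cpu_in_list(buf);
}

/* Restrict the calling thread to a single CPU. Returns 0 on success,
 * -1 on failure. Equivalent to sched_setaffinity(2) with a one-bit mask. */
int pin_to_cpu(int cpu)
{
    unsigned long mask;
    if (cpu < 0 || cpu >= (int)(8 * sizeof mask))
        return -1;
    mask = 1UL << cpu;
    return (int)syscall(SYS_sched_setaffinity, 0, sizeof mask, &mask);
}
```

A program would typically call pin_to_cpu(first_isolated_cpu()) early in main, before switching to a real time policy.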

We’ll do this by disabling interrupt balancing. Interrupt balancing is the act of spreading interrupts across processors, so that multiple processors handle them. This might also affect our performance, so we’re disabling it. In order to do that, we add two more parameters to the kernel command line: acpi_irq_nobalance noirqbalance
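After a reboot it is worth confirming that the parameters actually made it onto the command line, since a typo in uEnv.txt fails silently. A small sketch that looks for a parameter in /proc/cmdline; cmdline_has_param and kernel_booted_with are names invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* 1 if `cmdline` contains the parameter `name`, either bare ("name")
 * or with a value ("name=value"); 0 otherwise. */
int cmdline_has_param(const char *cmdline, const char *name)
{
    size_t len = strlen(name);
    const char *p = cmdline;
    while (*p != '\0') {
        /* Skip separators before the next token. */
        while (*p == ' ' || *p == '\n')
            p++;
        if (strncmp(p, name, len) == 0 &&
            (p[len] == '\0' || p[len] == ' ' ||
             p[len] == '=' || p[len] == '\n'))
            return 1;
        /* Skip the rest of this token. */
        while (*p != '\0' && *p != ' ' && *p != '\n')
            p++;
    }
    return 0;
}

/* Same check against the running kernel; -1 if /proc/cmdline is unreadable. */
int kernel_booted_with(const char *name)
{
    char buf[4096];
    FILE *f = fopen("/proc/cmdline", "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL)
        buf[0] = '\0';
    fclose(f);
    return cmdline_has_param(buf, name);
}
```

With this, kernel_booted_with("isolcpus") and kernel_booted_with("noirqbalance") should both return 1 on a correctly configured board.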

That will result in a small improvement for our program, but we can go even further. If you check which processor IRQs are assigned to, through /proc/irq/*/smp_affinity, you’ll see that most (if not all) are assigned to either CPU, or even worse, assigned to our CPU directly. 

Checking which CPU an IRQ is bound to is easy. The value in smp_affinity is a bitmask: 

  1 = CPU 0 
  2 = CPU 1 
  3 = Either CPU 0 or CPU 1 (bitwise OR of 2 and 1) 
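The mask values above can be decoded with a couple of bit operations. A minimal sketch, assuming the file holds a single hexadecimal number, which should be the case on a two-core board (systems with many CPUs use comma-separated groups, which this does not handle); both function names are hypothetical.

```c
#include <stdlib.h>

/* Parse the hexadecimal text read from an smp_affinity file. */
unsigned long parse_affinity(const char *text)
{
    return strtoul(text, NULL, 16);
}

/* 1 if the mask allows the IRQ to be handled on `cpu`, else 0. */
int mask_allows_cpu(unsigned long mask, int cpu)
{
    return (int)((mask >> cpu) & 1UL);
}
```

An IRQ is safely away from our isolated CPU only when mask_allows_cpu returns 0 for that CPU.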

To change the CPU an IRQ is assigned to, echo either 1 or 2 to /proc/irq/<IRQ number>/smp_affinity. Doing this for every IRQ is a bit tedious, so I’ve made a bash script to automate that. 
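For readers who would rather do this from C, here is a rough equivalent of that loop. It is not the script mentioned above, just a sketch under the assumption that every numeric directory under the given base (normally /proc/irq) has an smp_affinity file; write_affinity and move_all_irqs are made-up names. Some IRQs cannot be moved, so failed writes are skipped rather than treated as fatal.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

/* Write `mask` in hexadecimal to one smp_affinity file. 0 on success. */
int write_affinity(const char *path, unsigned long mask)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%lx\n", mask) > 0;
    /* The kernel may only report the error when the write is flushed. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* 1 if the string is a non-empty run of decimal digits. */
static int is_number(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

/* Try to set every IRQ under `base` to `mask`. Returns how many writes
 * succeeded, or -1 if `base` cannot be opened. */
int move_all_irqs(const char *base, unsigned long mask)
{
    DIR *dir = opendir(base);
    if (dir == NULL)
        return -1;
    int moved = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        char path[512];
        if (!is_number(entry->d_name))
            continue;
        snprintf(path, sizeof path, "%s/%s/smp_affinity", base, entry->d_name);
        if (write_affinity(path, mask) == 0)
            moved++;
    }
    closedir(dir);
    return moved;
}
```

Run as root, move_all_irqs("/proc/irq", 0x1) would push everything it can onto CPU 0, assuming CPU 1 is the isolated one.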

Finishing off

Now that everything has been turned on, depending on your application, there should be a considerable improvement. However, it is always worth it to check which parts of your code can be optimized or outright removed.

Don’t forget to benchmark across several optimization levels and different settings. Finding what works for you is easier if you’ve got benchmarks to determine which implementation performs the way you need.