ROCoding #4: ARM Wrestling

Saturday, April 4, 2026

[Edit 2026-04-18: Fixed incorrect byte ordering of colour channels.]
It’s been 25 years — blimey! — since I did any serious ARM coding. The last was fixing some bugs in my Lisp interpreter in 2001, after which my Iyonix died and I moved over to Linux.

Back with RISC OS now, and in the intervening years the ARM processor has had some substantial changes. Originally it was the Acorn RISC Machine, of course, and it first appeared in the Acorn Archimedes computer in 1987. It was a genuine breakthrough at the time, a custom-designed (by Sophie Wilson et al) 32-bit processor running at 8MHz. I’m now using a 4té2, a repackaged Raspberry Pi 4b containing an ARM Cortex-A72, running at 1.8GHz — over 200 times faster.

And while the original ARM chips were indeed Reduced Instruction Set Computers, with only about 25 instructions[1], these days it’s something of a misnomer. So what’s been added? SIMD and NEON, mostly. This article is a simple introduction to using some SIMD instructions; we’ll cover NEON[2] later.

The Dwarf Mini smart scope — see other posts — saves image files as FITS files. This is a standardised format for storing scientific and astronomical images and has a few peculiarities, as I found when I started investigating it.

First, the pixel data is encoded as a stream of signed 16-bit two’s-complement integers (other numeric formats are possible, but that’s what I’m dealing with here). Why signed? I don’t know, but that’s the only format for 16-bit integers in the FITS specification. To get unsigned values we need to add a constant to each number; it is specified in the FITS header, and in this case is 32768. So we’re mapping the range -32768…32767 to 0…65535.

Second, FITS stores numbers the wrong way round (with respect to ARM processors, anyway), with the most significant byte first: big-endian, not little-endian.
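
As a cross-check away from the ARM, here is a minimal Python sketch of the two corrections just described. It assumes the raw pixel bytes have already been read from the file; the name decode_fits_pixels is made up for this illustration, and 32768 is the offset quoted above (the FITS standard stores it under the BZERO header keyword).

```python
import struct

BZERO = 32768  # offset from the FITS header; the value quoted in the text


def decode_fits_pixels(raw: bytes, bzero: int = BZERO) -> list[int]:
    """Read big-endian signed 16-bit samples and shift them to unsigned."""
    count = len(raw) // 2
    signed = struct.unpack(f">{count}h", raw[: count * 2])
    return [value + bzero for value in signed]


# Four samples: -32768, -1, 0 and 32767, most significant byte first.
sample = bytes([0x80, 0x00, 0xFF, 0xFF, 0x00, 0x00, 0x7F, 0xFF])
print(decode_fits_pixels(sample))  # [0, 32767, 32768, 65535]
```

This is only a reference for checking results; it says nothing about speed on RISC OS.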

It’s quite possible to cope with this in BASIC, though a bit of a faff. And slow. Which was what prompted me to investigate the new (to me) ARM features available. And the good people who look after RISC OS have ensured that BBC BASIC’s assembler has been updated to handle the new opcodes. To emphasise: in order to run these code fragments — the simple surrounding BASIC code to set up the assembler is left as an exercise — you’ll need to be running a recent RISC OS on a recent computer, with a recent BASIC. In BASIC, type HELP [ to get a list of ARM instructions — if it includes a load of interesting-sounding assembler mnemonics like UXTAB16, SXTH and UQASX you’re good to go.

SIMD

The ARM processor has 32-bit, 4-byte registers, and instructions usually operate on the whole 32-bit word. So add r0,r1,r2 adds r1 and r2 together, and puts the result in r0. SIMD stands for Single Instruction, Multiple Data, and lets us operate on only parts of a word. So, for example, you can perform four 8-bit additions in a single operation. It’s a kind of parallel processing, which can be very useful for dealing with images — remember that a full-colour pixel is a single word containing red, blue and green components in three of the four bytes.

To get the FITS data in a usable form, we first need to reverse the byte order of each 16-bit half-word. If we read a whole word of FITS data into register r1 we’ll have the values for two pixels, and the SIMD ARM instruction rev16 will do exactly what we need:

.hreverse
; half-word byte reversal
;  IN - r1 = pair of half-words
; OUT - r0 = half-word bytes reversed
rev16 r0,r1
mov   pc,lk

Calling machine code from BASIC initialises registers r0 to r7 with whatever is in the variables A% to H%. So we set the BASIC variable B% to some value, &03B2C1D0 here, and call this code with USR, which returns r0. The single instruction swaps the byte order of each of the 16-bit words in r1, and puts the result in r0 which is then returned. For example:

Original: &03B2 C1D0
Reversed: &B203 D0C1

I’ve separated the 16-bit half-words with a space to make it clearer. Actually, ARM processors can now operate with big-endian data, but changing that globally makes me very nervous…
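
To see why rev16 is the right fix for a little-endian load of big-endian data, here is a small Python model. The function rev16 below is my own stand-in for the instruction, and the two pixel values are invented for the example.

```python
import struct


def rev16(word: int) -> int:
    """Model of rev16: swap the two bytes inside each half-word."""
    b = (word & 0xFFFFFFFF).to_bytes(4, "little")
    return int.from_bytes(bytes([b[1], b[0], b[3], b[2]]), "little")


# Two big-endian pixels, &1234 then &ABCD, as they sit in the file.
file_bytes = bytes([0x12, 0x34, 0xAB, 0xCD])

# A little-endian word load, which is what the ARM does by default.
(loaded,) = struct.unpack("<I", file_bytes)
fixed = rev16(loaded)

print(f"Loaded  : &{loaded:08X}")             # &CDAB3412
print(f"Reversed: &{fixed:08X}")              # &ABCD1234
print(f"Example : &{rev16(0x03B2C1D0):08X}")  # &B203D0C1
```

After the reversal the first pixel in the file sits in the low half-word and the second in the high half-word.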

Now we need to add in the correction to get unsigned numbers; in other words, we normalise the data. Again, ARM can, er, lend a hand: the uadd16 instruction adds corresponding unsigned half-words together:

.hnormalise
; Half-word normalisation
;  IN - r1 = pair of half-words
;       r2 = offsets to normalise
; OUT - r0 = half-words added to the offsets
uadd16 r0,r1,r2
mov    pc,lk

Each half-word in r1 will be added to the corresponding half-word in r2. So with various values in B%, and &80008000 in C% (32768 in hex, in each half-word) we get:

Original  : &8000 8020
Normalised: &0000 0020

Original  : &7FFF 1234
Normalised: &FFFF 9234
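
The same two results can be reproduced with a short Python model of uadd16. This is only a sketch of the behaviour described here (two independent 16-bit additions that wrap round); the function name is mine.

```python
def uadd16(a: int, b: int) -> int:
    """Model of uadd16: add the two half-words separately, wrapping at 16 bits."""
    low = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    high = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (high << 16) | low


OFFSETS = 0x80008000  # 32768 in each half-word

for original in (0x80008020, 0x7FFF1234):
    result = uadd16(original, OFFSETS)
    print(f"Original  : &{original:08X}  Normalised: &{result:08X}")
```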

Of course, you’d normally amalgamate these routines (or even define a macro — it’s only two instructions!):

.hrevnormalise
; Half-word byte-order correction and normalisation
;  IN - r1 = pair of half-words
;       r2 = offsets to normalise
; OUT - r0 = half-words reversed, and added to the offsets
rev16  r0,r1
uadd16 r0,r0,r2
mov    pc,lk

What’s very important to realise is that the operations on each half-word are entirely separate. If the sum in the lower half-word exceeds the maximum value storable in 16 bits (65535), it won’t affect the bits in the other half-word; it just wraps round. Also note that these SIMD instructions don’t usually affect the N, Z, C and V flags (for obvious reasons: which half-word?), so you can’t use the S suffix with them. You can however use all the condition codes.

It’s interesting to try to code this using old-style ARM instructions. It’s not difficult, but involves a lot of bit-twiddling, rotating, masking and separate treatment of each half-word. And needs a lot more instructions!
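
For comparison, here is one possible whole-word approach, sketched in Python with only the operations the original ARM had (AND, OR, EOR and shifts). It is an illustration rather than a recommended routine. The shortcut in normalise_plain relies on the offset being exactly &8000: adding &8000 modulo 65536 only flips bit 15, so an exclusive-OR does the same job. It would not work for a general offset.

```python
def rev16_plain(word: int) -> int:
    """Byte-swap both half-words using masks, shifts and OR only."""
    swapped = ((word & 0x00FF00FF) << 8) | ((word >> 8) & 0x00FF00FF)
    return swapped & 0xFFFFFFFF


def normalise_plain(word: int) -> int:
    """Add &8000 to both half-words; modulo 65536 this just flips bit 15."""
    return word ^ 0x80008000


word = 0x03B2C1D0
print(f"Reversed  : &{rev16_plain(word):08X}")                   # &B203D0C1
print(f"Normalised: &{normalise_plain(rev16_plain(word)):08X}")  # &320350C1
```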

Saturation

I said above that results over the maximum storable will wrap round, so if you do a single byte addition of 230+70, you’ll get 44 (300-256). The uadd8 instruction adds the corresponding bytes in a pair of registers, like this:

.add_bytes
; uadd8 = add corresponding unsigned bytes
; IN - r1 B% = 4 byte values
;      r2 C% = 4 bytes to add
; OUT- r0 = result
uadd8 r0,r1,r2
mov   pc,lk

Adding &AAB0FFE6 to &442F0146 will give this, shown in hex and decimal:

Original  : &AA B0 FF E6 : 170 176 255 230
Plus      : &44 2F 01 46 :  68  47   1  70
Result    : &EE DF 00 2C : 238 223   0  44

Watch what happens with the usual add operation on the same inputs, add r0,r1,r2:

Word add  : &EE E0 01 2C : 238 224   1  44

The additions in the lowest two bytes have overflowed, causing carries into the next bytes.
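
The difference between the two additions can be reproduced with a short Python model. Both functions are my own stand-ins for the instructions, written from the behaviour described above.

```python
def uadd8(a: int, b: int) -> int:
    """Model of uadd8: four separate byte additions, each wrapping at 256."""
    result = 0
    for shift in (0, 8, 16, 24):
        lane = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        result |= lane << shift
    return result


def word_add(a: int, b: int) -> int:
    """Model of the ordinary add: one 32-bit addition, carries ripple upwards."""
    return (a + b) & 0xFFFFFFFF


a, b = 0xAAB0FFE6, 0x442F0146
print(f"uadd8   : &{uadd8(a, b):08X}")     # &EEDF002C
print(f"Word add: &{word_add(a, b):08X}")  # &EEE0012C
```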

The ARM offers saturated operations, where results are clamped to the minimum and maximum values storable in the bit-width. The uqadd8 instruction does unsigned saturated byte addition, like this:

; uqadd8 = unsigned add bytes, saturating
; IN - r1 B% = word
;      r2 C% = bytes to add
; OUT- r0 = result
uqadd8 r0,r1,r2
mov    pc,lk

With the same inputs as above, we get this:

Original  : &AA B0 FF E6 : 170 176 255 230
Plus      : &44 2F 01 46 :  68  47   1  70
Result    : &EE DF FF FF : 238 223 255 255

The overflow of the lowest two additions has been capped to 255.
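
A Python model of the saturating version, again only a sketch of the behaviour described here with a function name of my choosing:

```python
def uqadd8(a: int, b: int) -> int:
    """Model of uqadd8: four byte additions, each clamped to 255."""
    result = 0
    for shift in (0, 8, 16, 24):
        lane = min(((a >> shift) & 0xFF) + ((b >> shift) & 0xFF), 0xFF)
        result |= lane << shift
    return result


print(f"Result: &{uqadd8(0xAAB0FFE6, 0x442F0146):08X}")  # &EEDFFFFF
```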

One application of this would be simple brightness adjustment of an image. Full-colour pixels in RISC OS sprites are stored in a single word as 00BbGgRr, representing the blue, green and red channel values. (Though RISC OS can handle the 00RrGgBb format - PhotoDesk offers both options when saving.) To brighten an image, you just increase each channel byte like this:

; Brighten a pixel
; IN - r1 B% = colour word, in 00BbGgRr format
;      r2 C% = amount to brighten by (0..255)
; OUT- r0 = brightened pixel
orr    r2,r2,r2,lsl#16 ; copy the amount into byte 2 (blue channel)…
orr    r2,r2,r2,lsr#8  ; …and into byte 1 (green channel)
uqadd8 r0,r1,r2
mov    pc,lk

With a brightening of 40, we get this:

Original  : &00 DC BD FF :   0 220 189 255
Plus      : &00 28 28 28 :   0  40  40  40
Result    : &00 FF E5 FF :   0 255 229 255

The first two instructions copy the lowest byte of the brightening amount into bytes 1 and 2, which is then added to the colour word. And importantly, it saturates the resultant channel values — if it didn’t you’d get very odd results if it went over 255 and wrapped round.

To reduce brightness, just replace uqadd8 with uqsub8 — in this case, values are clamped to zero. And it’s a simple modification to allow different brightness modifiers for each channel, of course.
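
Here is a Python sketch of both adjustments, following the same steps as the assembler: spread the amount into the three channel bytes, then do a saturating add or subtract. The helper names (spread, brighten, darken) are invented for this example, and unlike the assembler the sketch masks the amount to one byte.

```python
SHIFTS = (0, 8, 16, 24)


def uqadd8(a: int, b: int) -> int:
    """Byte-wise add, each lane clamped to 255."""
    return sum(min(((a >> s) & 0xFF) + ((b >> s) & 0xFF), 0xFF) << s for s in SHIFTS)


def uqsub8(a: int, b: int) -> int:
    """Byte-wise subtract, each lane clamped to 0."""
    return sum(max(((a >> s) & 0xFF) - ((b >> s) & 0xFF), 0) << s for s in SHIFTS)


def spread(amount: int) -> int:
    """Copy the low byte into bytes 1 and 2, like the two orr instructions."""
    r2 = amount & 0xFF
    r2 |= r2 << 16  # byte 2 (blue)
    r2 |= r2 >> 8   # byte 1 (green)
    return r2


def brighten(pixel: int, amount: int) -> int:
    return uqadd8(pixel, spread(amount))


def darken(pixel: int, amount: int) -> int:
    return uqsub8(pixel, spread(amount))


print(f"Brighter: &{brighten(0x00DCBDFF, 40):08X}")  # &00FFE5FF
print(f"Darker  : &{darken(0x00DCBDFF, 40):08X}")    # &00B495D7
```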

To invert the colours of a pixel, each channel value is replaced by 255 minus that value, like this:

; Invert a pixel
; IN - r1 B% = colour word, in 00BbGgRr format
; OUT- r0 = inverted pixel
mvn    r0,#&ff000000
uqsub8 r0,r0,r1
mov    pc,lk

The mvn sets r0 to &00FFFFFF. Actually we don’t need SIMD to do this; using eor r0,r0,r1 in place of the uqsub8 would have the same effect.
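
The equivalence is easy to check with a Python model. The sketch below assumes the 00BbGgRr layout used above, with the top byte zero; if the top byte holds anything else, the two versions differ, because the saturating subtraction clamps that byte to zero while the exclusive-OR leaves it unchanged. The function names are mine.

```python
def uqsub8(a: int, b: int) -> int:
    """Byte-wise subtract, each lane clamped to 0."""
    return sum(max(((a >> s) & 0xFF) - ((b >> s) & 0xFF), 0) << s for s in (0, 8, 16, 24))


def invert_simd(pixel: int) -> int:
    return uqsub8(0x00FFFFFF, pixel)


def invert_eor(pixel: int) -> int:
    return 0x00FFFFFF ^ pixel


pixel = 0x00DCBDFF
print(f"uqsub8: &{invert_simd(pixel):08X}")  # &00234200
print(f"eor   : &{invert_eor(pixel):08X}")   # &00234200

# With a non-zero top byte the two are no longer the same.
print(f"uqsub8: &{invert_simd(0x80DCBDFF):08X}")  # &00234200
print(f"eor   : &{invert_eor(0x80DCBDFF):08X}")   # &80234200
```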

Another common graphics operation is interpolation — generating a pixel’s colour values from two others, often its neighbours. Given two pixels, the single SIMD instruction uhadd8 can do this:

; Interpolate between two pixels
; IN - r1 B% = first colour word, in 00BbGgRr format
;      r2 C% = second colour word, in 00BbGgRr format
; OUT- r0 = interpolated pixel
uhadd8 r0,r1,r2
mov    pc,lk

What uhadd8 does is add corresponding unsigned bytes, then halve each sum, so the result is the average of the channel values, like this:

Pixel 1   : &00 FF 7C F0 :   0 255 124 240
Pixel 2   : &00 01 24 3C :   0   1  36  60
Result    : &00 80 50 96 :   0 128  80 150

Note in particular the red channel in byte 0, which adds 240 and 60 giving 300. This can’t be stored in a byte, of course, but the addition stage is actually carried out with 9-bit accuracy so the halving gives the correct result, of 150.
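
A Python model makes the point about the extra bit concrete. The first function follows the description of uhadd8 given here; the second is a deliberately wrong variant, included only to show what would happen if each sum were cut to 8 bits before halving. Both names are invented for the example.

```python
def uhadd8(a: int, b: int) -> int:
    """Model of uhadd8: add each pair of bytes in full, then halve."""
    result = 0
    for shift in (0, 8, 16, 24):
        total = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)  # may need 9 bits
        result |= (total >> 1) << shift
    return result


def wrapped_then_halved(a: int, b: int) -> int:
    """Wrong on purpose: the sum is truncated to 8 bits before halving."""
    result = 0
    for shift in (0, 8, 16, 24):
        total = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        result |= (total >> 1) << shift
    return result


p1, p2 = 0x00FF7CF0, 0x0001243C
print(f"uhadd8 : &{uhadd8(p1, p2):08X}")               # &00805096
print(f"Wrapped: &{wrapped_then_halved(p1, p2):08X}")  # &00005016
```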

Summary

It should be apparent that there’s some consistency to the names of SIMD mnemonics. The numeric suffix of 8 or 16 gives the data unit size. If the mnemonic starts with U the instruction deals with unsigned values, and an S prefix denotes signed values (we haven’t used any here, but for example sadd16 is the signed version of uadd16, adding pairs of signed half-words). A Q in the mnemonic means it’s a saturating operation, and an H does a halving. The convention isn’t always followed: uqadd8 does a saturated unsigned byte addition, but the signed version is just qadd8, not sqadd8.

We’ve only scratched the surface of the SIMD instructions here, but it should be enough to get you started. I recommend getting an up-to-date StrongHelp Assembler manual, which summarises all the SIMD instructions available.

Footnotes

  1. “Reduced” can be taken to extremes — it’s possible to implement a general-purpose computer with only one instruction. The usual example is subleq, “subtract and branch if less than or equal to zero”. Dawn is a whole operating system written using this instruction.
  2. NEON is basically SIMD on steroids.

ROCoding #3: Ellipses and rings

Monday, February 23, 2026

The previous post in this series covered circular or radial blends and gradients. With a few simple modifications the code can be adapted to create elliptical and annular fills. Or eggs and doughnuts, if you prefer 😉.

Again, all these examples will be using various procedures from the previous posts.

RISC OS provides some graphics primitives to draw ellipses, and from BASIC this is:

ELLIPSE [FILL] centreX,centreY, width,height [, angle] 

We won’t be using this however, and we won’t be including the angle setting, which draws a rotated ellipse. Let’s not make it too complicated…
[Read more…]

ROCoding #2: Circles

Tuesday, January 20, 2026

The first post in this series covered filling a rectangle with blends and gradients. This time we’ll look at generating circular or radial blends and gradients, which is a bit more complicated. All these examples will be using various procedures from the previous post.

RISC OS provides graphics primitives to draw outline and filled circles. From BASIC:

[Image: ROC02circ0.webp]
PROCinit
SYS CT_SetGCOL%,&40dd4000
CIRCLE 64,192,60
CIRCLE FILL 192,64,60

The parameters are the centre coordinates and the radius, in OS units. As an aside, here’s the additive RGB triplet as used in the PhotoDesk manual, showing the complementary cyan, magenta and yellow colours. The circles are blended with the OR operation:

[Image: ROC02circ0rgb.webp]
PROCinit
SYS CT_SetGCOL%,red%
CIRCLE FILL 128,160,80
SYS CT_SetGCOL%,blue%,,,,1
CIRCLE FILL 170,88,80
SYS CT_SetGCOL%,green%,,,,1
CIRCLE FILL 92,88,80

So, can we use these primitives to create a radial blend? Here’s a first attempt:
[Read more…]

ROCoding #1: Rectangles

Friday, December 12, 2025

Coming back to RISC OS has been interesting. I’ve had to re-learn a number of things, like using the WIMP and, in particular, graphics programming. I hope this article will be the first of a series explaining my learning curve, and hopefully providing some useful programs. It’s fairly basic — in all senses! — but does assume some familiarity with RISC OS and BBC BASIC.

We’ll start with some things you can do when drawing rectangles. While writing the Solar application I wanted to provide some better backgrounds for the graphs. A plain background is easy, of course: just use RECTANGLE FILL with some appropriate colour, like this.

[Image: rect0.webp]
GCOL 0,&ff,&dd,&ff
RECTANGLE FILL 0,0,256

…which draws a 256×256 square at the graphics origin. More generally, RECTANGLE accepts both width and height, which can be negative.

First up, we’ll generate a blend between two colours, from the bottom to the top of the rectangle. We’re assuming a full-colour display here, which is necessary for displaying blends with any fidelity. The Solar application graphs are built up by redirecting all graphics output to a sprite, then letting the Wimp handle displaying them on screen; this takes care of any mismatch between the screen and graph colour depths.
[Read more…]

BASIC 5 vs BASIC64 [updated]

Saturday, November 29, 2025

[Updated 5 Dec 2025. Now includes BASIC FPA timings, and text updated. DIV is now included.]

While writing the Gradgrind program (qv) I wondered if it would benefit from running under BASIC64 (or VI), as opposed to BASIC 5 (V). So I ran up a few simple speed tests (and they are simple — don’t take the results as definitive benchmarks). Although the program runs acceptably fast on my setup (a 4té2, which is based on a Pi 4b), it uses a fair amount of real arithmetic. The chief difference between the two versions is that BASIC 5 stores reals in 5 bytes, while BASIC64 uses 8 bytes. BASIC64 itself has two variants, VFP using the vector floating point operations available on the Pi’s ARM processor, and FPA using software floating point.

To use each variant:

  • basic <file> Runs <file> using BASIC V. This is also the default setup for double-clicking a BASIC file.
  • basic64 <file> Runs <file> under BASIC VI, which will select the VFP variant if available, otherwise uses FPA.
  • basicvfp <file> Forces the VFP variant, if available.
  • basicfpa <file> Forces the FPA variant.

The program to generate these results is available from the Downloads page.

System details:
R-Comp’s 4té2 (based on the Raspberry Pi 4b) running RISC OS 5.31 (21-Jul-24) at 1.5GHz
BBC BASIC V 1.85 (03 Oct 2022)
BBC BASIC VI 1.85 (03 Oct 2022) VFP
BBC BASIC VI 1.85 (03 Oct 2022) FPA

Notes on the tests

  • All tests were run in single-tasking mode (not in a task window).
  • Each test (other than the WHILE/REPEAT ones) is enclosed in a FOR…NEXT loop, with the index variable running from 1 to 100,000,000. So they are run 100 million times. This was chosen so that an empty loop runs in about 1 second. Other than tests 3, 4 and 6, the index variable is an integer.
  • The description specifies what operation is inside the loop.
  • All timings are in seconds with centisecond resolution, other than the total time which is in minutes:seconds.
  • The “Assignment a%=long…” tests are done because I often use systematic prefixes for variable names, and I wondered if having many variable names with identical prefixes would slow things up. The testing program defines 300 variables called “VeryLongVariableNameWithIncrementingSuffix001%” to “VeryLongVariableNameWithIncrementingSuffix300%”.

Test                                 BASIC V  BASIC VI VFP  BASIC VI FPA
Empty FOR loop (int)                    0.97          1.02          0.98
Empty FOR loop (int, no spaces)         1.01          1.15          1.00
Empty FOR loop (real)                   3.09          1.11         65.07
Empty FOR loop (real, no spaces)        3.10          1.10         65.05
Empty FOR loop, NEXT I% (int)           3.02          3.04          2.97
Empty FOR loop, NEXT I (real)           5.12          3.76         67.05
Assignment A%=100                       4.17          4.13         23.61
Assignment a%=100                       4.14          4.14         30.69
Assignment a%=b% (100)                  4.39          4.39         23.63
Assignment a=0.5                        4.19          4.02         53.65
Assignment a=b (0.5)                    4.89          3.82         53.78
Assignment a%=Very…x001%                4.38          4.39         23.63
Assignment a%=Very…x300%                4.39          4.39         23.62
Integer maths a%=b%+c% (100+50)         6.62          6.63         26.16
Integer maths a%=b%-c% (100-50)         6.62          6.63         26.26
Integer maths a%=b%*c% (100*50)         6.78          6.78         26.28
Integer maths a%=b%/c% (100/50)        16.47          8.44        181.51
Integer maths a%=b%DIVc% (100DIV50)     7.96          7.62         27.55
Real maths a=b+c (0.5+0.2)              7.93          5.85        154.60
Real maths a=b-c (0.5-0.2)              8.43          5.88        135.51
Real maths a=b*c (0.5*0.2)              7.90          5.85        134.98
Real maths a=b/c (0.5/0.2)             17.88          6.69        171.36
Real maths a=b^c (0.5^0.2)             46.09         90.33       1651.64
Real maths a=SQRb (0.5)                15.96          5.40        103.40
Trig a=RADb (0.5)                       6.24          4.52         85.88
Trig a=SINb (0.5)                      19.28         46.44        768.88
Trig a=COSb (0.5)                      28.86         47.25        794.56
Trig a=TANb (0.5)                      26.89         74.45        757.27
WHILE loop (int, I%=I%+1)              10.89         10.80         31.46
WHILE loop (int, I%+=1)                 9.39          9.38          9.38
REPEAT loop (int, I%=I%+1)             11.58         11.16         32.17
REPEAT loop (int, I%+=1)                9.52          9.49          9.52
Total (m:s)                             5:18          6:50         92:43

Takeaways

  • BASIC VI FPA is slow. If you run the test program be prepared to cook and eat your dinner while it shuffles along. And possibly include a nap.
  • I’m puzzled why integer operations under BASIC VI FPA are so slow; I wasn’t expecting these to vary much. Addition, subtraction, multiplication and the DIV operator are about 4 times slower than BASIC V. Using / to divide is even worse, 11 times slower. And even simple integer assignments are about 6 times slower.
  • Note that using a real variable as the index in a FOR…NEXT loop in BASIC VI FPA is about twenty times slower than BASIC V (and nearly sixty times slower than BASIC VI VFP).
  • Don’t use variables after NEXT; it’s three times slower, and they’re very rarely needed. This can make a considerable difference for nested FOR loops.
  • I’ve done both “A%=” and “a%=” assignments to check if so-called ‘resident integer variables’ (A% to Z%) still give a speed benefit. But it seems there’s no advantage now to using them.
    Ancient history note: this feature dates back to the Acorn Atom (my first computer back in 1980 — I soldered it together!), when A-Z were the only variables you had (integer only); if you wanted more you had to use arrays (just AA to ZZ) or indirection. BBC BASIC evolved from Atom BASIC — indeed, you could get it on the Atom as an add-on board — and the resident integer variables on the BBC Micro’s original BASIC were a legacy.
  • The long variable name tests don’t appear to make much difference. But it turns out that both versions use a fairly sophisticated caching strategy for variable names, so this isn’t really a valid test. I’d be interested to know if the same strategy is used for procedure/function names (which I haven’t tested here).
  • Integer maths is pretty much identical between BASIC V and BASIC VI VFP, with the exception of division which is twice as fast in BASIC VI VFP.
  • Simple real maths is faster in BASIC VI VFP than BASIC V, by about 25-30%; again, division is twice as fast. It’s interesting that BASIC VI VFP’s real maths operations are slightly faster than their integer equivalents. Of course, different values may give a different result.
  • Square root — a common operation — is three times faster in BASIC VI VFP.
  • Avoid exponentiation if you can — it’s a very expensive operation, in all versions.
  • I was a bit surprised that BASIC VI VFP’s trig functions weren’t better. Although they are, of course, much more accurate.
  • Using += and -= to increment/decrement variables gives a small but useful speed increase, particularly under BASIC VI FPA.
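
The ratios behind these observations are easy to recompute from the table. Here is a small Python sketch using three rows copied from it; the helper name ratios is made up for illustration.

```python
def ratios(basic_v: float, vfp: float, fpa: float) -> tuple[float, float, float]:
    """Return (FPA / V, FPA / VFP, V / VFP) for one row of timings."""
    return fpa / basic_v, fpa / vfp, basic_v / vfp


# (BASIC V, BASIC VI VFP, BASIC VI FPA) timings in seconds, from the table.
rows = {
    "Empty FOR loop (real)": (3.09, 1.11, 65.07),
    "Integer maths a%=b%/c%": (16.47, 8.44, 181.51),
    "Real maths a=SQRb": (15.96, 5.40, 103.40),
}

for name, timings in rows.items():
    fpa_v, fpa_vfp, v_vfp = ratios(*timings)
    print(f"{name}: FPA/V {fpa_v:.1f}, FPA/VFP {fpa_vfp:.1f}, V/VFP {v_vfp:.1f}")
```

As with the timings themselves, these are rough indications for one machine and one set of operand values, not definitive figures.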

Conclusion

No, it wouldn’t really be worth Gradgrind using BASIC VI VFP. For much of the real arithmetic involved I’ve used pre-calculation, and the speed increase would be minimal.

Downloads

Download the program from the Downloads page.