ROCoding #4: ARM Wrestling
[Edit 2026-04-18: Fixed incorrect byte ordering of colour channels.]
It’s been 25 years — blimey! — since I did any serious ARM coding. The last was fixing some bugs in my Lisp interpreter in 2001, after which my Iyonix died and I moved over to Linux.
Back with RISC OS now, and in the intervening years the ARM processor has had some substantial changes. Originally it was the Acorn RISC Machine, of course, and it first appeared in the Acorn Archimedes computer in 1987. It was a genuine breakthrough at the time, a custom-designed (by Sophie Wilson et al) 32-bit processor running at 8MHz. I’m now using a 4té2, a repackaged Raspberry Pi 4b containing an ARM Cortex-A72 running at 1.8GHz: a clock speed over 200 times faster.
And while the original ARM chips were indeed Reduced Instruction Set Computers, with only about 25 instructions[1], these days it’s something of a misnomer. So what’s been added? SIMD and NEON, mostly. This article is a simple introduction to using some SIMD instructions; we’ll cover NEON[2] later.
The Dwarf Mini smart scope — see other posts — saves image files as FITS files. This is a standardised format for storing scientific and astronomical images and has a few peculiarities, as I found when I started investigating it.
First, the pixel data is encoded as a stream of signed 16-bit twos-complement integers (other numeric formats are possible, but that’s what I’m dealing with here). Why signed? I don’t know, but that’s the only format for 16-bit integers in the FITS specification. In order to get unsigned values we need to add a constant to each number, which is specified in the FITS header — in this case 32768. So we’re mapping the range -32768…32767 to 0…65535.
Second, FITS stores numbers the wrong way round (with respect to ARM processors, anyway), with the most significant byte first: big-endian, not little-endian.
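As a cross-check away from the ARM, both peculiarities can be sketched in a few lines of Python. This is only an illustration under assumptions: decode_fits_pixels is a made-up name, and I'm assuming the 32768 offset is the one the FITS header carries in its BZERO keyword.

```python
import struct

BZERO = 32768  # assumed: the offset quoted in the text, as read from the FITS header


def decode_fits_pixels(raw: bytes, bzero: int = BZERO) -> list:
    """Decode big-endian signed 16-bit samples and shift them to unsigned.

    Hypothetical helper for illustration, not part of any FITS library.
    """
    count = len(raw) // 2
    # '>' selects big-endian, 'h' is a signed 16-bit integer
    signed = struct.unpack(f">{count}h", raw[: count * 2])
    return [value + bzero for value in signed]


# Three samples: -32768, 0 and 32767 as stored in the file
print(decode_fits_pixels(bytes([0x80, 0x00, 0x00, 0x00, 0x7F, 0xFF])))
```

The three samples come out as 0, 32768 and 65535, which is the mapping of -32768…32767 onto 0…65535 described above.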
It’s quite possible to cope with this in BASIC, though a bit of a faff. And slow. Which was what prompted me to investigate the new (to me) ARM features available. And the good people who look after RISC OS have ensured that BBC BASIC’s assembler has been updated to handle the new opcodes. To emphasise: in order to run these code fragments — the simple surrounding BASIC code to set up the assembler is left as an exercise — you’ll need to be running a recent RISC OS on a recent computer, with a recent BASIC. In BASIC, type HELP [ to get a list of ARM instructions — if it includes a load of interesting-sounding assembler mnemonics like UXTAB16, SXTH and UQASX you’re good to go.
SIMD
The ARM processor has 32-bit, 4-byte registers, and instructions usually operate on the whole 32-bit word. So add r0,r1,r2 adds r1 and r2 together, and puts the result in r0. SIMD stands for Single Instruction, Multiple Data, and lets a single instruction operate on several parts of a word independently. So, for example, you can perform four 8-bit additions in a single operation. It’s a kind of parallel processing, which can be very useful for dealing with images: remember that a full-colour pixel is a single word containing red, green and blue components in three of the four bytes.
To get the FITS data in a usable form, we first need to reverse the byte order of each 16-bit half-word. If we read a whole word of FITS data into register r1 we’ll have the values for two pixels, and the ARM instruction rev16 will do exactly what we need:
.hreverse
; half-word byte reversal
; IN  - r1 = pair of half-words
; OUT - r0 = half-word bytes reversed
  rev16 r0,r1
  mov pc,lk
Calling machine code from BASIC initialises registers r0 to r7 with whatever is in the variables A% to H%. So we set the BASIC variable B% to some value, &03B2C1D0 here, and call this code with USR, which returns r0. The single instruction swaps the byte order of each of the 16-bit words in r1, and puts the result in r0 which is then returned. For example:
Original: &03B2 C1D0
Reversed: &B203 D0C1
I’ve separated the 16-bit half-words with a space to make it clearer. Actually, ARM processors can now operate with big-endian data, but changing that globally makes me very nervous…
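If you don't have a suitable machine to hand, the effect of rev16 is easy to model. The Python below is a sketch of what the instruction does to a 32-bit word, not how the processor implements it; the function simply borrows the mnemonic's name.

```python
def rev16(word: int) -> int:
    """Swap the two bytes inside each 16-bit half of a 32-bit word."""
    word &= 0xFFFFFFFF
    high_bytes_down = (word >> 8) & 0x00FF00FF  # bytes 3 and 1 move down
    low_bytes_up = (word << 8) & 0xFF00FF00     # bytes 2 and 0 move up
    return high_bytes_down | low_bytes_up


# The example value from the text
print(f"&{rev16(0x03B2C1D0):08X}")  # prints &B203D0C1
```

Applying it twice gets you back where you started, which is a handy sanity check.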
Now we need to add in the correction to get unsigned numbers; in other words, we normalise the data. Again, ARM can, er, lend a hand: the uadd16 instruction adds corresponding unsigned half-words together:
.hnormalise
; Half-word normalisation
; IN  - r1 = pair of half-words
;       r2 = offsets to normalise
; OUT - r0 = half-words added to the offsets
  uadd16 r0,r1,r2
  mov pc,lk
Each half-word in r1 will be added to the corresponding half-word in r2. So with various values in B%, and &80008000 in C% (32768 in hex, in each half-word) we get:
Original  : &8000 8020
Normalised: &0000 0020
Original  : &7FFF 1234
Normalised: &FFFF 9234
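The same sums can be checked with a small Python model of uadd16. It's a sketch of the arithmetic only (two independent 16-bit additions, each wrapping at 65536), with the function named after the mnemonic for convenience.

```python
def uadd16(a: int, b: int) -> int:
    """Add corresponding unsigned half-words; each lane wraps independently."""
    low = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    high = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (high << 16) | low


OFFSETS = 0x80008000  # 32768 in each half-word, as passed in C%

for original in (0x80008020, 0x7FFF1234):
    print(f"&{original:08X} -> &{uadd16(original, OFFSETS):08X}")
```

Note that the carry out of the lower half-word is simply thrown away rather than being added into the upper one.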
Of course, you’d normally amalgamate these routines (or even define a macro — it’s only two instructions!):
.hrevnormalise
; Half-word byte-order correction and normalisation
; IN  - r1 = pair of half-words
;       r2 = offsets to normalise
; OUT - r0 = half-words reversed, and added to the offsets
  rev16 r0,r1
  uadd16 r0,r0,r2
  mov pc,lk
What’s very important to realise is that the operations on each half-word are entirely separate. If the sum exceeds the maximum value storable in the lower 16-bit half-word (65535), it won’t affect the bits in the other half-word; it just wraps round. Also note that these SIMD instructions don’t usually affect the flags (for obvious reasons: which half-word?), so you can’t use the S suffix with them. You can however use all the condition codes.
It’s interesting to try to code this using old-style ARM instructions. It’s not difficult, but involves a lot of bit-twiddling, rotating, masking and separate treatment of each half-word. And needs a lot more instructions!
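To give a flavour of that bit-twiddling without writing out the assembler, here is a rough Python sketch that restricts itself to whole-word shifts, masks and adds, the sort of operations the original instruction set offers. The name rev_normalise_plain and the step-by-step breakdown are my own illustration, not a translation of any particular old-style routine.

```python
MASK32 = 0xFFFFFFFF


def rev_normalise_plain(word: int, offset: int = 0x8000) -> int:
    """Byte-swap each half-word and add an offset, using whole-word ops only."""
    word &= MASK32
    # Swap the bytes in each half-word with two shifts and two masks
    swapped = ((word >> 8) & 0x00FF00FF) | ((word << 8) & 0xFF00FF00)
    # Treat each half-word separately so a carry cannot leak across
    low = ((swapped & 0xFFFF) + offset) & 0xFFFF
    high = ((swapped >> 16) + offset) & 0xFFFF
    return (high << 16) | low


# Bytes 80 00 7F FF in the file, loaded as one little-endian word
print(f"&{rev_normalise_plain(0xFF7F0080):08X}")  # prints &FFFF0000
```

Each line of the function corresponds to at least one instruction, and usually more, which is why the two-instruction SIMD version is so appealing.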
Saturation
I said above that results over the maximum storable will wrap round, so if you do a single byte addition of 230+70, you’ll get 44 (300-256). The uadd8 instruction adds the corresponding bytes in a pair of registers, like this:
.add_bytes
; uadd8 = add corresponding unsigned bytes
; IN  - r1 B% = 4 byte values
;       r2 C% = 4 bytes to add
; OUT - r0 = result
  uadd8 r0,r1,r2
  mov pc,lk
Adding &AAB0FFE6 to &442F0146 will give this, shown in hex and decimal:
Original : &AA B0 FF E6 : 170 176 255 230
Plus     : &44 2F 01 46 :  68  47   1  70
Result   : &EE DF 00 2C : 238 223   0  44
Watch what happens with the usual add operation on the same inputs, add r0,r1,r2:
Word add : &EE E0 01 2C : 238 224 1 44
The additions in the lowest two bytes have overflowed, causing carries into the next bytes.
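The difference between the two additions can be reproduced in Python. This is a sketch of the arithmetic only; uadd8 and word_add are just convenient names for the byte-wise and whole-word versions.

```python
def uadd8(a: int, b: int) -> int:
    """Add corresponding unsigned bytes; each byte wraps round at 256."""
    result = 0
    for shift in (0, 8, 16, 24):
        byte_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (byte_sum & 0xFF) << shift  # discard the carry
    return result


def word_add(a: int, b: int) -> int:
    """Ordinary 32-bit add: carries ripple from byte to byte."""
    return (a + b) & 0xFFFFFFFF


a, b = 0xAAB0FFE6, 0x442F0146
print(f"uadd8: &{uadd8(a, b):08X}")     # prints uadd8: &EEDF002C
print(f"add  : &{word_add(a, b):08X}")  # prints add  : &EEE0012C
```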
The ARM offers saturated operations, where results are clamped to the minimum and maximum values storable in the bit-width. The uqadd8 instruction does unsigned saturated byte addition, like this:
; uqadd8 = unsigned add bytes, saturating
; IN  - r1 B% = word
;       r2 C% = bytes to add
; OUT - r0 = result
  uqadd8 r0,r1,r2
  mov pc,lk
With the same inputs as above, we get this:
Original : &AA B0 FF E6 : 170 176 255 230
Plus     : &44 2F 01 46 :  68  47   1  70
Result   : &EE DF FF FF : 238 223 255 255
The overflow of the lowest two additions has been capped to 255.
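In a Python model the only change from the wrapping version is that each byte sum is clamped rather than masked. Again this is just a sketch of the arithmetic, with the function named after the mnemonic.

```python
def uqadd8(a: int, b: int) -> int:
    """Add corresponding unsigned bytes, clamping each result to 255."""
    result = 0
    for shift in (0, 8, 16, 24):
        byte_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= min(byte_sum, 0xFF) << shift  # saturate instead of wrapping
    return result


print(f"&{uqadd8(0xAAB0FFE6, 0x442F0146):08X}")  # prints &EEDFFFFF
```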
One application of this would be simple brightness adjustment of an image. Full-colour pixels in RISC OS sprites are stored in a single word as 00BbGgRr, representing the blue, green and red channel values. (Though RISC OS can handle the 00RrGgBb format - PhotoDesk offers both options when saving.) To brighten an image, you just increase each channel byte like this:
; Brighten a pixel
; IN  - r1 B% = colour word, in 00BbGgRr format
;       r2 C% = amount to brighten by (0..255)
; OUT - r0 = brightened pixel
  orr r2,r2,r2,lsl#16 ; copy the amount into byte 2 (blue channel)
  orr r2,r2,r2,lsr#8  ; and from there into byte 1 (green channel)
  uqadd8 r0,r1,r2
  mov pc,lk
With a brightening of 40, we get this:
Original : &00 DC BD FF : 0 220 189 255
Plus     : &00 28 28 28 : 0  40  40  40
Result   : &00 FF E5 FF : 0 255 229 255
The first two instructions copy the lowest byte of the brightening amount into bytes 1 and 2, and the resulting word is then added to the colour word. Importantly, the addition saturates each channel value; if it didn’t, you’d get very odd results whenever a channel went over 255 and wrapped round.
To reduce brightness, just replace uqadd8 with uqsub8 — in this case, values are clamped to zero. And it’s a simple modification to allow different brightness modifiers for each channel, of course.
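Here is a Python sketch of both directions, including the trick of spreading the amount across the three channel bytes. The function names are invented for illustration, and the initial mask to 0..255 is my addition; the assembler version relies on the caller passing a sensible value.

```python
def spread_to_channels(amount: int) -> int:
    """Copy a 0..255 amount into bytes 0, 1 and 2, leaving byte 3 clear."""
    amount &= 0xFF           # assumption: guard added for this sketch only
    amount |= amount << 16   # byte 0 copied to byte 2
    amount |= amount >> 8    # byte 2 copied down to byte 1
    return amount


def _bytewise(a: int, b: int, op) -> int:
    """Apply op to each pair of corresponding bytes and repack the word."""
    result = 0
    for shift in (0, 8, 16, 24):
        result |= op((a >> shift) & 0xFF, (b >> shift) & 0xFF) << shift
    return result


def brighten(pixel: int, amount: int) -> int:
    """Saturating add, as uqadd8 does: channels stop at 255."""
    return _bytewise(pixel, spread_to_channels(amount), lambda x, y: min(x + y, 255))


def darken(pixel: int, amount: int) -> int:
    """Saturating subtract, as uqsub8 does: channels stop at 0."""
    return _bytewise(pixel, spread_to_channels(amount), lambda x, y: max(x - y, 0))


print(f"&{brighten(0x00DCBDFF, 40):08X}")  # prints &00FFE5FF
print(f"&{darken(0x00DCBDFF, 40):08X}")    # prints &00B495D7
```

For per-channel adjustment you would skip spread_to_channels and pass a word with a different amount in each byte.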
To invert the colours of a pixel, each channel is changed to 255 minus its current value, like this:
; Invert a pixel
; IN  - r1 B% = colour word, in 00BbGgRr format
; OUT - r0 = inverted pixel
  mvn r0,#&ff000000
  uqsub8 r0,r0,r1
  mov pc,lk
The mvn sets r0 to &00FFFFFF. Actually we don’t need SIMD to do this; using eor r0,r0,r1 would have the same effect, provided the top byte of the colour word is zero (as it is in the 00BbGgRr format).
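The equivalence is easy to confirm with a Python sketch. The two function names are mine; the second example uses a made-up pixel with a non-zero top byte to show the one case where the approaches part company.

```python
def uqsub8(a: int, b: int) -> int:
    """Subtract corresponding unsigned bytes, clamping each result at 0."""
    result = 0
    for shift in (0, 8, 16, 24):
        diff = ((a >> shift) & 0xFF) - ((b >> shift) & 0xFF)
        result |= max(diff, 0) << shift
    return result


def invert_simd(pixel: int) -> int:
    """255 minus each channel, via saturating byte subtraction."""
    return uqsub8(0x00FFFFFF, pixel)


def invert_eor(pixel: int) -> int:
    """The same inversion with a single exclusive-or."""
    return (0x00FFFFFF ^ pixel) & 0xFFFFFFFF


print(f"&{invert_simd(0x00DCBDFF):08X} &{invert_eor(0x00DCBDFF):08X}")
# With a non-zero top byte the two versions differ in byte 3
print(f"&{invert_simd(0xFFDCBDFF):08X} &{invert_eor(0xFFDCBDFF):08X}")
```

The first line prints the same value twice; in the second, the subtraction clamps the top byte to zero while the exclusive-or passes it through untouched.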
Another common graphics operation is interpolation — generating a pixel’s colour values from two others, often its neighbours. Given two pixels, the single SIMD instruction uhadd8 can do this:
; Interpolate between two pixels
; IN  - r1 B% = first colour word, in 00BbGgRr format
;       r2 C% = second colour word, in 00BbGgRr format
; OUT - r0 = interpolated pixel
  uhadd8 r0,r1,r2
  mov pc,lk
What uhadd8 does is add corresponding unsigned bytes, then halve the sum, so the result is the average of the channel values, like this:
Pixel 1 : &00 FF 7C F0 : 0 255 124 240
Pixel 2 : &00 01 24 3C : 0   1  36  60
Result  : &00 80 50 96 : 0 128  80 150
Note in particular the red channel in byte 0, which adds 240 and 60 giving 300. This can’t be stored in a byte, of course, but the addition stage is actually carried out with 9-bit accuracy so the halving gives the correct result, of 150.
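A Python model makes the 9-bit intermediate explicit. This is a sketch of the arithmetic only, with the function named after the mnemonic; Python's integers are unbounded, so the comment marks where the extra bit matters on the real thing.

```python
def uhadd8(a: int, b: int) -> int:
    """Add corresponding unsigned bytes, then halve each sum (rounding down)."""
    result = 0
    for shift in (0, 8, 16, 24):
        byte_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)  # up to 510: needs 9 bits
        result |= (byte_sum >> 1) << shift  # halving brings it back into 8 bits
    return result


print(f"&{uhadd8(0x00FF7CF0, 0x0001243C):08X}")  # prints &00805096
```

Odd sums lose their half, so averaging 1 and 2 gives 1 rather than 2.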
Summary
It should be apparent that there’s some consistency to the names of SIMD mnemonics. The numeric suffix of 8 or 16 gives the data unit size. If the mnemonic starts with U the instruction deals with unsigned values, and an S prefix denotes signed values (we haven’t used any here, but for example sadd16 is the signed version of uadd16, adding pairs of signed half-words). A Q in the mnemonic means it’s a saturating operation, and an H does a halving. The convention isn’t always followed: uqadd8 does a saturated unsigned byte addition, but the signed version is just qadd8, not sqadd8.
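To see what the signed flavour of a saturating operation means in practice, here is a Python sketch of qadd8. It treats each byte as a twos-complement value in the range -128…127 and clamps to that range; the helper to_signed8 is a made-up name for this illustration.

```python
def to_signed8(byte: int) -> int:
    """Interpret an 8-bit pattern as a twos-complement signed value."""
    return byte - 256 if byte & 0x80 else byte


def qadd8(a: int, b: int) -> int:
    """Add corresponding signed bytes, clamping each result to -128..127."""
    result = 0
    for shift in (0, 8, 16, 24):
        total = to_signed8((a >> shift) & 0xFF) + to_signed8((b >> shift) & 0xFF)
        total = max(-128, min(127, total))
        result |= (total & 0xFF) << shift  # repack as an 8-bit pattern
    return result


# Bytes: 127+1, -128+-128, -1+-1, 16+16
print(f"&{qadd8(0x7F80FF10, 0x0180FF10):08X}")  # prints &7F80FE20
```

The top two bytes hit the signed limits of 127 and -128 respectively, whereas uqadd8 on the same inputs would treat &80 and &FF as large positive numbers.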
We’ve only scratched the surface of the SIMD instructions here, but it should be enough to get you started. I recommend getting an up-to-date StrongHelp Assembler manual, which summarises all the SIMD instructions available.
Footnotes
[1] “Reduced” can be taken to extremes: it’s possible to implement a general-purpose computer with only one instruction. The usual example is subleq, “subtract and branch if less than or equal to zero”. Dawn is a whole operating system written using this instruction.
[2] NEON is basically SIMD on steroids.