Here's an attempt at getting quite a lot done in not much space. It's a playable pong game in 512 bytes.
It's not as tightly packed as it could be. I squashed it quite a bit after getting the ball bouncing around, once I'd figured out a few things about ARM assembly. But I got home at nearly 2am today after a wedding celebration and hacked in the bat and game logic in the last hour, using up about 124 bytes, and I haven't really condensed that code much yet.
Tricks that helped
- Put everything in .text, rewrite the ELF headers to make .text writable, and strip redundant stuff out of the ELF file (now automated, I should upload the stripper too I guess)
- Learn what the S suffix does (it makes an instruction update the condition flags) - lots of CMP instructions become redundant if you use it well. I knocked off over 100 bytes last night, mostly through this stuff, combined with...
- Use the conditional execution suffixes instead of branches when you can. My ball move and bounce code in the X direction is only three instructions. (Y is more complicated because of the bat.) They often seem to do exactly what you want them to do, and make a lot of branches redundant.
- Use the old libc4 mmap syscall, not the new one. You can then reuse the mmap parameter block for multiple mmaps. I mmap the same amount of memory for both GPIO registers and video memory - it doesn't do any harm.
- For graphics rendering, the pre- and post-increment indirect addressing modes are really handy. Aligning objects on word boundaries also lets you draw them with fewer instructions. With pre-indexed addressing you can only increment the pointer by the same amount you offset the access by, but with plain indirect addressing followed by a post-increment you can increment it by whatever you want, which is handy.
- Don't use a stride of 320! Use 512. Multiplies are now plain shifts, which means you can do a single add instruction to combine x and y coordinates. It costs you one extra bit, which might harm your immediate constant loading, but only if you're loading funny alignments, which is worth avoiding anyway.
- Get creative with data ordering. "/dev/mem" needs to be null-terminated, but surely you have some other variable which has its first byte zero. Failing that, at least use the next three bytes for something useful, or they'll likely get swallowed up by padding to word boundaries. I did similar things elsewhere too - my GPIO register setting iterates over a block, with one word per register to set, containing register and value. It's terminated by zero, but that zero word is also the first word of the mmap parameter block.
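To make the stride tip concrete, here's a rough sketch in C rather than ARM assembly, so it's not the game's actual code. The function names and the 320-pixel visible width are assumptions for illustration. The point is just that with a power-of-two stride, the pixel offset is a shift plus an add.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 320 visible pixels per row, padded out to 512 in memory. */
#define STRIDE_SHIFT 9u                 /* 1 << 9 == 512 */
#define STRIDE       (1u << STRIDE_SHIFT)

/* With a stride of 320 the offset needs a real multiply. */
static uint32_t offset_320(uint32_t x, uint32_t y)
{
    return y * 320u + x;
}

/* With a stride of 512 it is a shift and an add. On ARM the shift can be
 * folded into the add, something like: add r0, r1, r2, lsl #9 */
static uint32_t offset_512(uint32_t x, uint32_t y)
{
    assert(x < STRIDE);                 /* x must stay inside one row */
    return (y << STRIDE_SHIFT) + x;
}
```

For example, `offset_512(10, 3)` gives 3 * 512 + 10 = 1546, where `offset_320(10, 3)` gives 970. The price is the unused 192 pixels at the end of every row.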
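The shared-terminator trick from the data ordering tip can be sketched like this, again in C and with made-up values. The register/value packing, the length and the fd are all hypothetical. The only things taken from the post are that the GPIO list ends with a zero word and that the same word is the first field of the old-style mmap parameter block (addr, len, prot, flags, fd, offset), where an addr of zero lets the kernel pick the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All values below are invented for illustration. */
static const uint32_t data_block[] = {
    0x00040001u, /* GPIO entry: hypothetical register/value packing        */
    0x001c0010u, /* GPIO entry: hypothetical register/value packing        */
    0x00000000u, /* ends the GPIO list AND is mmap addr (0 = kernel picks) */
    0x00100000u, /* mmap len (made up)                                     */
    0x00000003u, /* mmap prot: PROT_READ | PROT_WRITE                      */
    0x00000001u, /* mmap flags: MAP_SHARED                                 */
    0x00000003u, /* mmap fd: placeholder, would be filled in at runtime    */
    0x00000000u, /* mmap offset: placeholder                               */
};

/* Walk the GPIO entries and return a pointer to the zero word that ends
 * them, which is also where the mmap parameter block starts. */
static const uint32_t *skip_gpio_entries(const uint32_t *p, size_t *count)
{
    size_t n = 0;
    assert(p != NULL);
    while (*p != 0) {
        n++;
        p++;
    }
    if (count != NULL)
        *count = n;
    return p;
}
```

Calling `skip_gpio_entries(data_block, &n)` leaves `n` at 2 and returns a pointer to `data_block[2]`, so one zero word does both jobs and saves four bytes.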
Possible further optimisations
- The bat logic and game logic (lose condition) aren't optimised; they could probably get smaller
- There's some exit code there which never gets reached (2 instructions)
- My ELF patcher doesn't overlap the program header with the ELF header, which is normally possible on this architecture at least - that would save another 2 instructions (8 bytes)
- I don't use the compression hack - I don't feel comfortable with it