Author Topic: Assembly Programmers - Help Axe Optimize!  (Read 136422 times)


Offline Munchor

Re: Assembly Programmers - Help Axe Optimize!
« Reply #135 on: January 06, 2011, 10:11:37 am »
By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
* happybobjr loves runner

And so does Scout.

Offline DJ Omnimaga

« Reply #136 on: January 07, 2011, 12:20:48 am »
Faster buffer inversion routine. 9951 cycles saved.
It's over 9000!!!!
What?!? O.O
9000?
 <_< Yea, I know... I had to...
But seriously dude, all those optimizations are awesome!  ;D
Lol I just actually noticed that ;D

Offline Runer112

« Reply #137 on: January 09, 2011, 01:00:51 pm »
Oh damn, you know what? Now I remember why I had the conditional return in the middle of the sprite rotating routines, Quigibo. Without it, the routines would return vx_SptBuff+8 in hl. Oops... But instead of re-implementing the conditional return, here's the better fix:

Code:
p_RotC:
.db __RotCEnd-1-$
ex de,hl
ld c,8
__RotCLoop1:
ld hl,vx_SptBuff+8
ld b,8
ld a,(de)
__RotCLoop2:
dec l
rra
rr (hl)
djnz __RotCLoop2
inc de
dec c
jr nz,__RotCLoop1
ret
__RotCEnd:

p_RotCC:
.db __RotCCEnd-1-$
ex de,hl
ld c,8
__RotCCLoop1:
ld hl,vx_SptBuff+8
ld b,8
ld a,(de)
__RotCCLoop2:
dec l
rla
rl (hl)
djnz __RotCCLoop2
inc de
dec c
jr nz,__RotCCLoop1
ret
__RotCCEnd:
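
For anyone who wants to check the fix without a calculator, here is a rough Python model of the two routines. It is only a sketch, not Axe's actual source: the list `buf` stands in for vx_SptBuff, `l` is the buffer offset held in hl, and the function name `rotate` is made up for illustration. Because the inner loop reloads hl with vx_SptBuff+8 and then decrements it eight times, the offset finishes at 0, which is the value left in hl on return.

```python
def rotate(src, clockwise=True):
    """Model p_RotC (clockwise=True) or p_RotCC on an 8-byte sprite.

    Returns (buf, l): the rotated sprite and the final buffer offset.
    """
    buf = [0] * 8              # stands in for vx_SptBuff
    carry = 0                  # carry flag state; its initial value never reaches buf
    l = 8
    for a in src:              # outer loop, c = 8 passes, one source row each
        l = 8                  # ld hl,vx_SptBuff+8
        for _ in range(8):     # inner loop, b = 8
            l -= 1             # dec l
            if clockwise:
                # rra: rotate a right through carry
                a, carry = (a >> 1) | (carry << 7), a & 1
                # rr (hl): rotate the buffer byte right through carry
                buf[l], carry = (buf[l] >> 1) | (carry << 7), buf[l] & 1
            else:
                # rla: rotate a left through carry
                a, carry = ((a << 1) & 0xFF) | carry, a >> 7
                # rl (hl): rotate the buffer byte left through carry
                buf[l], carry = ((buf[l] << 1) & 0xFF) | carry, buf[l] >> 7
    return buf, l
```

A sprite with only its top-left pixel set ends up with that pixel in the top-right corner after the clockwise routine and in the bottom-left corner after the counter-clockwise one, and running one after the other gives back the original sprite.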



EDIT: And as a side note, would it be possible to reformat DS<() so that the variable is reinitialized to its maximum value at the End? That way, 3 bytes could be saved by having both the zero and not zero conditions using the same store command. For example:

Code:
ld hl,(var)
dec hl
ld a,h
or l
jp nz,DS_End
;Code inside statement goes here
ld hl,max
DS_End:
ld (var),hl
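
In case the control flow is easier to read in a high-level form, here is a small Python sketch of one pass through that layout. The name `ds_step` and the returned flag are hypothetical, only there to make the behaviour visible; the point is that both paths fall through to the same single store.

```python
def ds_step(value, maximum):
    """One pass of the proposed DS<() layout; returns (new_value, body_ran)."""
    value = (value - 1) & 0xFFFF   # ld hl,(var) / dec hl, with 16-bit wraparound
    body_ran = value == 0          # ld a,h / or l / jp nz,DS_End
    if body_ran:
        # ...the code inside the DS<() statement would run here...
        value = maximum            # ld hl,max
    return value, body_ran         # DS_End: ld (var),hl (shared by both paths)
```

For example, `ds_step(1, 3)` runs the body and stores 3 again, while `ds_step(3, 3)` just stores 2.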
« Last Edit: January 09, 2011, 04:08:20 pm by Runer112 »

Offline calc84maniac

« Reply #138 on: January 10, 2011, 05:35:19 pm »
Quigibo, you could probably special-case const->{expr} stores, which should save a fair amount of size and speed:
Code:
;const->{expr}
;Evaluate expr here
ld (hl),const

;const->{expr}r
;Evaluate expr here
ld (hl),const & $FF
inc hl
ld (hl),const >> 8

;const->{expr}rr
;Evaluate expr here
ld (hl),const >> 8
inc hl
ld (hl),const & $FF

These optimizations would still be compatible with code in earlier Axe versions because HL ends up exactly as it used to.

Edit:
These extra optimizations are also possible for storing 0:
Code:
;0->{expr}r or 0->{expr}rr
;Evaluate expr here
xor a
ld (hl),a
inc hl
ld (hl),a
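
To make the byte order in the three cases explicit, here is a tiny Python sketch (the helper name `store_bytes` and the mode strings are made up for illustration). It lists the bytes written starting at the address in hl: one byte for the plain store, low byte first for the r form, high byte first for the rr form.

```python
def store_bytes(const, mode=""):
    """Bytes written, in address order, for const->{expr}, {expr}r or {expr}rr."""
    const &= 0xFFFF
    lo, hi = const & 0xFF, const >> 8
    if mode == "":
        return [lo]        # ld (hl),const
    if mode == "r":
        return [lo, hi]    # little-endian: const & $FF, then const >> 8
    if mode == "rr":
        return [hi, lo]    # big-endian: const >> 8, then const & $FF
    raise ValueError("mode must be '', 'r' or 'rr'")
```

With a constant of 0, both 16-bit forms write the same two zero bytes, which is why a single xor a version can serve both.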
« Last Edit: January 10, 2011, 05:39:47 pm by calc84maniac »
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Offline Runer112

« Reply #139 on: February 06, 2011, 08:45:09 pm »
:-[ It looks like yet another error has been discovered with my attempts to optimize things. The nibble retrieval routines and the nibble storage routine that I posted treat low and high nibbles in opposite ways. I'm pretty sure that the nibble retrieval routines are backwards and that the conditional jr c jumps should be changed to jr nc.

Offline squidgetx

« Reply #140 on: February 06, 2011, 08:47:17 pm »
So would changing this make the new nibble routines the opposite of the ones found in 0.4.6, or the same?
« Last Edit: February 06, 2011, 08:54:03 pm by squidgetx »

Offline Builderboy

« Reply #141 on: February 07, 2011, 01:11:13 am »
Nice catch, can't wait for the new version :)

Offline squidgetx

« Reply #142 on: February 14, 2011, 07:17:34 am »
Could this possibly be auto-optimized:

pxl-Test(CONST1,CONST2)

to

{CONST2*12+(CONST1/8)+L6}re(CONST1^8) (with all the math precalculated at parse time, of course)? It saves more than 10 bytes and 200 cycles.
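
Here is roughly what the parser would have to precompute, sketched in Python. It assumes the usual 96x64 buffer with 12 bytes per row and that the bit check counts bit 0 as the leftmost (most significant) bit, which is what the CONST1^8 term implies. The function names are hypothetical.

```python
ROW_BYTES = 12  # 96 pixels per row / 8 pixels per byte

def pxl_test_constants(x, y):
    """Return (byte offset from L6, Axe bit index, mask) for pixel (x, y)."""
    offset = y * ROW_BYTES + x // 8   # CONST2*12+(CONST1/8)
    bit = x % 8                       # CONST1^8 (^ is modulo in Axe)
    mask = 0x80 >> bit                # bit 0 = leftmost pixel = most significant bit
    return offset, bit, mask

def pxl_test(buffer, x, y):
    """Reference pixel test on a 768-byte buffer; returns 0 or 1."""
    offset, _, mask = pxl_test_constants(x, y)
    return 1 if buffer[offset] & mask else 0
```

So for a constant pixel, all that is left at run time is one load from a fixed address and one bit check.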

Offline Runer112

« Reply #143 on: February 15, 2011, 04:01:39 pm »
Some improvements to MemKit! :P


Next(): 2 bytes and a few cycles saved. Also, isn't the end-of-VAT check in the wrong place? I could be wrong because my VAT experience isn't too great, but because this routine checks for the end of the VAT at the start, wouldn't this command advance the VAT pointer to the end of the VAT and not recognize it as the end until the next Next()? This would cause problems with programs reading garbage VAT data for the last "entry." If I'm right about this (which may not be the case), the third block of code I posted should hopefully recognize the end of the VAT as soon as it hits it and never advance the VAT pointer to point to the end.

Code: (Original code: 26 bytes, 152/66 cycles)

 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    e,(hl)
 inc   e
 xor   a
 ld    d,a
 sbc   hl,de
 ld    (axv_X1t),hl
 ret
 
   
Code: (Optimized code: 24 bytes, 144/66 cycles)

 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    (axv_X1t),hl
 ret
 
   
Code: (Optimized (and fixed?) code: 24 bytes, 144/113 cycles)

 ld    hl,(axv_X1t)
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    (axv_X1t),hl
 ret
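
The 2 bytes come from replacing inc e / xor a / ld d,a / sbc hl,de with cpl / ld e,a / add hl,de. Since d still holds $FF from ld de,-6, complementing the length byte turns de into -(length+1). Here is a quick Python check of that arithmetic, as a sketch only: the function names are made up, and it assumes name lengths of 1 to 8.

```python
def advance_original(ptr, name_len):
    """Pointer arithmetic of the original Next(): subtract 6, then length+1."""
    hl = (ptr - 6) & 0xFFFF            # ld de,-6 / add hl,de
    de = (name_len + 1) & 0xFF         # ld e,(hl) / inc e / xor a / ld d,a
    return (hl - de) & 0xFFFF          # sbc hl,de (carry cleared by xor a)

def advance_optimized(ptr, name_len):
    """Same step using the cpl trick from the optimized versions."""
    hl = (ptr - 6) & 0xFFFF            # ld de,-6 / add hl,de (de = $FFFA)
    e = ~name_len & 0xFF               # ld a,(hl) / cpl / ld e,a
    de = 0xFF00 | e                    # d is still $FF, so de = -(name_len+1)
    return (hl + de) & 0xFFFF          # add hl,de
```

Both versions move the pointer back by 7 plus the name length.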
 


Dim()rr: Fixed the page offset.

Code: (Original code)

 ld    ix,(axv_X1t)
 ld    l,(ix-6)
 ld    h,0
 
   
Code: (Fixed code)

 ld    ix,(axv_X1t)
 ld    l,(ix-5)
 ld    h,0
 


Print(): n*16-13 cycles saved, n=name length. Assuming an average name length of 4.5 characters, 59 cycles saved.

Code: (Original code: 18 bytes, n*55+51 cycles)

 ld    ix,(axv_X1t)
 ld    b,(ix-6)
Ax6_Loop:
 ld    a,(ix-7)
 ld    (hl),a
 inc   hl
 dec   ix
 djnz  Ax6_Loop
 ld    (hl),b
 ret
 
   
Code: (Optimized code: 18 bytes, n*39+64 cycles)

 ex    de,hl
 ld    hl,(axv_X1t)
 ld    bc,-6
 add   hl,bc
 ld    b,(hl)
 ex    de,hl
Ax6_Loop:
 dec   de
 ld    a,(de)
 ld    (hl),a
 inc   hl
 djnz  Ax6_Loop
 ld    (hl),b
 ret
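
For the record, the savings figure is just the difference of the two cycle formulas in the block titles. A small Python sketch of the arithmetic (function names are only for illustration):

```python
def original_cycles(n):
    """Cycle count of the original Print() for a name of n characters."""
    return 55 * n + 51

def optimized_cycles(n):
    """Cycle count of the optimized Print() for a name of n characters."""
    return 39 * n + 64

def cycles_saved(n):
    """Difference between the two, which simplifies to 16*n - 13."""
    return original_cycles(n) - optimized_cycles(n)
```

The optimized version has more setup but a cheaper loop, so it already comes out ahead for a 1-character name (3 cycles) and the gap grows by 16 cycles per extra character.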
 
« Last Edit: April 01, 2011, 01:39:40 pm by Runer112 »

Offline Runer112

« Reply #144 on: February 16, 2011, 01:47:44 pm »
Yay, double post! But it's been almost a day and I have a pretty good question/suggestion. This relates to the screen display commands. This was brought to mind when squidgetx made a post mentioning something I had discovered a while ago when documenting the speed of Axe commands. What he mentioned is that DispGraphr actually runs faster than DispGraph. Here's a quote of my response to that:

I see you've been reading up on my Commands documentation, eh squidgetx? Yeah, that's an interesting thing I discovered when speed testing the display commands. On calculators like mine with the old, "good" screen drivers, the screen driver delay seems to be pretty low and constant from calculator to calculator. DispGraph could run just as fast or faster than DispGraphr on these calculators. However, due to inconsistencies with the screen drivers in newer units, the routine may run too fast for the driver on some calculators, causing display problems, so Quigibo had to add a portion of code to pause the routine until the driver says it is ready. However, this pause itself adds some overhead time, making the routine slower.

Quigibo, the DispGraphr routine doesn't have any throttling system in place, yet no problems have been reported with it on newer calculators. Could you just remove the throttling system from the DispGraph routine and add one or two time-wasting instructions to make each loop iteration take as many cycles as each DispGraphr loop iteration?


EDIT: Hmm I don't know if Quigibo reads this thread and would see that, so I'm probably going to post that in a major thread he reads or send him a message about that.


The second paragraph is my suggested optimization. The 3-level grayscale routine doesn't have a throttling system, yet there have been no reports of display problems from anybody. Wouldn't this suggest that all the screen drivers can handle routines that have as much delay as this? The data copying loop in the 3-level grayscale routine takes 72 cycles per byte output, so could delays simply be added to the normal screen display routine to make its loop at least 72 cycles?

Offline Munchor

« Reply #145 on: February 16, 2011, 02:58:19 pm »
Quote
Print()

What function would that be, Runer?

Offline Quigibo

« Reply #146 on: February 16, 2011, 11:32:36 pm »
@squidgetx
I don't think pixel testing with constant coordinates is common enough to warrant treating it as a special case in the pixel tester.  99% of the time, you're going to be using variable arguments to test pixels.  If not, the code can probably be made more efficient without a pixel test in the first place.

The second paragraph is my suggested optimization. The 3-level grayscale routine doesn't have a throttling system, yet there have been no reports of display problems from anybody. Wouldn't this suggest that all the screen drivers can handle routines that have as much delay as this? The data copying loop in the 3-level grayscale routine takes 72 cycles per byte output, so could delays simply be added to the normal screen display routine to make its loop at least 72 cycles?

Unfortunately that is not entirely true.  There has actually been at least 1 report that the 3-level routine is too fast and causes flickers once in a great while on very new hardware.  If there was a lower bound for clock cycles, I'm right on it.  Although, I could still probably take the safety stuff off the safe copy routine, still have it faster (but not too fast) and still be smaller.  I will look into that.

And I do read most of these threads, I'm just generally too busy to post, but I try to when I have small pockets of free time :)
___Axe_Parser___
Today the calculator, tomorrow the world!

Offline Runer112

« Reply #147 on: February 17, 2011, 09:47:36 pm »
Now that you have absolute jumps implemented:

Code: (Original code)

p_Exchange:
.db 13
pop de
ex (sp),hl
pop bc
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
ld a,b
or c
jr nz,$-8

   
Code: (Optimized code)

p_Exchange:
.db 12
pop de
ex (sp),hl
pop bc
__ExchangeLoop:
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
jp pe,__ExchangeLoop ;pe is right: ldi leaves P/V set while bc is nonzero
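
On the flag question: ldi leaves P/V set as long as bc is still nonzero after its decrement, and dec hl, ld (hl),a and inc hl do not touch the flags, so the loop condition survives until the jump. Here is a Python sketch of the loop to show the behaviour. It assumes hl, de and bc hold the two block addresses and the byte count at the loop label; the function name and the `mem` list are hypothetical.

```python
def exchange(mem, hl, de, bc):
    """Model of the optimized swap loop; swaps bc bytes between hl and de."""
    while True:
        a = mem[de]                     # ld a,(de)
        mem[de] = mem[hl]               # ldi: (de) <- (hl) ...
        hl = (hl + 1) & 0xFFFF          # ... inc hl
        de = (de + 1) & 0xFFFF          # ... inc de
        bc = (bc - 1) & 0xFFFF          # ... dec bc
        pv = bc != 0                    # ldi: P/V set if bc is not zero
        mem[(hl - 1) & 0xFFFF] = a      # dec hl / ld (hl),a / inc hl (flags untouched)
        if not pv:                      # jp pe loops only while P/V is set
            break
    return mem
```

For example, swapping the two 4-byte halves of an 8-byte list turns `[0, 1, 2, 3, 4, 5, 6, 7]` into `[4, 5, 6, 7, 0, 1, 2, 3]`.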



« Last Edit: February 17, 2011, 10:23:13 pm by Runer112 »

Offline Runer112

« Reply #148 on: February 20, 2011, 06:47:27 pm »
I felt bad last time I optimized the constant bit-checking auto optimizations because I left about half of them out, stuck with the 8-byte plain old bit check routine. But thanks to a random revelation I had while lying in bed last night, I have come back for the forgotten ones!


Code: (Original code)
p_GetBit2:
.db 7 ;7 bytes, 49 cycles
xor a
add hl,hl
add hl,hl
add hl,hl
ld h,a
rla
ld l,a

p_GetBit3:
.db 8 ;8 bytes, 30/29 cycles
bit 4,h
ld hl,0
jr z,$+3
inc l

p_GetBit4:
.db 8 ;8 bytes, 30/29 cycles
bit 3,h
ld hl,0
jr z,$+3
inc l

p_GetBit5:
.db 8 ;8 bytes, 30/29 cycles
bit 2,h
ld hl,0
jr z,$+3
inc l

p_GetBit10:
.db 7 ;7 bytes, 49 cycles
xor a
add hl,hl
add hl,hl
ld h,a
add hl,hl
ld l,h
ld h,a

p_GetBit11:
.db 8 ;8 bytes, 30/29 cycles
bit 4,l
ld hl,0
jr z,$+3
inc l

p_GetBit12:
.db 8 ;8 bytes, 30/29 cycles
bit 3,l
ld hl,0
jr z,$+3
inc l

p_GetBit13:
.db 8 ;8 bytes, 30/29 cycles
bit 2,l
ld hl,0
jr z,$+3
inc l

 
   
Code: (Optimized code)
p_GetBit2:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 5,h
cp h
sbc hl,hl
inc hl

p_GetBit3:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 4,h
cp h
sbc hl,hl
inc hl

p_GetBit4:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 3,h
cp h
sbc hl,hl
inc hl

p_GetBit5:
.db 7 ;7 bytes, 37 cycles
ld a,h
set 2,h
cp h
sbc hl,hl
inc hl

p_GetBit10:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 5,l
cp l
sbc hl,hl
inc hl

p_GetBit11:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 4,l
cp l
sbc hl,hl
inc hl

p_GetBit12:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 3,l
cp l
sbc hl,hl
inc hl

p_GetBit13:
.db 7 ;7 bytes, 37 cycles
ld a,l
set 2,l
cp l
sbc hl,hl
inc hl
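
The trick in the optimized versions: copy the register to a, force the tested bit on, then compare. If the bit was already set nothing changed, so there is no carry and sbc hl,hl / inc hl gives 1; if it was clear, the register is now larger than a, carry is set, and the result is $FFFF+1 = 0. Here is a Python sketch that models this. The function names are made up, and the bit numbering (bit 0 = most significant bit of hl) is inferred from the routines themselves, e.g. p_GetBit2 using set 5,h.

```python
def get_bit_optimized(hl, axe_bit):
    """Model of the set / cp / sbc hl,hl / inc hl bit test on a 16-bit value."""
    reg = hl >> 8 if axe_bit < 8 else hl & 0xFF   # bits 0-7 live in h, 8-15 in l
    z80_bit = 7 - (axe_bit % 8)                   # e.g. bit 2 -> set 5,h
    a = reg                                       # ld a,h (or ld a,l)
    reg_set = reg | (1 << z80_bit)                # set n,h (or set n,l)
    carry = 1 if a < reg_set else 0               # cp h: carry only if the bit was clear
    result = (0 - carry) & 0xFFFF                 # sbc hl,hl -> 0 or $FFFF
    return (result + 1) & 0xFFFF                  # inc hl -> 1 or 0

def get_bit_reference(hl, axe_bit):
    """Plain shift-and-mask version for comparison."""
    return (hl >> (15 - axe_bit)) & 1
```

The two agree for every 16-bit input on all eight of the bit numbers covered by the routines above.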
 

Offline DJ Omnimaga

« Reply #149 on: February 22, 2011, 12:15:45 am »
Nice to see new optimizations :D