Fast Dot

Started by kevin, January 31, 2006, 08:50:50 AM


kevin

Spent most of today trying to remove some of the bottleneck from 'DOT' rendering. Previously it was OK for a few points here and there, but it would simply choke when trying to draw a full screen of dots. This is not acceptable!

It turns out that a hell of a lot of time was lost in clipping overhead, so after a little tinkering, hey presto, it's now able to draw 800*600 dots in about 400ms. Still not amazing, but that's about 10 times faster than it was previously for that number of pixels.

One of the issues is that DOT is basically a safe render. That means it supports clipping, and handles the destination buffer and the various draw modes for you.

So obviously we can gain a little more back by removing its safeness and letting the user lock/unlock the buffers and handle clipping, which gives us FastDot.
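
To illustrate the trade-off, here is a minimal sketch in Python (not PlayBASIC, and not how PB implements DOT internally). The surface is modelled as a flat list of pixels, and `dot_safe` / `dot_fast` are hypothetical names: the safe version pays for a bounds check on every pixel, the fast one trusts the caller.

```python
# Hypothetical model of a 32bit surface: one int per pixel, stored row by row.
WIDTH, HEIGHT = 8, 4
surface = [0] * (WIDTH * HEIGHT)

def dot_safe(x, y, colour):
    """Clipped write: points outside the surface are silently ignored."""
    if 0 <= x < WIDTH and 0 <= y < HEIGHT:
        surface[y * WIDTH + x] = colour

def dot_fast(x, y, colour):
    """Unclipped write: the caller guarantees x and y are in range."""
    surface[y * WIDTH + x] = colour

dot_safe(WIDTH, 0, 0xFF0000)       # off the right edge: clipped, nothing drawn
dot_fast(WIDTH - 1, 0, 0xFF0000)   # in range: written directly
dot_fast(WIDTH, 1, 0x00FF00)       # out of range in x: lands on pixel (0, 2)
```

Note what happens when the caller gets it wrong: the unclipped write with x equal to the width doesn't fail, it just lands on the first pixel of the row after next to where it was aimed from, i.e. one row down from the intended row.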

FastDot on my machine is about 2.5 times faster than DotC. You can render a full screen of pixels (800*600) in 150 milliseconds on my Duron 800MHz. That's about 6fps for a pixel-by-pixel screen fill (code below). Still not staggering, but a very healthy improvement.


Code for PlayBASIC V1.13


PlayBASIC Code:
w=getscreenwidth()
h=getscreenheight()
rendertoscreen


; Render the full screen using DOTC and record how long it takes
; (same 0 to w-1 / 0 to h-1 bounds as the FASTDOT loop, so the timings compare like for like)
tim1=timer()
lockbuffer
For ypoint=0 To h-1
   For xpoint=0 To w-1
      dotc xpoint,ypoint,rgb(255,0,0)
   Next
Next
unlockbuffer
tim1=timer()-tim1

c=rndrgb()
sync
waitkey


basec=rgb(255,0,255)
Do
   cls 0

   ; Render the full screen using FASTDOT and record how long it takes
   tim2=timer()
   c=basec
   lockbuffer
   For ypoint=0 To h-1
      For xpoint=0 To w-1
         fastdot xpoint,ypoint,c
      Next
      ; vary the colour from row to row
      c=c+xpoint+1
   Next
   unlockbuffer
   tim2=timer()-tim2

   ; show the time of DOTC and FASTDOT
   print tim1
   print tim2
   sync
loop






UPDATE NOTES: (14th Nov 2022)

        -Read PlayBASIC Help Files about FAST DOT 2
        -Read PlayBASIC Help Files about FAST DOT 3
        -Read PlayBASIC Help Files about FAST DOT 4


Draco9898

Very nice, getting these full scale pixel based rendering functions working fast seems like a pain in the tailhole :)
DualCore Intel Core 2 processor @ 2.3 ghz, Geforce 8600 GT (latest forceware drivers), 2 gigs of ram, WIN XP home edition sp2, FireFox 2.

"You'll no doubt be horrified to discover that PlayBasic is a Programming Language." -Kevin

kevin

Well, the main drama is keeping things generic. Generic and fast really don't go together. It doesn't matter how much fat I trim away from the edges, it's still generic. A more viable approach would be to implement a pointer data type, so the user can write their own custom dot filler. Although, this is really a situation where a concept like PB-Asm would shine.


Pointer example

PlayBASIC Code:
Dim Address as pointer
Dim FrameBufferAddress as pointer
FrameBufferAddress=GetSurfacePtr(0)
FrameBufferModulo=GetSurfaceModulo(0)
; assume a 32bit surface (4 bytes per pixel)
For Ylp=0 to height-1
   ; start address of this row
   Address=FrameBufferAddress+(Ylp*FrameBufferModulo)
   For Xlp=0 to width-1
      *Address = Rgb(255,0,255)
      inc Address,4
   next
next




That would certainly be quicker in the long term, but the drawback is that the user has to support all video formats manually.
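
As a rough illustration of what "supporting all video formats manually" involves, here is a sketch in Python of packing the same colour for two common pixel layouts. The layouts (XRGB8888 and RGB565) and the function names are assumptions for illustration; nothing here says which formats PB surfaces actually use.

```python
def pack_xrgb8888(r, g, b):
    """32bit pixel: 8 bits per channel, top byte unused."""
    return (r << 16) | (g << 8) | b

def pack_rgb565(r, g, b):
    """16bit pixel: 5 bits red, 6 bits green, 5 bits blue."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

# Bytes to step the write address by after each pixel, per surface depth.
BYTES_PER_PIXEL = {32: 4, 16: 2}

print(hex(pack_xrgb8888(255, 0, 255)))  # 0xff00ff
print(hex(pack_rgb565(255, 0, 255)))    # 0xf81f
```

So a hand-written filler needs a different packed colour and a different address step for every depth it wants to handle.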


Conceptually, if we go ahead with PB-Asm, this would probably be the fastest way to generate time-critical code without it being totally platform dependent.



Dim Address as pointer
Dim FrameBufferAddress as pointer
FrameBufferAddress=GetSurfacePtr(0)
FrameBufferModulo=GetSurfaceModulo(0)
; assume a 32bit surface (4 bytes per pixel)
For Ylp=0 to height-1
   Address=FrameBufferAddress+(Ylp*FrameBufferModulo)
   FillColour=Rgb(255,0,255)
   Asm
      ; Seed registers (R0 through R3 are 32bit)
      Mov.l R0, Width
      Mov.l R1, FillColour
      Mov.l R2, Address
      ; Fill loop
   Loop:
      Mov.l (R2), R1     ; write the colour to the current pixel
      Add.l R2, 4        ; step to the next 32bit pixel
      DecBne R0, Loop    ; decrement the counter and loop while it's non-zero
   EndAsm
next


The main appeal of implementing something like PB-Asm would be that it's a way to bypass the variables/pointers and manipulate memory directly. The Asm segments could be JIT-compiled to the host platform's native machine code. Given the simplicity of the potential instruction set, most, if not all, operations would translate 1 to 1. In cycle terms that's about 4 to 5 cycles per pixel for that fill loop, compared to the hundreds of cycles it takes per pixel now.

But anyway, I digress...
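
For a sense of scale, here's that cycle estimate worked through in Python. It assumes the 800MHz Duron and the 150ms full-screen FastDot time quoted earlier, and ignores memory bandwidth and everything else the machine is doing, so treat the results as ballpark figures only.

```python
CPU_HZ = 800_000_000   # assumed clock speed of the test machine (Duron 800MHz)
PIXELS = 800 * 600     # one full screen

def cycles_per_pixel(fill_ms, pixels=PIXELS, cpu_hz=CPU_HZ):
    """CPU cycles spent per pixel for a fill that takes fill_ms."""
    return fill_ms * cpu_hz / (1000 * pixels)

def fill_ms_at(cycles, pixels=PIXELS, cpu_hz=CPU_HZ):
    """Time in milliseconds for a fill costing `cycles` per pixel."""
    return cycles * pixels * 1000 / cpu_hz

print(cycles_per_pixel(150))  # 250.0 cycles per pixel today
print(fill_ms_at(5))          # 3.0 ms at 5 cycles per pixel
```

In other words, the interpreted loop is spending roughly 250 cycles on each pixel, where a 5-cycle native loop would, on paper, fill the same screen in about 3ms.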

kevin

Fast Dot Revisited

I've been quietly optimizing some of the old VM baggage away from PB1.17. This is often necessary, as over time things get bloated and can often be streamlined. While I do have a pre-set group of standard benchmarks/results I use when testing for speed, these are mainly math and loop orientated. So I figured I'd use the raw DOT screen filler as a gfx one.

Results,


In the screenshot above, PB1.13 is filling a screen full of pixels (800*600*32bit) in 150ms.
PB1.17 now performs this task in 132ms.


Test machine: Duron 800MHz, GF2 video, WinXP Pro

In frame rate terms that's about another full frame per second faster (as it saves nearly 20 milliseconds per frame). Which doesn't sound impressive, but effectively that means the brute loop crunching power of PB is about 12% better in this situation.
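
Worked through in Python, using the two timings above (the function names are just for illustration):

```python
def frames_per_second(frame_ms):
    """Frames per second if every frame takes frame_ms milliseconds."""
    return 1000.0 / frame_ms

def percent_faster(old_ms, new_ms):
    """Time saved per frame, as a percentage of the old time."""
    return (old_ms - new_ms) / old_ms * 100.0

old_ms, new_ms = 150, 132   # PB1.13 and PB1.17 full-screen fill times
print(round(frames_per_second(old_ms), 2))       # 6.67
print(round(frames_per_second(new_ms), 2))       # 7.58
print(round(percent_faster(old_ms, new_ms), 1))  # 12.0
```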

If you calculate the fill rate per pixel, you can get an idea of just how many pixels my test machine can fill at a reasonable rate. ( Fill rate = (Fill Width * Fill Height) / Milliseconds )

My machine (Duron 800MHz) fills about 3200 pixels per millisecond. So it's fast enough to do this in 320*240*32bit at 38/40fps. Which is pretty staggering (to me at least), as it sure wasn't able to come close to that just a few weeks ago.
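
The same sums in Python, as a sanity check. This uses the 800*600 fill in 150ms figure from earlier; the predicted frame rate only counts the fill loop itself, which is presumably why the measured 38/40fps comes in a little lower.

```python
def fill_rate(width, height, ms):
    """Pixels filled per millisecond."""
    return (width * height) / ms

rate = fill_rate(800, 600, 150)      # full-screen fill at 800*600
frame_ms = (320 * 240) / rate        # predicted fill time at 320*240

print(rate)                          # 3200.0
print(frame_ms)                      # 24.0
print(round(1000.0 / frame_ms, 1))   # 41.7
```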
 



;makebitmapfont 1,$ffffff
w=getscreenwidth()
h=getscreenheight()

w=320
h=240

openscreen w,h,32,2
rendertoscreen
;ScreenVsync on

; Render the full screen using DOTC and record how long it takes
tim1=timer()
lockbuffer
 For ypoint=0 To h-1
  For xpoint=0 To w-1
    dotc xpoint,ypoint,rgb(255,0,0)
  Next
 Next
unlockbuffer
tim1=timer()-tim1


basec=rgb(255,0,255)
Do
cls 0

rendertoscreen
dot 0,ypoint

tim2=timer()
c=basec
 lockbuffer
 For ypoint=0 To h-1
  For xpoint=0 To w-1
   fastdot xpoint,ypoint,c
  Next
  c=c+xpoint+1
 Next
 unlockbuffer
tim2=timer()-tim2
basec=basec+w

; show the frame rate, the FASTDOT time in milliseconds and the fill rate
print fps()
print "MS"+str$(tim2)
print "Fill Rate:"+str$(Float(w*h)/tim2)
sync
loop




Draco9898

Are you going to throw PB-ASM in? Looks extremely useful...

kevin

Probably, although it's not like I can just throw it in.  Effectively it's like producing a mini compiler, within a compiler.

thaaks

To me PB-Asm sounds like a way to circumvent engine/interpreter problems.
Personally I would avoid something like PB-Asm - it will result in a second language to be supported/improved/bugfixed...

Maybe it makes more sense to enhance the interpreter. The issue you're trying to solve looks pretty much like "HotSpot" from Java.
With JIT the SUN people were able to transform method code into machine code but the call stack (the sequence of methods to be called) was still interpreted in Java. So SUN worked on HotSpot which means they transform whole call stack regions into native code.
This works pretty well for big loops for example. Maybe that gives you some more ideas, Kevin...

But that's just my 2 cents  ;)

Cheers,
Tommy

Digital Awakening

Who actually has a use for PB-Asm? As Thaaks says, it's like a 2nd language with all the problems involved with it. Also, PB is meant to be an easy way to program games. Personally I would like to see PB FX first and perhaps other things that are more directly usable for game creation. PB FX would allow us to do great looking modern 2D games. When that's taken care of there's nothing that stops including more complicated features.
Visit my site at: DigitalAwakening.net

kevin

I knew this would be misinterpreted. Implementing something like PB-Asm is a low priority; the concept is as old as PB itself. However, there is certainly a need for a way to streamline time-critical loops without the compiler-generated overheads getting in the road.

The plan has always been to compile the source down to one generic byte code instruction set (which is what it already does), then translate the byte code to native machine code where possible. The translation can occur either in the platform VM, or externally (as a module). Anyway, the issue (one of them) is that no matter how clever the code generator is, it's highly unlikely to be able to reach the speed of a manually set out asm loop. But it'll certainly be a lot quicker either way :)
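
To show where that kind of overhead comes from, here's a toy byte code loop in Python. The three opcodes are hypothetical, loosely modelled on the PB-Asm fill loop earlier in the thread; this is not PB's actual byte code or VM. Every pixel costs three trips through the dispatch code, where a native translation would just be three machine instructions.

```python
# Hypothetical opcodes for a tiny register machine.
MOV_STORE, ADD, DEC_BNE = range(3)

def run(program, regs, memory):
    """Interpret the program and return how many instructions were dispatched."""
    pc = 0
    steps = 0
    while pc < len(program):
        op, a, b = program[pc]
        steps += 1
        if op == MOV_STORE:        # store regs[b] at byte address regs[a]
            memory[regs[a] // 4] = regs[b]
            pc += 1
        elif op == ADD:            # regs[a] += constant b
            regs[a] += b
            pc += 1
        elif op == DEC_BNE:        # regs[a] -= 1, jump to b while non-zero
            regs[a] -= 1
            pc = b if regs[a] != 0 else pc + 1
    return steps

width = 8
memory = [0] * width               # one row of 32bit pixels
regs = [width, 0xFF00FF, 0, 0]     # R0 = counter, R1 = colour, R2 = address
program = [
    (MOV_STORE, 2, 1),             # Mov.l (R2), R1
    (ADD, 2, 4),                   # Add.l R2, 4
    (DEC_BNE, 0, 0),               # DecBne R0, Loop
]
steps = run(program, regs, memory)
print(steps)                       # 24 dispatches to fill 8 pixels
```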

hartnell

The idea for PB-ASM opens up all kinds of new possibilities for PB. It would certainly attract wanna-be ASM coders and the computer science crowd. Imagine :

* Learn the fundamentals of ASM -- using PlayBasic!
* Learn the fundamentals of making your own operating system -- using PlayBasic!
* Learn the fundamentals of making your own programming language -- using PlayBasic!

I began a 6502 emulator project for this exact reason; sadly, I lack the time to program it myself.

It will be a while before I'm able to get into computer science again, but I can definitely say that there is a market for this kind of thing.

The two requirements for attracting this audience would be

* Include PB-ASM in PB Source -- for people looking to write their own ASM routines.
* An option to develop using only the VM2.

If you ever want to continue with it, please post a brainstorming thread. :)

-- Shawn

kevin