Download & Strip Html From String

Started by kevin, February 28, 2015, 11:56:46 PM

kevin

    Download & Strip Html From String

    I needed a simple way to strip HTML from a set of known HTML documents, so I wrote this. It works well on the small sample of pages I tested, and it can easily be expanded to make it smarter, which you can do yourself!


    Keywords:  HTTP DownloadURL StripHTML URLTOFILE

PlayBASIC Code:
   // include the HTTP library
#include "http"

/*
*=---------------------------------------------------------------------=*

>> HTML STRIPPER - DUMB VERSION <<

By: Kevin Picone

Started: 1st,Mar,2015

(c) copyright 2015 All rights reserved

underwaredesign.com playbasic.com
*=---------------------------------------------------------------------=*

What Does it do ?

The focus of this example was to build a function that does
a reasonable job of stripping HTML tags from a string. The strip
function is basically dumb though. By that I mean it's not HTML aware;
it simply assumes that any text in the form of < TAG > or < / TAG >
is HTML. It's possible that it's not.

To support more tags directly, you can drop them into the SELECT
statement in the middle of the function to apply any required logic
yourself.

*/


print "Downloading":sync

url$="http://underwaredesign.com"
Html$=DownloadFileToString(url$)

CleanText$=Strip_Html_From_String(Html$)

#print "--------------------------------------------------------------------"
#print "-[ ORIGINAL HTML ]-----------------------------------------------------"
#print "--------------------------------------------------------------------"
#print HTML$


#print "--------------------------------------------------------------------"
#print "-[ CLEANED TEXT ]-----------------------------------------------------"
#print "--------------------------------------------------------------------"
#print CleanText$

dim words$(0)
Count=Splittoarray(CleanText$," ",Words$(),0,7)

print "Word Count:"+Str$(Count)

Xpos=GetCursorX()
Ypos=GetCursorY()
Th=GetTextHeight("|")
; Words$() holds Count entries, indexed 0 to Count-1
for lp=0 to Count-1
   ThisWord$=Words$(lp)
   WidthInPixels=GetTextWidth(ThisWord$+" ")
   ; wrap to the next row when the word would run past the right edge
   if (Xpos+WidthInPixels)>GetSurfaceWidth()
      Xpos=0
      Ypos+=Th
   endif
   Text Xpos,Ypos,ThisWord$
   Xpos+=WidthInPixels
next

ypos+=TH

text 0,ypos,"DONE"


Sync
WaitKEY
end




;----------------------------------------------------------------------------
;----------------------------------------------------------------------------
;---[ LOAD FILE TO STRING ]--------------------------------------------------
;----------------------------------------------------------------------------
;----------------------------------------------------------------------------



Function LoadFileToString(file$)
   if FileExist(file$)
      size=filesize(file$)
      ThisBank=NewBank(size+256)
      f=readnewfile(file$)
      readmemory f,GetBankPtr(ThisBank),size
      closefile f
      result$=peekbankstring(ThisBank,0,size)
      deletebank thisbank
   endif
EndFunction Result$



;----------------------------------------------------------------------------
;----------------------------------------------------------------------------
;---[ DOWNLOAD (FROM WEB) FILE TO STRING ]------------------------
;----------------------------------------------------------------------------
;----------------------------------------------------------------------------


Function DownloadFileToString(Url$)

   ; get a free Http Index
   local Index = GetFreeHttp()
   ; open a Http session
   OpenHttp(Index)
   ConnectHttp(Index, Url$)
   RequestHttpData(Index, "") ; request data

   Repeat
      local transfered = HttpTransfer(Index) ; poll transfer
      sync
   Until transfered = 0 ; If 0 bytes were transferred we can assume the transfer is finished

   TempBank=Newbank(1)

   CopyHttpDataToBank(Index, TempBank)

   ; closes the transfer, all internal memory resources are freed
   CloseHttpTransfer Index

   ; disconnects from the Url
   DisconnectHttp Index

   ; closes the session
   CloseHttp Index

   size=getbankSize(TempBank)

   Result$=PeekBankString(TempBank,0,size)

   deletebank TempBank

EndFunction Result$




/*----------------------------------------------------------------------------
----------------------------------------------------------------------------



kevin


   DownloadUrl$  DownloadUrlToFile$


    This code uses a built-in OS function (URLDownloadToFile from urlmon.dll) to download a URL to a file. It supports both HTTP and HTTPS URLs, making it an easy way to grab external data.


PlayBASIC Code:
   url$="http://underwaredesign.com"
page$=DownloadUrl$(Url$)
#print page$

sync
waitkey



linkdll "urlmon.dll"
API_URLDownloadToFile(pCaller,url$,Filename$,reserved,lpfncb) alias "URLDownloadToFileA" as integer
endlinkdll


function DownloadUrl$(url$)
   SaveFilename$=tempdir$()+MakeFilenameString(32)+".txt"
   Status=DownloadUrltoFile(Url$,SaveFilename$)
   if Status=0
      Html$=LoadFileToString(SaveFilename$)
   endif
   if fileexist(SaveFilename$) then DeleteFile SaveFilename$

EndFunction Html$


Function DownloadUrltoFile(Url$,SaveFilename$)
   // assume failure; URLDownloadToFile returns 0 (S_OK) on success
   Status=-1

   // check for passing empty strings in
   if Len(Url$)>1 and len(SaveFilename$)>1

      // here you probably should do some checking of the URL and
      // download path.. But I haven't bothered :)

      Status= API_URLDownloadToFile(pCaller,url$,SaveFileName$,reserved,lpfncb)
   endif

EndFunction Status






Function LoadFileToString(file$)
   if FileExist(file$)
      size=filesize(file$)
      ThisBank=NewBank(size+256)
      f=readnewfile(file$)
      readmemory f,GetBankPtr(ThisBank),size
      closefile f
      result$=peekbankstring(ThisBank,0,size)
      deletebank thisbank
   endif
EndFunction Result$



function MakeFilenameString(Size)
   // Make a randomized filename (collisions are very unlikely, not impossible)
   local ChrSet$="0123456789"
   for lp=asc("a") to asc("z")
      ChrSet$+=chr$(lp)
   next
   for lp=asc("A") to asc("Z")
      ChrSet$+=chr$(lp)
   next

   for lp=1 to size
      // Mid$ positions start at 1, so pick an index from 1 to Len(ChrSet$)
      ThisChr = rnd(len(ChrSet$)-1)+1
      name$+=mid$(chrset$,ThisChr,1)
   next

EndFunction Name$
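
For comparison, the same flow (download to a temporary file, load the file into a string, delete the file) can be sketched outside PlayBASIC. The Python version below is only an illustration of the idea using the standard library. The name download_url is hypothetical, and decoding the file as UTF-8 with replacement characters is an assumption; the PlayBASIC code reads raw bytes.

```python
import os
import tempfile
import urllib.request


def download_url(url: str) -> str:
    """Download url to a temp file, read it back, then delete the file.

    Returns an empty string on failure, mirroring DownloadUrl$ above.
    """
    # Let the OS pick a unique temp name instead of building a random one.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with urllib.request.urlopen(url) as response, open(path, "wb") as out:
            out.write(response.read())
        # Assumption: treat the payload as UTF-8 text, replacing bad bytes.
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            return f.read()
    except (OSError, ValueError):
        # URLError is an OSError; a malformed URL raises ValueError.
        return ""
    finally:
        # Always clean up the temp file, whether or not the download worked.
        if os.path.exists(path):
            os.remove(path)
```

Calling download_url("http://underwaredesign.com") would be the counterpart of the DownloadUrl$ call at the top of the listing. The temp file is not strictly needed in Python; it is kept here only to match the structure of the PlayBASIC code.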