Scrape text from browser surface (not HTML)

  • Thread starter: _BNC

_BNC

I need to do the equivalent of 'select all' and 'capture' on a browser
screen. I just want the text as it appears on-screen. Is there a simple
way to automate this?

The thing that may make it a bit awkward is that the browser windows
appear as popup-style windows, with no navigation buttons. That part
may not be so easy to automate.
 
Just fire a web request at it. What comes back is an HTML string, which you can
parse (preferably with a regex) to extract what you want. I have
some code somewhere, but I believe I plagiarized it from the net, so you
should be able to find plenty of code that does just that. regexlib.com will help
you with the appropriate regex.
 

That sounds like a good idea, Alvin, but I just checked regexlib.com and
came up empty. (Lots of good stuff there though, so thanks for the
pointer.)

I thought there would be a simpler way of bringing up the window and
inducing a text capture programmatically, just as if the user had done it.
Of course if there is an efficient regex expression, that would be great,
but it's tough to come up with efficient search keys for googling that
type of thing.

You'd think someone would have a nice C# function for turning ugly HTML
email into readable text. Or something equivalent.
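
A minimal sketch of the kind of function being asked for here, using only regular expressions and the framework's entity decoder. The class and method names (HtmlText, ToPlainText) are made up for illustration, and the result is only an approximation of what a browser would show: it assumes reasonably well-formed markup and does not try to reproduce table layout beyond putting a tab between cells.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: converts an HTML string to rough plain text.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Normalize line endings so the cleanup below only deals with '\n'.
        string text = html.Replace("\r\n", "\n");

        // Drop script and style blocks, including their contents.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Turn line-breaking tags into newlines and table cells into tabs.
        text = Regex.Replace(text, @"<(br|/p|/div|/tr|/h[1-6]|/li)\b[^>]*>", "\n",
            RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"</t[dh]\s*>", "\t", RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Translate entities such as &nbsp; and &amp; into characters,
        // then map the non-breaking space to an ordinary space.
        text = WebUtility.HtmlDecode(text).Replace('\u00A0', ' ');

        // Trim whitespace around line breaks and collapse long runs.
        text = Regex.Replace(text, @"[ \t]*\n[ \t]*", "\n");
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        text = Regex.Replace(text, @" {2,}", " ");
        return text.Trim();
    }
}
```

Calling HtmlText.ToPlainText(pageContent) on the string returned by a web request would give the stripped text. For messy real-world pages a proper HTML parser is likely to be more reliable than regexes.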
 
Well, it's not that difficult to concoct a good regex to do this.
If you post a request for help in the C# newsgroup, I'm sure you can get someone
like chris r to write you one. I steal stuff off regexlib and modify it
for my devious purposes. I'm no good at writing that stuff from scratch.

I have the webrequest code if you want it, by the way.
 
(Re extracting just visible text from HTML)


I confess to cross-posting this to the C# group in the hope that someone would
notice it. I'm sure I've seen something like this at one time, too. But
it would involve breaking down HTML tables, translating symbols like
'nbsp' and all that. I'm sure it's been done, but it sounds like a lot of
wheel-reinventing if I have to do it myself.

That would be nice! Thanks.

BNC
 
This is a web service I used to scrape telephone directory information from
anywho.com. It's pretty rough code, but you should get the general idea.

// Needs: using System; using System.IO; using System.Net;
//        using System.Text; using System.Web.Services;

[WebMethod]
public string PhoneLookup(string strNumber, ref int counter)
{
    string strResult = string.Empty, searchtext = string.Empty;

    try
    {
        // Build the query URL from the area code (first 3 digits) and the
        // 7-digit local number. strNumber must be 10 digits, no punctuation;
        // anything shorter throws and is reported as null below.
        Uri myUri = new Uri("http://www.anywho.com/qry/wp_rl?npa=" + strNumber.Substring(0, 3) +
            "&telephone=" + strNumber.Substring(3, 7) + "&btnsubmit.x=36&btnsubmit.y=9");

        // Create an HttpWebRequest for the URL and pose as a browser.
        HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
        myHttpWebRequest.UserAgent =
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705)";

        // Read the whole page; the using blocks close the response and reader.
        string pageContent;
        using (HttpWebResponse res = (HttpWebResponse)myHttpWebRequest.GetResponse())
        using (StreamReader sr = new StreamReader(res.GetResponseStream(), Encoding.UTF8))
        {
            pageContent = sr.ReadToEnd();
        }

        // The listing sits in the query string of the map link, which ends
        // at the "Maps & Directions" link text.
        int startpos = pageContent.IndexOf("bin/amap.cgi?");
        int endpos = pageContent.IndexOf("Maps & Directions");

        // Test for -1 before adding the offset, otherwise a miss goes unnoticed.
        if (startpos != -1 && endpos > startpos + 10)
        {
            startpos += 10; // skip "bin/amap.c", leaving "gi?lastname=..."
            searchtext = pageContent.Substring(startpos, endpos - startpos + 17);

            // Crude cleanup: strip the link text, turn field names into line
            // breaks, and turn URL punctuation into spaces.
            searchtext = searchtext.Replace(">Maps & Direct", " ");
            searchtext = searchtext.Replace("gi?lastname=", " ");
            searchtext = searchtext.Replace("firstname=", "\n");
            searchtext = searchtext.Replace("+", " ");
            searchtext = searchtext.Replace("=", " ");
            searchtext = searchtext.Replace("&", " ");
            searchtext = searchtext.Replace("\\", " ");
            searchtext = searchtext.Replace("\"", " ");
            searchtext = searchtext.Replace("city", "\n");
            searchtext = searchtext.Replace("state", "\n");
            searchtext = searchtext.Replace("zip", "\n");
            searchtext = searchtext.Replace("country", "\n");
            searchtext = searchtext.Replace("npatelephone", "\n");
            searchtext = searchtext.Replace("streetaddress", "");

            strResult = searchtext;
        }
    }
    catch
    {
        // Any failure (network error, malformed number) is reported as null.
        return null;
    }

    // Agent is not defined in this snippet; it presumably belongs to the
    // surrounding project.
    Agent.SelectedIndex = -1;

    if (strResult.Trim() == string.Empty)
        return null;

    counter++;
    return strResult + "\n\n";
}
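
The chain of Replace calls above can also be sketched as a small query-string parser, which keeps each field under its own name instead of relying on the order of the replacements. This is only an illustration: MapLinkParser and ParseQuery are made-up names, the field names (lastname, firstname, city, zip and so on) are taken from the Replace calls above, and the sample values in the usage note are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MapLinkParser
{
    // Hypothetical helper: splits "a=1&b=two+words" into decoded key/value pairs.
    public static Dictionary<string, string> ParseQuery(string query)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] pairs = query.TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string pair in pairs)
        {
            int eq = pair.IndexOf('=');
            string key = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? string.Empty : pair.Substring(eq + 1);

            // '+' stands for a space in query strings; %XX escapes are decoded too.
            fields[Uri.UnescapeDataString(key)] =
                Uri.UnescapeDataString(value.Replace('+', ' '));
        }
        return fields;
    }
}
```

With a made-up query such as "lastname=DOE&firstname=JOHN&city=SPRINGFIELD+HEIGHTS&zip=12345", ParseQuery(...)["city"] gives "SPRINGFIELD HEIGHTS". If the link is lifted straight out of the page source, any "&amp;" in it would need decoding first.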
 