Extraction Code

  • Thread starter: JoJo

JoJo

Folks:


I have an HTML file (see the general structure below) that is about 100
pages long. Scattered throughout this document, in a really disorganized
fashion, are 6 or 7 categories or fields of information, such as: Name of
Article, Author of Article, Date Published, Comment, Full Story, Printer
Version, etc. Some of this information is in the form of HTML hyperlinks.
I am attempting to extract some of it in such a way that the HTML links are
preserved and not automatically transformed to plain text.

Specifically, I am interested in the DOS code that would extract the
following 2 pieces of information from my HTML document and write them to a
separate document:
(1) Names of the articles (each always starts with ">>", followed by the
actual article name in HTML format)
(2) Date published ( | Published 08/9/2009 | ). Note that this information
is contained between the "|" characters.






-------------- HTML file that is about 100 pages long --------------

Articles by this Author - Page 1 (Page 1 of 100)
« Back | 1 | 2 | 3 | 4 | 5 | Next »

By James Johnson | Published 08/9/2009 |
Do you have a written goal of what you expect to make from your trading this
year?
full story    printer version
.............................................................................
----------------------------------------------------------------------------
 

By 'DOS', I assume you mean at a Windows command prompt, not
literally with a version of DOS. It would be easier to do this with
Windows Scripting, but depending on the exact internal structure of
the page(s), a FIND or FINDSTR (maybe in conjunction with a FOR loop)
should be able to do this. However, regardless of the scripting
approach, much more information about the structure of the HTML code
is required to be able to do this.

Specifically, the HTML tags that define the anchors are needed, as well
as how many lines of text the information you desire occupies in the
HTML code (not how many are displayed in your browser). Maybe a
reasonable sample of the HTML text would be sufficient to get a
start. But, without that, it is hard to say where to start.

Tom Lavedas
***********
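
For illustration only, here is a rough Python sketch of the line-filtering
idea described in the reply. It is not the FIND/FINDSTR or Windows Scripting
solution itself, and the real HTML source was never posted, so everything
about the input is an assumption: that each article title sits on a single
raw line containing a literal ">>" (or its escaped form "&gt;&gt;"), and that
each date appears on one line as "| Published MM/D/YYYY |". The names
extract, DATE_RE and TITLE_MARKERS are made up for this example. Title lines
are kept whole so that any <a href=...> markup survives. A bare ">>" test may
also match unrelated markup, so the output would need checking against the
real file.

```python
import re
import sys

# Assumed pattern: "| Published 08/9/2009 |" on a single line of the raw HTML.
DATE_RE = re.compile(r"\|\s*Published\s+(\d{1,2}/\d{1,2}/\d{4})\s*\|")

# Assumed title markers: a literal ">>" or its HTML-escaped form "&gt;&gt;".
TITLE_MARKERS = (">>", "&gt;&gt;")


def extract(lines):
    """Return (kind, text) pairs for publication dates and title lines."""
    results = []
    for line in lines:
        stripped = line.strip()
        match = DATE_RE.search(stripped)
        if match:
            results.append(("date", match.group(1)))
        if any(marker in stripped for marker in TITLE_MARKERS):
            # Keep the whole raw line so hyperlink markup is preserved.
            results.append(("title", stripped))
    return results


# Hypothetical usage: python extract.py input.html output.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as src:
        found = extract(src)
    with open(sys.argv[2], "w", encoding="utf-8") as dst:
        for kind, text in found:
            dst.write(f"{kind}: {text}\n")
```

If a title or date spans more than one line in the raw HTML, which is exactly
the detail the reply asks about, a line-by-line filter like this would miss it
and the file would have to be read as a whole instead.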
 