G. Stewart
The objective is to extract the first n characters of text from an
HTML block. I wish to preserve all HTML (links, formatting etc.), and
at the same time, extend the size of the block to ensure that all
closing tags are recovered.
For example, simply extracting the first 400 characters of an HTML
block may result in an <i> opening tag being included, but its
closing tag being excluded. Or a link may get chopped halfway:
[... blah blah <a href="ht] may be the last few characters of the
recovered phrase.
Ideally, if any HTML opening tag is included in the first n
characters, then any number of extra characters should continue to be
extracted from the source block until all paired closing tags are
found.
We can assume that the source block is well-formed HTML, and every
opening tag has a closing tag (whether optional or not). Furthermore
(if it makes any difference), we can assume that all tags are given in
their simplest forms with no attributes (e.g. <p>, <ul>, <li>, <b>),
except for anchor tags, which have the href attribute of course.
Can anyone suggest a regular expression to do this?
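
To make the intended behaviour concrete, here is a rough Python sketch of
the extraction described above. It is only an illustration under the
stated assumptions (well-formed input, every opening tag explicitly
closed, no attributes other than href), and it uses a regular expression
only to split the block into tags and text, with a counter tracking open
tags. The name truncate_html is hypothetical, and n is assumed to count
text characters only, not the characters inside the tags themselves.

```python
import re

# One token is either a tag (opening or closing) or a run of plain text.
# Group 1 is "/" for a closing tag; group 2 is the tag name.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+")


def truncate_html(html, n):
    """Return roughly the first n text characters of html, extended
    until every tag opened so far has been closed.

    Assumes well-formed input in which every opening tag has an
    explicit closing tag (so no <br>-style void tags).
    """
    out = []
    depth = 0   # number of tags currently open
    count = 0   # text characters emitted so far (tags are not counted)
    for m in TOKEN.finditer(html):
        # Stop once enough text is collected and nothing is left open.
        if count >= n and depth == 0:
            break
        token = m.group(0)
        if m.group(2):
            # A tag: copy it whole and update the open-tag counter.
            depth += -1 if m.group(1) else 1
            out.append(token)
        elif depth == 0:
            # Text outside any tag: safe to cut exactly at the limit.
            piece = token[: n - count]
            out.append(piece)
            count += len(piece)
        else:
            # Text inside an open tag: keep all of it, since the block
            # has to run on to the matching closing tag anyway.
            out.append(token)
            count += len(token)
    return "".join(out)


sample = ('<p>Hello <i>brave <a href="http://example.com/">new</a>'
          ' world</i> of tags.</p><p>Second paragraph.</p>')
print(truncate_html(sample, 10))
```

With the sample above and n = 10, the limit falls inside the <i> element,
so the result runs on to the end of the first <p> and the second
paragraph is dropped. Entities such as &amp; are counted as several
characters here, and a stray unescaped < in the text would be skipped, so
this is a starting point rather than a complete solution.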