XML and Word Docs


Neil

An article at http://news.com.com/2100-1012-991694.html?tag=fd_top states:
"XML [in Office 2003] would allow easier interchange of data generated in
Office documents with back-end systems or existing Web services."

As part of an Access 2000 application, I have to continually parse Word
documents and store the parsings in Access tables using Automation to
control Word and parse the document. Is there a way that XML would help with
that?

Thanks.
 
Check out the articles on WML, which is the type of XML that Word uses.
(WML = Word processing eXtensible Markup Language, for what it's
worth.)

I've played a bit with WML and Access, and I can't say it's any
better (or worse) than writing or reading HTML or any other form of
XML, but it's interesting (and a bit of fun) to write a Word document
without having to use RTF or having Word installed anywhere. In
addition, WML is a *heck* of a lot easier to work with than RTF!

Google "WML" and you'll probably find a bunch of information on it.
That's how I started, but sadly, it appears that I didn't keep the
links. I think I started out at xml.org, but don't quote me on that.
 
Not a bit actually. Users are unlikely to save their Word documents as
WML files just so you can parse them.
 
I reckon it depends on your Word documents. If they are highly structured
then XML could help. If they are basically unstructured then XML will not
help.

The RSS schema is a useful example. I sometimes create RSS files in Word.
RSS files can be fairly weakly structured if long passages of text are
embedded between <Description></Description> tags. XML is still useful for
me because the Description tag corresponds one-to-one with a column in my
database.

Assuming your documents pass the structure test, the key thing is whether
you can control the document creation process. If you can get the authors to
create their documents in XML, parsing them is much easier and more robust
than parsing regular text. You can use XML Schema to enforce validity and
well-formedness; you can use types from the Xml namespace in the framework
class library; and you can use XSLT to transform from one format to another.
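
As a rough illustration of the one-to-one mapping between a tag and a column, here is a minimal Python sketch. The sample XML, the `items` table, and the `load_items` helper are all made up for this example, and an in-memory SQLite database stands in for the Access table only to keep the sketch self-contained; it is not how Access itself would be driven.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical RSS-like sample; the <Description> tag follows the example
# in the post, everything else is invented for illustration.
SAMPLE = """<Channel>
  <Item>
    <Title>First entry</Title>
    <Description>A long passage of free text.</Description>
  </Item>
  <Item>
    <Title>Second entry</Title>
    <Description>Another passage, stored as-is.</Description>
  </Item>
</Channel>"""


def load_items(xml_text, conn):
    """Insert one row per <Item>, mapping each child tag to one column."""
    # fromstring raises ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (title TEXT, description TEXT)"
    )
    rows = [
        (
            item.findtext("Title", default=""),
            item.findtext("Description", default=""),
        )
        for item in root.iter("Item")
    ]
    conn.executemany(
        "INSERT INTO items (title, description) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


# SQLite in memory is an assumption standing in for the real database.
conn = sqlite3.connect(":memory:")
count = load_items(SAMPLE, conn)
```

The point of the sketch is that no pattern hunting is needed: once the authors produce structured XML, each element name tells you which column the text belongs in.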
 
I don't quite agree. I get the impression that the docx format will make
parsing of unstructured documents easier, if only by making it easier to
bring a heavy-duty regex engine to bear. That said, "easier" may just
mean the difference between impossible and not-quite-so-impossible<g>.
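
To make the regex idea concrete, here is a small Python sketch. It assumes the document is already available as XML text; the element names, the sample content, and the reference-number pattern are all hypothetical, and real Word XML is far more verbose than this.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: a reference number sits in plain text next to
# inline markup that would get in the way of a naive text search.
SAMPLE = (
    "<doc>"
    "<p>Order <b>ref</b>: AB-1234 shipped.</p>"
    "<p>See also CD-5678.</p>"
    "</doc>"
)

# Hypothetical pattern: two capital letters, a hyphen, four digits.
REF_PATTERN = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def find_refs(xml_text):
    """Flatten each paragraph to plain text, then run the regex over it."""
    root = ET.fromstring(xml_text)
    found = []
    for para in root.iter("p"):
        # itertext() yields the text of the element and all its children,
        # so inline markup no longer splits the string.
        text = "".join(para.itertext())
        found.extend(REF_PATTERN.findall(text))
    return found


refs = find_refs(SAMPLE)
```

The XML parser does the job of stripping markup and delimiting paragraphs, which leaves the regex to deal with content only.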

 
I have to agree with you, John. But then again, as the original poster
mentioned, it won't help him out a bit in his application since the
users probably won't be saving their documents as WML files. And
unless they're using a version above Word 2000, they won't be saving
WML documents at all!

So, what's the point? Everyone needs to upgrade? <sigh> And what if
they _are_ using Word 2003? How's that going to help Neil out in
getting the data into an Access table?


 
I'm a bit confused by WML in this context - Wireless Markup Language??

I think the important point here is that Word 2003 can save two types of XML:
there is the default specialized sort (docx??), which looks incomprehensible
if you view it in Notepad and is thus difficult to parse, and then there is
the nice and simple, highly parsable standard sort that you get when you
create the document with an imported schema template and click the 'save
data only' option when you save.

If Neil controls the doc creation process, the data is structured and
everyone uses Word 2003, he can make it equally easy for users to create
standard XML files as native Word format docs. XML data and relational data
are interchangeable (though I'm not too familiar with the specific
capabilities of Access).

If none of the tests succeed he has to stick with old-fashioned regex
parsing of regular text. The specialized form of XML cannot help because the
markup is document-smart, not content-smart.

-richard

 
WML = "Word processing eXtensible Markup" or Word processing XML.

By the way, you don't need a "template" to create a WML file. You need
a DTD, but (just as with HTML) there is a default set, and there's the
one from Microsoft that is automatically referenced on save, just as
when you save a word document as HTML.


 
Please excuse my clumsy terminology; I'm not familiar with the Word SDK. I
think what I described as the 'native' sort of XML is called
WordProcessingML in the SDK, which I guess is another term for "Word
processing eXtensible Markup".

Thinking about it a bit more, I think WPML does help the slicing and dicing
process a bit when document content is semantically unstructured. It gives
you the option of using XML tools and techniques for intelligent parsing:
you are not compelled to use the Word SDK to hunt for patterns in document
elements such as paragraphs and formatting that provide the clues for the
existence of interesting data fragments.
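
A minimal Python sketch of that kind of pattern hunting with ordinary XML tools follows. It assumes the Word 2003 XML layout of `w:p` (paragraph), `w:r` (run), `w:rPr` (run properties) and `w:t` (text) elements in the 2003 WordML namespace; the sample document and the `paragraphs` helper are invented for illustration, and a real file saved by Word carries far more markup than this.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for Word 2003 XML documents.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}

# Hypothetical, heavily trimmed sample document.
SAMPLE = f"""<w:wordDocument xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t>Customer: </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Acme Ltd</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Some free text.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def paragraphs(xml_text):
    """Return (paragraph text, list of bold run texts) for each paragraph."""
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter(f"{{{W_NS}}}p"):
        full, bold = [], []
        for run in p.findall("w:r", NS):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            full.append(text)
            # A <w:b/> inside the run properties marks the run as bold,
            # which here serves as the formatting clue for a data fragment.
            if run.find("w:rPr/w:b", NS) is not None:
                bold.append(text)
        result.append(("".join(full), bold))
    return result


parsed = paragraphs(SAMPLE)
```

Here the bold run is treated as the clue that a value follows a label; any other formatting or paragraph property could be tested the same way, without automating Word at all.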

I have suddenly discovered that I have a similar business need to the
original post. It's time I boned up on the Word SDK / WPML.

- Richard

John Nurick wrote:
I don't quite agree. I get the impression that the docx format will make
parsing of unstructured documents easier, if only by making it easier to
bring a heavy-duty regex engine to bear. That said, "easier" may just
mean the difference between impossible and not-quite-so-impossible<g>.

On Wed, 29 Jun 2005 19:44:06 +0100, "Richard P" wrote:
I reckon it depends on your word documents. If they are highly structured
then xml could help. If they are basically unstructured then xml will not
help.

The RSS schema is a useful example. I sometimes create RSS files in Word.
RSS files can be fairly weakly structured if long passages of text are
embedded between <Description></Description> tags. XML is still useful for
me because the Description tag corresponds one-to-one with a column in my
database.

Assuming your documents pass the structure test, the key thing is whether
you can control the document creation process. If you can get the authors to
create their documents in xml, parsing it is much easier and robust than
parsing regular text. You can use XML schema to enforce validity and
well-formedness; you can use types from the Xml namespace in the framework
class library; and you can use xslt to transform from one format to another.

An article at http://news.com.com/2100-1012-991694.html?tag=fd_top states:
"XML [in Office 2003] would allow easier interchange of data generated in
Office documents with back-end systems or existing Web services."

As part of an Access 2000 application, I have to continually parse Word
documents and store the parsings in Access tables using Automation to
control Word and parse the document. Is there a way that XML would help
with that?
 

Hi Chuck,

Actually, you don't need a DTD, you need a schema...

Cindy Meister
INTER-Solutions, Switzerland
http://homepage.swissonline.ch/cindymeister (last update Jun 8 2004)
http://www.word.mvps.org

This reply is posted in the newsgroup; please post any follow-up question or
reply in the newsgroup and not by e-mail :-)
 