Importing huge text files

  • Thread starter Thread starter Helge V. Larsen
  • Start date Start date
H

Helge V. Larsen

I have several huge semicolon delimited text files that I import to tables,
one table per file. The files are of size 200 - 300 kB and have more than
1,000,000 lines each. My problem is that it takes too much time to do the
import. I am importing by using DoCmd.TransferText. (I am aware of my
potential problem of max. database size of 2 GB.)

The tables have primary keys, have indexes, and have relations to other
tables. I have seen that things are speeded up considerably if I remove
keys, indexes and relations - but I do not like to do this.

I have read somewhere that things would be faster if I used transactions.
BeginTrans before importing and CommitTrans after importing (because then
indexing etc. is not done per record, but when everything is imported). But
BeginTrans/CommitTrans does not function in connection with
DoCmd.TransferText - as far as I can see.

How can I speed things up? Right now it takes hours to import data!
 
This is a suggestion:

Try importing each file into a new (empty) table with no keys,
relationships, or indexes. Then run an Append query to copy the
records from the new table to your existing table. I don't know
if this will speed things up, but I'm guessing that query
execution might be faster.

Regards
Geoff




"Helge V. Larsen"
 
I had already tried this. But the time to append data is similar to the time
needed to import to a table with keys, relations and relations.
 
If what you've read is right (that using transactions will speed
things up), then here's some sample code you could adapt. It's
one solution of a number of possibilities you could go for. This
solution still runs an append query but within a transaction.
I'd be interested to know if it actually does speed things up. I
don't think you can run a TransferText within a transaction.

I've assumed:

1. that you've created a linked table, called "NewData", to your
semicolon delimited text file;
2. that the text file has field names in the first row; and
3. that the table into which the data is to be imported is called
"OldData".


Copy this code to a new module:


Option Compare Database
Option Explicit

' This module needs a reference to
' Microsoft DAO 3.6. To set the reference:
' 1. In VBA editor, open Tools | References.
' 2. Select Microsoft DAO 3.6.

Private mobjDBE As DAO.DBEngine
Private mobjWS As DAO.Workspace
Private mobjDB As DAO.Database

Private RetVal As VbMsgBoxResult


Private Sub AddData()

On Error GoTo Error_AddData

RetVal = MsgBox("Start?", vbYesNo + vbQuestion)
If RetVal <> vbYes Then
GoTo Exit_AddData
End If

Call InitializeVariables
mobjWS.BeginTrans
With mobjDB
.Execute "INSERT INTO OldData " _
& "SELECT NewData.* " _
& "FROM NewData;", dbFailOnError
End With
' If no error occurred, commit tranaction:
mobjWS.CommitTrans
MsgBox "Data Appended.", vbInformation

Exit_AddData:

Call DestroyObjects
Exit Sub

Error_AddData:

MsgBox "An error occurred while appending data. " _
& "Sort out problems and repeat transaction.", _
vbOKOnly + vbExclamation, "Error"
mobjWS.Rollback
Resume Exit_AddData

End Sub

Private Sub InitializeVariables()

Set mobjDBE = DAO.DBEngine
Set mobjWS = mobjDBE.Workspaces(0)
Set mobjDB = CurrentDb()

End Sub

Private Sub DestroyObjects()

Set mobjDB = Nothing
Set mobjWS = Nothing
Set mobjDBE = Nothing

End Sub


Regards
Geoff




"Helge V. Larsen"
 
Sorry - I should have written:

I have attached an Excel file with the results of my timing experiments. I
have shown the relations involved.

(The posting was too big for the server when I tried to include JPG files.)
 
I had a hunch that transactions wouldn't speed things up much;
after all, the purpose of transactions is to group changes to
data so that a group of transactions are either all performed and
completed together as one unit or are all discarded if an error
occurs. (Apparently, transactions may save some time on disc
access.) However, it's probably just as well you've confirmed
the timings in your case.

I've found some code that will do what you want (and more). It
needs some modification to remove the functionality you don't
need and it's proving a little stubborn! Well, not really; it's
just that I think it would be better as a series of class
modules, which I'm in the process of creating.

In the meantime, I have a few questions.

Are you sure that temporarily removing the index on the primary
key, removing relationships and removing all other indexes on a
table won't cause problems? I realise that the possibility of
problems depends on the data you will be importing. But I'm
thinking along these lines:

1. Say the data you are importing contains a primary key that
duplicates an existing primary key. Your method won't pick this
up during the import; you'll find this out when you reapply the
primary key index, by which time your data could be in a very
nasty mess.

2. Say the data you are importing breaks referential
integrity; that is to say, data you are importing doesn't have
required relationships with other tables. Again, your method
won't pick this up during the import, but when you reapply the
relationships.

Are you happy that these sort of problems won't occur? If not,
wouldn't you be better off letting Microsoft Access do these
checks as the data is imported?

I think you have probably considered these, and similar, issues
already, but I'm just confirming with you that you have.

As far as the primary-key field is concerned, am I right in
thinking that the primary-key field already exists in the table
into which you are importing data? Will the new data you are
importing have that primary-key field already populated with
data? Or is the primary-key field in the table an
auto-incrementing long-integer field?

By the way, in your first post you said that the complete set of
import operations were taking hours; whereas, the tests you did
on one table took a maximum of about 45 seconds. Are you
repeating the same operation many times? If not, 45 seconds
doesn't seem a bad price to pay if - and I emphasise if - the
time you're attempting to save might mess up your data. How many
import operations are you doing?

I await your reply on the above issues. In the meantime, I'll
carry on adapting the code I've found.

Regards
Geoff




"Helge V. Larsen"
 
This is very odd!

I'm replying to you in the newsgroup:
office.developer.vba

And I see you've apparently cross-posted in:
access.modulesdaovba

where my posts are also turning up without me knowing it, and
where the posts of others are not turning up, so I don't know the
full extent of the conversation. I don't see the point of
replying to a post if others have already answered it!

Regards
Geoff




"Helge V. Larsen"
 
Geoff, what newsreader are you using? Are you using a Web
interface instead of a newsreader? Normally your newsreader
will show which groups you are posting to. The received message
here on the MS news server shows that you posted to

microsoft.public.access.modulesdaovba,
microsoft.public.access.modulesdaovba.ado,
microsoft.public.office.developer.vba

Sometimes people cut off some of the newsgroups before posting,
so that their replies are only seen in some of the groups. It may
be that someone in devoloper.vba has decided not to include the
Access groups in their reply? Or it may be that some of the
messages have just not propagated all the way through the network.

One reason why people sometimes cut groups out of their
replies is because their newsreader has warned them that it
does not recognise the group name. This does not matter!
As long as the group name is included with the message, as
you did, it will be included in the correct groups when it reaches
the MS server.


However, in this case, ALL of the messages on the MS server
are the same in developer.vba and in modulesdaovba. Which
news server are you connecting to? Perhaps the Web interface
does not show the details of the message you are posting?

(david)
 
David, Many thanks for your reply.
what newsreader are you using?

Outlook Express 6.00.2900.5512 (xpsp.080413-2105)

As far as this thread is concerned ("Importing Huge Text Files"),
I have only been posting to:

office.developer.vba

I am not subscribed to:
microsoft.public.access.modulesdaovba.ado,

so was surprised by your reply to find that my posts were turning
up there (as well as in the other two groups).
It may be that someone in devoloper.vba has
decided not to include the Access groups in
their reply? Or it may be that some of the
messages have just not propagated all the
way through the network.

I'm not sure I understand. Let me tell you what I see in the
three groups on the thread "Importing Huge Text Files":

In:
office.developer.vba (where I've been posting):

I see 9 posts from:

Helge V. Larson (5/11/2008 09:30) (Thread starts)
... (other posts here - see other newsgroups below)
GeoffG (6/11/2008 08:58)
Helge V. Larson (6/11/2008 10:38)
GeoffG (6/11/2008 23:45)
Helge V. Larson (7/11/2008 13:17)
Helge V. Larson (7/11/2008 13:20)
GeoffG (10/11/2008 09:08)
GeoffG (11/11/2008 01:26)
David (11/11/2008 01:52)

In:
access.modulesdaovba (where my posts have been copied to):

I see 13 posts. All of the above posts and (where the ellipsis
appears above) I also see these:

Helge V. Larson (5/11/2008 09:30) (Thread starts)
Helge V. Larson (5/11/2008 09:34)
John W. Vinson (5/11/2008 21:19)
Helge V. Larson (6/11/2008 10:41)
John W. Vinson (6/11/2008 21:14)

In:
access.modulesdaovba.ado

I see the same 13 posts as in
access.modulesdaovba

To pick up the first point you make (above), it doesn't seem the
Helge V. Larson decided not to include the Access groups, as all
his/her messages appear in all three groups. If it is the case
that some message from Helge and John didn't propagate to:
office.developer.vba
why did this happen?
However, in this case, ALL of the messages
on the MS server are the same in developer.vba
and in modulesdaovba. Which news server are
you connecting to? Perhaps the Web interface
does not show the details of the message you
are posting?

I am connecting to msnews.microsoft.com. I am not sure what you
mean by "Web interface" but let me say this: I noticed that
another conversation I was contributing to had been replied to by
other people (because the starter of the thread thanked people
whose contributions I couldn't see). When I went to:

http://www.microsoft.com/communitie...9ea-9578-4a84-8098-fbf563e6454d&lang=en&cr=us

and other such listings on www.microsoft.com, I could see the
other postings from other people in the thread I'd been
contributing to.

This prompted me to start the following conversation:

In:
outlookexpress.general
Subject:
OE Not Downloading All Posts in Thread
Date:
2/11/2008 23:12

I've not yet fully understood all the replies to that post. I
was hoping to glean an understanding of what, if anything, I
could do in Outlook Express to ensure that I see all
contributions to a conversation. At present, it seems that I
can't. Worse, it seems that, depending on how some MVP's
operate, I may or may not see all posts to a conversation.

Although I'd like to share what little expertise I have with
others who may be struggling, I just can't see the point of
wasting my time if a question has been answered already. It
seems to me that, either the management of the Newsgroups is the
problem, or Outlook Express is the problem, but whatever is the
problem, I need some solution that makes contributing to
newsgroups a worthwhile experience.

I'd be glad to hear your views or the views of anyone else.

Regards
Geoff
 
Please read this post together with my post that immediately
precedes it.

In my previous post, I said:
If it is the case that some message from Helge and John didn't
propagate to:
office.developer.vba
why did this happen?

I've thought about the above further. You said:

Two points:

1. Four of Helge and John's posts (indicated in my previous
post) did not download into Outlook Express.

2. Those same four posts are not listed on:

http://www.microsoft.com/communitie...9ea-9578-4a84-8098-fbf563e6454d&lang=en&cr=us

Now, you said that ALL messages on the MS server are the same in
two of the newsgroups; especially the newsgroup I'm contributing
to, office.developer.vba. But the above web page, which is on
www.microsoft.com, and which lists office.developer.vba, doesn't
show the four posts that are missing from Outlook Express.

How can this be? How can the MS server tell you that all posts
are present in two of the newsgroups (or did you mean all
three?), yet Microsoft's own web page above not show all of them?
Does this explain why Outlook Express is not downloading them?

As before, whatever the explanation, this is a nuisance.

Glad to hear views/solution.
Regards
Geoff
 
Ok, I see what you are talking about, and since it was the fault of the
original poster, I agree with your response.
I have only been posting to: office.developer.vba
I am not subscribed to:

Firstly, it doesn't matter what you are subscribed to. Your subscription
controls what messages you get, not what groups you send to. When
you send a message, you can send it to any group that the server hosts
or is linked to. The message protocol doesn't keep track of which groups
you are subscribed to, or which group you were looking at when you
replied.

Secondly, the message identification string uniquely identifies the message
you are replying to. The message service will only have one copy of that
message, and your reply will be linked to that message. Different message
servers can handle this in different ways.

Normally, you should expect that you are posting to all of the groups that
are listed on the "Newsgroups" line of the message you are posting, even
if you've never heard of them and are unable to subscribe to them.

Conversely, if you cut groups out of that line, you should expect that your
message will not go to the groups you cut out.

However, sometimes you will find that some provider is using a different
system. For example, in Google you used to find unrelated messages
linked together because they had the same subject line.

The protocol used to link servers together is just as basic as the protocol
you use to connect to your mail server, and doesn't even allow the
subscribing server to control which groups are linked. Also, it is a
store-and-forward protocol, so it may take some time for a message
to propagate around the world. I 'm sorry that I didn't wait longer
before posting.


(david)
 
David, Very many thanks for your explanation.

I have to confess that I do see that the original poster's
message was sent to all three newsgroups. (I'll pay more
attention to that in future.)

I can well understand the point you make that, if a responder
chooses to delete one (or more) of the automatically-listed
newsgroups from his/her response, then it's evident the system
will send it as requested. Nevertheless, the result is (as you
know) confusion all round. Many people like me, probably
including many original posters, won't realise that this is
what's going on and will end up wasting time.

You didn't touch on this and I'd be glad to hear your views: I
don't understand the rationale of someone wanting to delete a
newsgroup from their response; especially, an MVP, who seemingly
would be more aware of the potential confusion that could arise
in those, like me, who are less informed of the system's
workings. If you, or anyone else, knows of any good reasons why
someone might want to remove newsgroups from their responses, I'd
be very glad to hear.

Couldn't Microsoft make it a policy for MVPs to reply to all the
newsgroups the originator posted to (with perhaps an appropriate
recommendation not to cross-post in future)? I say this because
I assume MVPs are more likely to have the right answers and,
therefore, those answers are more likely to inhibit inappropriate
responses from others. (Just a thought.)

In any event, very many thanks for pointing out how the system
works; I'll be more wary in future.

For Helge's attention: Helge, if this exchange has inhibited
your contribution, please don't let it! If you still need help,
post back. I'm happy to adjust the code sample I have if you
need it. (But you may want to pay some attention to the downside
I mentioned.)

Regards
Geoff
 
Normally MVP's only trim group lists to remove clearly
inapropriate groups, and then only when responding to
someone who didn't know any better, so in effect only
when responding to questions that have obvious and
well-known answers, which will get asked again by
somebody else in a couple of days anyway....

In this case, the problem wasn't that an MVP had trimmed
the list. The original poster posted replies using different lists.
I didn't see all the postings when I responded to you: that was
a timing issue.

John Vinson MVP responded to one of a sequence of messages,
and unfortunately the message he responded to did not include
microsoft.public.office.developer.vba, but that was an error in
the posting he was responding to, not a deliberate choice by him.

The problem was that the original poster was multi-posting in
Microsoft.public.Access
and in
Microsoft.public.office.developer.vba

You were correct to (politely) point out the problems this causes.

The original poster was also cross-posting both messages to
microsoft.public.access.modulesdaovba.ado

This was unnecessary -- there wasn't any ADO in his question,
but it doesn't cause problems like multi-posting does.

Normally when someone multi-posts, each message is new.

This particular example is different, because all of the multiple
messages posted in different groups appear to be replies or
copies of his first post. It is not clear how this happened: it is
possible that there was a glitch in the way the original message
was handled when he tried to change the group list.

(david)
 
David, Very many thanks for your further thoughts.

After my last post, it occurred to me (during yesterday) that I
had assumed (incorrectly) that John had trimmed the newsgroups
that he'd responded to and that I should have checked each post
to see what exactly had happened. In the meantime, you have
kindly discovered the nature of the original poster's postings.
This particular example is different, because all of the
multiple messages posted in different groups appear to be
replies or copies of his first post. It is not clear how this
happened: it is possible that there was a glitch in the way the
original message was handled when he tried to change the group
list.

Perhaps the original poster's second post replied to their first
post and then, for reasons that remain unknown, the original
poster decided to change the newsgroup listing. It's water under
the bridge and I'm straying into the realms of "Who knows?" and
"Who cares?".

Very many thanks for your comprehensive look into this issue. As
a consequence, you've clarified (excellently) for me the way
newsgroups work. I'm not sure I'll reply to posters who've
cross-posted in future simply because I'll probably not be
bothered to track down their cross-postings to see if the issue
has been answered. I can appreciate why MVP replies are often
succinct until the original poster asks for more. In the
meantime, I'll just have to learn to forgive posters for not
knowing how the system works; I didn't! And, if Microsoft can
ask MVPs to not trim newsgroup listings (not that that happened
in this case) then that'd be a plus to those of us who'd like to
see the whole conversation.

By the way, the originator also cross-posted to:
public.access
which I hadn't noticed before.

When I mark the conversation as read in one of the newsgroups,
all the cross-posted newsgroups also become marked as read.
There's obviously some kind of link between newsgroups. (Muse...
Wouldn't it be great if that could be flagged or hyperlinked in
some way?...)

Very many thanks again.
Regards.
Geoff
 
Back
Top