Shrink Express DB After Large Data Import

Hello, I have a small VB.Net application that I've built in VS2005 that
indexes and helps me to work with various collections of files on my network.
It works great at first, especially with small collections, but after a time,
or immediately after import of a large collection, the application becomes
unresponsive due to the size of the database and the log.

In regard to the log, I really don't need it -- I just want to slurp in my
data and run SQL queries against it... the database would never need to be
recovered.

I need either VB.Net code that can perform maintenance functions on the
database, or else a way to configure the database to not store logs, or
both...

Or some other solution? Granted, this may not be the best application of SQL
Server, but I need the experience with it, and it's giving me some good
insight into the types of problems and limits that these databases encounter.

Any help is greatly appreciated!

--Jon
--(CoastalData)
 

Have you installed SQL Server Management Studio Express? Right-click on your
database, choose Properties, then Options, and set the database recovery model
to Simple and Auto Shrink to True.
ThoraD
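
For anyone who wants to do the same thing from code rather than from Management
Studio, a minimal VB.Net sketch might look like this; the connection string and
the database name "FileIndex" are placeholders, and the two ALTER DATABASE
statements are just the T-SQL equivalents of the Properties/Options settings above.

' Minimal sketch: set the recovery model to Simple and turn on auto-shrink.
' "FileIndex" and the connection string are placeholders for your own database.
Imports System.Data.SqlClient

Module RecoveryModelSetup
    Sub SetSimpleRecovery()
        Using con As New SqlConnection("Data Source=.\SQLEXPRESS;Initial Catalog=master;Integrated Security=True")
            con.Open()
            Using cmd As New SqlCommand( _
                "ALTER DATABASE [FileIndex] SET RECOVERY SIMPLE; " & _
                "ALTER DATABASE [FileIndex] SET AUTO_SHRINK ON;", con)
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub
End Module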
 
Well what do you know? I'd set this already, but apparently sometimes either the
program crashed during development or I simply stopped debugging, and the server
was not able to perform those shrink tasks.

Thanks!
 
Ok, well, I just tried it again, though, and here's what I ended up with
after a trial run and clean shutdown:

Imported 315,406 records at approx 26 recs/sec,
MDF File size: 87.1 MB
Log File size: 344 MB!!!!!

Now the next time the app opens it will take a long time to initialize the
data (using bindings) and may time out, even though I've extended the "Connect
Timeout" value for the connection.

I've read a couple of articles that talk about whacking the log files, and a
couple more articles about why you shouldn't do that, so what *should* I
do?

Said another way, "What is the RIGHT way to manually initiate maintenance
tasks on a database that might never ever be connected to Management Studio
from within VB.Net?"

Can I copy some of the sp_ stored procedures over from the system database and
execute them from within the database needing work, or do I need a second database?

Any thoughts, advice and/or code greatly appreciated!!!

--Jon
 

I'm wondering why you don't ask this question at
microsoft.public.sqlserver.server? The SQL Server expert ratio is probably much
higher there than here.

Good luck with your project,

Otis Mukinfus
http://www.arltex.com
http://www.tomchilders.com
 
When you were talking about 'large collections' I assumed you were talking
about tens or even hundreds of millions of rows.

I would suggest that you've got other serious issues that are not related to
the resulting size of either the .mdf or .ldf files.

When your database is 'clean', and by that I mean it has just been created, all
the tables and other database objects have been created, and no data has been
inserted, how big are the .mdf and .ldf files respectively? In other words,
what sizes are specified at database creation and what are the expansion rules?
If the sizes are small then, during the 'load', a significant amount of time
will be spent increasing the file sizes as they become full.

Even with that taken into account, I'm surprised that you can only load
about 26 rows per second. I would expect somewhere in excess of 100 rows per
second and I would be looking for at least 1000 rows per second. Maybe you
need to look at your 'load' logic for bottlenecks and/or superfluous logic.

Whether or not you want a transaction log file, and regardless of what
recovery model you are using, the transaction log file MUST be available or
SQL Server will not work. That is immutable.

When your app accesses the data, there are a number of distinct steps.

The first step is to connect to the database. If this step takes longer than
the default connection timeout (15 seconds for SqlConnection) then you have
major problems: the machine that SQL Server is running on is not accessible,
SQL Server is not running at all, the database doesn't exist,
etc.

The next step is to execute some SQL statement or other against the database.
The default timeout for this is 30 seconds and has absolutely nothing to do
with the connection timeout. If you need to adjust the timeout at this point
then you need to be tweaking the command timeout.
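
To make the two timeouts concrete, here is a minimal VB.Net sketch of where each
one is set; the connection string and the query against tblPaths are placeholders.

' Connect Timeout in the connection string only governs opening the connection.
' CommandTimeout on the SqlCommand (default 30 seconds) governs how long the query may run.
Imports System.Data.SqlClient

Module TimeoutDemo
    Sub RunLongQuery()
        Dim connStr As String = "Data Source=.\SQLEXPRESS;Initial Catalog=FileIndex;" & _
                                "Integrated Security=True;Connect Timeout=30"
        Using con As New SqlConnection(connStr)
            con.Open()
            Using cmd As New SqlCommand("SELECT COUNT(*) FROM tblPaths", con)
                cmd.CommandTimeout = 300    ' give a long-running query five minutes
                Dim rowCount As Integer = CInt(cmd.ExecuteScalar())
            End Using
        End Using
    End Sub
End Module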

Now we come to the SQL statement you are executing. How many rows do you
expect to be retrieving? The more you retrieve the longer it will take; it
is as simple as that. Also, how complex is the statement? The more complex
it is the longer it will take for SQL Server to execute it.

If you could give some idea of what your schema looks like and what the SQL
Statement is then perhaps we can advise you how to streamline it.

The bit I am most concerned about is 'whacking the log files'. To me
'whacking the log files' would mean deleting them and anybody who recommends
this is talking through his fundamental orifice.

If, as you say, once it is loaded you will be treating the database as if it
were read-only, then you can reduce the size of the .mdf and .ldf files by
'shrinking' them. To do this, execute the following T-SQL statements in
sequence:

use master
backup database <database> to disk='<path>\<database>.bak'
    with description='Full backup of <database>', init, skip
backup log <database> with truncate_only
use <database>
dbcc shrinkfile(<database>, 50)
dbcc shrinkfile(<log>, 10)
use master
backup database <database> to disk='<path>\<database>.bak'
    with description='Full backup of <database>', init, skip

Substitute the appropriate values for <database>, <path> and <log>.

If you want to you can delete the resultant .bak file once the process is
finished.

You may find that the .mdf shrinks below 87MB but not as low as 50MB.
Whatever it does shrink to is the minimum required to hold the data.

You may find that the .ldf shrinks significantly but not as low as 10MB.
This is to do with how SQL Server deals with transaction logs. Check out the
SQL Server Books Online (BOL) if you want to know more about this.

Repeating the process immediately may result in the .ldf being shrunk even
more.

During normal use, even if you are only executing 'select' queries, the
size of the files will increase by a small amount. The 'shrink' process can
be run periodically to counter this, even when the database is 'in use'.
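
If you want to run that sequence from the VB.Net application itself rather than
from a query window, a minimal sketch might look like the following. It assumes
a database named "FileIndex" whose logical file names are "FileIndex" and
"FileIndex_log" (check sys.database_files for the real logical names) and a
placeholder backup path; substitute your own values.

' Minimal sketch of running the backup/shrink sequence from code.
Imports System.Data.SqlClient

Module ShrinkDatabase
    Sub ShrinkFiles()
        Using con As New SqlConnection("Data Source=.\SQLEXPRESS;Initial Catalog=FileIndex;Integrated Security=True")
            con.Open()
            Using cmd As New SqlCommand()
                cmd.Connection = con
                cmd.CommandTimeout = 600    ' backups and shrinks can take a while

                cmd.CommandText = "BACKUP DATABASE [FileIndex] TO DISK='C:\Backups\FileIndex.bak' " & _
                                  "WITH DESCRIPTION='Full backup of FileIndex', INIT, SKIP"
                cmd.ExecuteNonQuery()

                cmd.CommandText = "BACKUP LOG [FileIndex] WITH TRUNCATE_ONLY"
                cmd.ExecuteNonQuery()

                cmd.CommandText = "DBCC SHRINKFILE (FileIndex, 50)"
                cmd.ExecuteNonQuery()

                cmd.CommandText = "DBCC SHRINKFILE (FileIndex_log, 10)"
                cmd.ExecuteNonQuery()

                ' The sequence above finishes with another full backup;
                ' repeat the first BACKUP DATABASE command here if you want that as well.
            End Using
        End Using
    End Sub
End Module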
 
Excellent answer, thank you! Right off the top you confirmed a couple of
things, those being that a.) my performance was too low, and b.) this isn't
really a "lot" of records. (Oh yeah, and c.) that whacking the log is the
Wrong Thing.)

I had only just perfected my timings and discovered that my interface was
holding my performance back; with minor mods it jumped to around 80 rows per
second.

I also changed my database file growth to 10 MB increments... sounds
reasonable. Initial file size, after a shrink, is 1.43 MB for the MDF and 504
KB for the log.

The database schema is simple, with two tables -- collections (drives)
and files -- in a simple one-to-many relationship.

The import process just reads in a file created by a batch file with
"dir /s/b >allfiles.txt" and stuffs one line at a time into the database
using a method, Adapter.InsertPath, created in the XSD schema.

I certainly believe that a bulk import function would be able to achieve
better performance, but by running loops the interface stays responsive and I
can give the user (me) feedback about the progress.

In fact, I've got one sort of worked out that uses a query like this:

Dim queryString As String = "INSERT into tblPaths (Path, IID) " & vbCrLf & _
    "SELECT Path, " & CType(myIndex("IID"), Integer) & " AS IID " & vbCrLf & _
    "from OPENROWSET (BULK '" & strPath & "', FORMATFILE = '" & _
    strCurDir & "\BulkImport.xml') as Path;"

And the referenced XML file looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<BCPFORMAT
    xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="355"
           COLLATION="SQL_Latin1_General_CP1_CI_AS" />
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="Path" xsi:type="SQLVARYCHAR" />
  </ROW>
</BCPFORMAT>

I also need to compare the performance of just enumerating the filesystem to
create the index, too, but I can definitely say I've never found anything that
could list the files faster than the DOS dir /s/b command.
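
For comparison, a minimal sketch of enumerating the filesystem directly from
.NET 2.0 without shelling out to dir; the root path is whatever drive or folder
is being indexed.

' Returns every file under the root, much like "dir /s/b", but builds the
' whole list in memory before returning.
Imports System.IO

Module FileEnumerator
    Function ListFiles(ByVal root As String) As String()
        Return Directory.GetFiles(root, "*", SearchOption.AllDirectories)
    End Function
End Module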

--Jon
 
Curiouser and curiouser ....

I just did some tests of pumping several hundred thousand rows into a database.

I did a dir /s/b > C:\test.txt. The result was about 147,000 rows and so I
replicated the content twice more.

This is the table schema:

create table path
(
    path varchar(355) not null,
    iid int not null
)
go

For the first test I created the database with the defaults (3MB + 1 MB and
1MB + 10% for the initial + growth factors for the mdf and ldf files
respectively).

For the iid I assigned 1. The results were:

443538 rows
Elapsed time = 552 secs
rows/sec = 803.510869565217
File size: 54MB and 1MB respectively

For the second test I created the database with 60MB + 1 MB and 1MB + 10%
for the initial + growth factors for the mdf and ldf files respectively.

The results were:

443538 rows
Elapsed time = 579.234575 secs
rows/sec = 765.731488225297
File size: 60MB and 1MB respectively

This surprised me because I expected it to be quicker because the .mdf was
already sized above the maximum needed. Why it was significantly slower
escapes me.

For the third test I created the database with the defaults (3MB + 1 MB and
1MB + 10% for the initial + growth factors for the mdf and ldf files
respectively).

This time I opened and closed the connection for each insert. The results
were:

443538 rows
Elapsed time = 706.40625 secs
rows/sec = 627.879495686795
File size: 54MB and 1MB respectively

As expected, significantly slower but still handling in excess of 600 rows
per second.

For the fourth test I added a primary key on (path, iid).

The results were:

443538 rows
Elapsed time = 756.671875 secs
rows/sec = 586.169533524687

Again a little slower but still acceptable.

All these tests were done over a network so network latency is a factor.

To read the input file only, the results were:

443538 rows
Elapsed time = 0.65625 secs
rows/sec = 675867.428571429

so reading the data is an insignificant overhead.

I suspect that, to be getting the results you report, there must be other
code in your load process that you haven't told us about.

It might be an idea if you posted the pertinent code section.
 
Another excellent reply! This is exactly the type of useful information that
I need to know about SQL Server!

I'm going to try and cook my code a little differently (and check my timing
code -- could I really be that far off???)
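
For reference, a minimal timing sketch using System.Diagnostics.Stopwatch, the
same approach used in the tests above; InsertOneRow is a hypothetical stand-in
for the real per-row insert call.

Imports System.Diagnostics

Module LoadTimer
    Sub TimeLoad(ByVal lines As String())
        Dim sw As New Stopwatch()
        sw.Start()
        For Each line As String In lines
            InsertOneRow(line)    ' hypothetical per-row insert
        Next
        sw.Stop()
        Console.WriteLine("{0} rows in {1:F1} s = {2:F0} rows/sec", _
            lines.Length, sw.Elapsed.TotalSeconds, lines.Length / sw.Elapsed.TotalSeconds)
    End Sub

    Sub InsertOneRow(ByVal path As String)
        ' placeholder for the actual Adapter.InsertPath call
    End Sub
End Module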

One quick question, though: you mentioned network latency; I presume you're
talking about the connection to your database. In my application, however,
I've got a VB.Net project with the MDF file stored locally in the project.

On the one hand, I see that this eliminates network latency, but on the
other hand I'm wondering if there are other performance factors which affect
this type of usage of SQL Server (i.e., local MDF vs. "served" database, even
though I know that VB.Net automatically attaches and then serves anyhow).

Any thoughts on that?

--Jon
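
For reference, the two connection styles being compared look roughly like this;
the paths and database names are placeholders.

' A local .mdf in the project, attached on demand to a user instance
' (what the VS2005 data wizard generates), versus a database that is
' permanently attached to the SQL Express service.
Imports System.Data.SqlClient

Module ConnectionStyles
    Public ReadOnly LocalMdf As String = _
        "Data Source=.\SQLEXPRESS;AttachDbFilename=|DataDirectory|\FileIndex.mdf;" & _
        "Integrated Security=True;User Instance=True"

    Public ReadOnly ServerAttached As String = _
        "Data Source=.\SQLEXPRESS;Initial Catalog=FileIndex;Integrated Security=True"
End Module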
 
Hi Jon,

Thank you for posting here!

This is a quick note to let you know that I am performing research on this
issue and will get back to you ASAP.

I appreciate your patience!


Sincerely,
Linda Liu
Microsoft Online Community Support

 
I just did some more tests of pumping several hundred thousand rows into a
database.

I did dir /s/b > C:\filelist.txt and replicated it to 509905 rows, then made
the table as below:

create table filelist
(
    [file] varchar(896) not null,
    iid int not null
)
go

(The 896 is because the largest allowed primary key is 900 bytes.)

For the iid I assigned incrementing numbers. The results were:

509905 rows
Elapsed time = 12.8324091 secs
rows/sec = 39735.718837081

I added the same pk (path, iid) and got the below numbers

509905 rows
Elapsed time = 68.0459818 secs
rows/sec = 7493.5357902353


No super hardware involved, just the cool new bulk copy feature of .NET 2.0.
Code:

// Requires the System.Data, System.Data.SqlClient, System.Diagnostics and System.IO namespaces.
private void button1_Click(object sender, EventArgs e)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();

    // Build an in-memory DataTable with an auto-incrementing IID column.
    DataTable table = new DataTable();
    table.Columns.Add(new DataColumn("File", typeof(string)));
    table.Columns.Add(new DataColumn("IID", typeof(int)));
    table.Columns[1].AutoIncrement = true;
    table.Columns[1].AutoIncrementSeed = 1;
    table.Columns[1].AutoIncrementStep = 1;

    // Load every line of the dir /s/b output into the DataTable.
    StreamReader sr = new StreamReader("c:\\filelist.txt");
    while (!sr.EndOfStream)
    {
        table.Rows.Add(sr.ReadLine());
    }

    sw.Stop();
    Debug.Write(sw.Elapsed.TotalSeconds + " seconds for " + table.Rows.Count + " = " +
        Convert.ToDouble(table.Rows.Count) / sw.Elapsed.TotalSeconds +
        " rows per second loaded to datatable");
    sw.Start();

    // Bulk copy the whole DataTable to the FileList table in one operation,
    // reporting progress every 5000 rows.
    SqlConnection sqlcon = new SqlConnection(
        "data source=lon0371xns;initial catalog=SonarBackup;integrated security=sspi");
    SqlBulkCopy bc = new SqlBulkCopy(sqlcon);
    bc.DestinationTableName = "FileList";
    bc.NotifyAfter = 5000;
    bc.SqlRowsCopied += new SqlRowsCopiedEventHandler(bc_SqlRowsCopied);
    bc.ColumnMappings.Add(new SqlBulkCopyColumnMapping("File", "File"));
    bc.ColumnMappings.Add(new SqlBulkCopyColumnMapping("IID", "IID"));
    sqlcon.Open();
    bc.BulkCopyTimeout = 500;
    bc.WriteToServer(table);

    sw.Stop();

    Debug.Write(sw.Elapsed.TotalSeconds + " seconds for " + table.Rows.Count + " = " +
        Convert.ToDouble(table.Rows.Count) / sw.Elapsed.TotalSeconds +
        " rows per second loaded to db");
}

void bc_SqlRowsCopied(object sender, SqlRowsCopiedEventArgs e)
{
    Debug.Write("Written ");
    Debug.WriteLine(e.RowsCopied.ToString());
}
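
For the original VB2005 project, a rough VB.Net translation of the same approach
might look like this; the connection string, table name and file name are
placeholders.

' Minimal sketch: load the dir /s/b output into a DataTable and push it to the
' server in one SqlBulkCopy operation.
Imports System.Data
Imports System.Data.SqlClient
Imports System.IO

Module BulkLoader
    Sub BulkLoad()
        Dim table As New DataTable()
        table.Columns.Add("File", GetType(String))
        Dim iid As DataColumn = table.Columns.Add("IID", GetType(Integer))
        iid.AutoIncrement = True
        iid.AutoIncrementSeed = 1
        iid.AutoIncrementStep = 1

        Using sr As New StreamReader("C:\filelist.txt")
            While Not sr.EndOfStream
                table.Rows.Add(sr.ReadLine())
            End While
        End Using

        Using con As New SqlConnection("Data Source=.\SQLEXPRESS;Initial Catalog=FileIndex;Integrated Security=True")
            Dim bc As New SqlBulkCopy(con)
            bc.DestinationTableName = "FileList"
            bc.BulkCopyTimeout = 500
            bc.ColumnMappings.Add("File", "File")
            bc.ColumnMappings.Add("IID", "IID")
            con.Open()
            bc.WriteToServer(table)
            bc.Close()
        End Using
    End Sub
End Module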
 
By the way, I think originally you may have been inserting each row in its
own transaction and failing to commit, leading to multiple nested uncommitted
transactions: a) slowing down your inserts, b) bloating your transaction log,
and c) locking the table after you're finished, causing timeouts.
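
A minimal sketch of that idea -- one explicit transaction around the whole batch
of inserts, committed once at the end; the table and column names are placeholders
based on the schema discussed earlier in the thread.

Imports System.Data
Imports System.Data.SqlClient

Module TransactionalInsert
    Sub InsertPaths(ByVal paths As String(), ByVal iid As Integer, ByVal connStr As String)
        Using con As New SqlConnection(connStr)
            con.Open()
            Using tran As SqlTransaction = con.BeginTransaction()
                Using cmd As New SqlCommand("INSERT INTO tblPaths (Path, IID) VALUES (@path, @iid)", con, tran)
                    cmd.Parameters.Add("@path", SqlDbType.VarChar, 355)
                    cmd.Parameters.Add("@iid", SqlDbType.Int).Value = iid
                    For Each p As String In paths
                        cmd.Parameters("@path").Value = p
                        cmd.ExecuteNonQuery()
                    Next
                End Using
                ' One commit at the end instead of one open transaction per row.
                tran.Commit()
            End Using
        End Using
    End Sub
End Module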


--
Ciaran O'Donnell
http://wannabedeveloper.spaces.live.com


 
Nice one Ciaran. I haven't had a play with the SqlBulkCopy object yet but I
certainly will be after that lot :)


 
Sorry, my bad, guess I didn't notice that in there!

That's a serious performance difference, though! If I switched to that, I'm
fairly certain this would no longer be a problem area of my application.

--Jon
 