IO optimization when copying bytes from one file to another

Hi guys. I've written code to embed an ICC profile in a TIFF image, and I
think my IO operations are slowing things down. It is taking about a second
to embed each tag in 7-meg TIFF files. Doesn't sound too bad until you try
doing it to 500 files. Basically what I am doing is this:

1. Read header from image, add 12-byte tag to header, and write to a new
TIFF file
2. Update all the offset pointers in other tags in TIFF header to reflect
the change that adding the 12-bytes made and write them to the new TIFF file
3. Copy the rest of the original TIFF to the new one.
4. Append ICC profile (around 100K) to new TIFF file.
5. Delete original TIFF and rename new one to the name of the original

I believe the problem to be in the method I am using to copy the data from
the original file to the new one. I have pasted it below. Any suggestions
on how to squeeze some more speed out, either in my main algorithm or the
following function? Thanks a bunch!

Josh

private void copyBytes(FileStream source, FileStream destination,
    long fromIndex, long length)
{
    const int chunkSize = 1024;
    long currentIndex = fromIndex;
    long endIndex = fromIndex + length;
    long bytesToCopy = 0;
    byte[] bytes = new byte[chunkSize];
    byte[] endLump;
    byte[] twoBytes = new byte[2];

    source.Seek(fromIndex, SeekOrigin.Begin);
    //Copy a chunk at a time
    for(bytesToCopy = length; bytesToCopy >= chunkSize; bytesToCopy -= chunkSize)
    {
        source.Read(bytes, 0, chunkSize);
        destination.Write(bytes, 0, chunkSize);
        currentIndex += bytes.Length;
    }

    //Copy the rest now
    endLump = new byte[bytesToCopy];
    source.Read(endLump, 0, endLump.Length);
    destination.Write(endLump, 0, endLump.Length);
    destination.Flush();
}
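
For reference, here is a rough sketch of what the 12-byte tag from step 1 could look like, assuming the standard TIFF 6.0 directory entry layout (2-byte tag, 2-byte type, 4-byte count, 4-byte value/offset) and the ICC profile tag number 34675 with field type UNDEFINED (7). The helper name BuildIccEntry and the sample length and offset are made up for illustration, and this is not necessarily how the code in the post builds its entry. It is written as a C# 9+ top-level program and needs System.Buffers.Binary (.NET Core 2.1 or later).

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: builds one 12-byte directory entry for the ICC profile.
// Layout: tag (2 bytes), field type (2), count (4), value/offset (4).
// littleEndian must match the byte order declared in the TIFF header.
static byte[] BuildIccEntry(uint profileLength, uint profileOffset, bool littleEndian)
{
    const ushort IccProfileTag = 34675; // ICC profile tag number
    const ushort TypeUndefined = 7;     // field type UNDEFINED (raw bytes)

    byte[] entry = new byte[12];
    Span<byte> span = entry;
    if (littleEndian)
    {
        BinaryPrimitives.WriteUInt16LittleEndian(span.Slice(0, 2), IccProfileTag);
        BinaryPrimitives.WriteUInt16LittleEndian(span.Slice(2, 2), TypeUndefined);
        BinaryPrimitives.WriteUInt32LittleEndian(span.Slice(4, 4), profileLength);
        BinaryPrimitives.WriteUInt32LittleEndian(span.Slice(8, 4), profileOffset);
    }
    else
    {
        BinaryPrimitives.WriteUInt16BigEndian(span.Slice(0, 2), IccProfileTag);
        BinaryPrimitives.WriteUInt16BigEndian(span.Slice(2, 2), TypeUndefined);
        BinaryPrimitives.WriteUInt32BigEndian(span.Slice(4, 4), profileLength);
        BinaryPrimitives.WriteUInt32BigEndian(span.Slice(8, 4), profileOffset);
    }
    return entry;
}

// Sample values only: a 100,000-byte profile appended at offset 7,000,000.
byte[] demoEntry = BuildIccEntry(100000, 7000000, true);
Console.WriteLine(BitConverter.ToString(demoEntry));
```

The count field holds the profile size in bytes and the last field holds the file offset where the profile was appended in step 4.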
 
You can make your method simpler and more reliable (by using the return
value of Read) quite easily. I've also increased the chunk size to
possibly speed things up a bit.

// Requires: using System; using System.IO;
const int BufferSize = 32768;

void CopyBytes (Stream source, Stream dest, long fromIndex,
                long length)
{
    source.Seek(fromIndex, SeekOrigin.Begin);
    byte[] buffer = new byte[BufferSize];

    while (length > 0)
    {
        // Read may return fewer bytes than requested, so use its result.
        // The cast is needed because Math.Min on a long and an int yields a long.
        int read = source.Read(buffer, 0,
                               (int) Math.Min (length, BufferSize));

        if (read <= 0)
        {
            throw new IOException ("Insufficient data remaining");
        }

        dest.Write (buffer, 0, read);
        length -= read;
    }
}

Assuming you're going to call Close or Dispose on the destination
stream, chances are you don't need to call Flush by the way.
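
A minimal sketch of the point about Flush, written as a C# 9+ top-level program. The helper name WriteWithoutFlush and the temp file are made up for illustration. Disposing a FileStream (here by leaving a using block) writes out whatever is still sitting in its internal buffer before the handle is closed, so a Flush call just before that is redundant.

```csharp
using System;
using System.IO;

// Hypothetical helper: writes payload to path and never calls Flush.
// Leaving the using block disposes the stream, which flushes any
// buffered bytes to the file before closing the handle.
static void WriteWithoutFlush(string path, byte[] payload)
{
    using (FileStream destination = new FileStream(path, FileMode.Create, FileAccess.Write))
    {
        destination.Write(payload, 0, payload.Length);
    } // Dispose runs here
}

string demoPath = Path.GetTempFileName();
WriteWithoutFlush(demoPath, new byte[] { 1, 2, 3, 4, 5 });
Console.WriteLine(new FileInfo(demoPath).Length); // prints 5
File.Delete(demoPath);
```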
 
Thanks, Jon. I appreciate the response. I tried your code, and found it was
about 15% slower than what I tried doing last night, which was to just read
it all in and then write all of it out. I'm guessing it was just because of
the small (7-meg) file size. Here's my code:

private void copyBytes(FileStream source, FileStream destination,
    long fromIndex, long length)
{
    byte[] buffer = new byte[length];
    source.Seek(fromIndex, SeekOrigin.Begin);
    source.Read(buffer, 0, buffer.Length);
    destination.Write(buffer, 0, buffer.Length);
}


 
Did you run mine and then run yours? If so, the second run would have had
the file data already cached. You should either flush all caches before
running either of them, or run both several times.

I'd be surprised if my method was really 15% slower (a little bit, but
not 15%). Of course, I've been known to be surprised before :)
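
One way to follow the "run both several times" advice, as a rough sketch: time the candidates in alternating rounds and throw away the first round, so both see an equally warm file cache. TimeRounds and the two dummy workloads are hypothetical stand-ins; in practice the two copy methods from this thread would go in their place. Note that this does not flush the OS cache, it only keeps the comparison even.

```csharp
using System;
using System.Diagnostics;

// Hypothetical harness: runs every candidate once per round, alternating,
// and returns the total elapsed milliseconds per candidate.
// Round 0 is a warm-up and is not counted.
static double[] TimeRounds(Action[] candidates, int rounds)
{
    double[] totals = new double[candidates.Length];
    for (int round = 0; round <= rounds; round++)
    {
        for (int i = 0; i < candidates.Length; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            candidates[i]();
            watch.Stop();
            if (round > 0)
            {
                totals[i] += watch.Elapsed.TotalMilliseconds;
            }
        }
    }
    return totals;
}

// Stand-in workloads, not real file copies.
Action[] demoCandidates =
{
    () => { byte[] a = new byte[1 << 20]; Array.Clear(a, 0, a.Length); },
    () => { byte[] b = new byte[1 << 10]; Array.Clear(b, 0, b.Length); },
};
double[] demoTotals = TimeRounds(demoCandidates, 5);
Console.WriteLine("A: {0:F2} ms, B: {1:F2} ms", demoTotals[0], demoTotals[1]);
```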
 
Actually, I ran mine a few times last night, and it consistently ran
around 730 ms per image in trials of 30 images each. I changed the code
altogether to yours this morning, ran it and got about 860 ms per image on
several trials of the same 30 images. I changed it back to mine as it was
previously, and again ended up with about 730 ms per image. I'm a real
novice when it comes to this sort of operation, and I've always seen these
things done in pieces like with your method, so I'm surprised the simplistic
way I did it is actually a little faster. Just FYI, the system I am running
this on is an XP SP2 machine with NTFS partitions. Performance aside, is your method
a better way to do this?
Thanks once again,
Josh

 
Yes, in terms of memory consumption. Consider an image which is several
hundred megs in size - with my code, you never need to have more than
32K in memory at a time. With yours, you read the whole thing into
memory, and then write the whole thing out. You're also assuming that
one call to Read will read the whole file, ignoring the return value,
which is never a good idea.
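
If the single big buffer is kept, a sketch of how it could still respect the return value of Read: keep calling Read until the buffer is full. ReadFully is a hypothetical helper name, and the MemoryStream stands in for the TIFF file. On .NET 7 or later, Stream.ReadExactly does the same job.

```csharp
using System;
using System.IO;

// Hypothetical helper: fills buffer[0..count) from the stream, calling Read
// as many times as needed. Throws if the stream ends before count bytes arrive.
static void ReadFully(Stream source, byte[] buffer, int count)
{
    int filled = 0;
    while (filled < count)
    {
        int read = source.Read(buffer, filled, count - filled);
        if (read <= 0)
        {
            throw new EndOfStreamException(
                "Stream ended after " + filled + " of " + count + " bytes");
        }
        filled += read;
    }
}

// Usage with an in-memory stand-in for the source file.
MemoryStream demoSource = new MemoryStream(new byte[] { 10, 20, 30, 40, 50 });
demoSource.Seek(1, SeekOrigin.Begin);   // fromIndex = 1
byte[] demoBuffer = new byte[3];        // length = 3
ReadFully(demoSource, demoBuffer, demoBuffer.Length);
Console.WriteLine(BitConverter.ToString(demoBuffer)); // prints 14-1E-28
```

This keeps the memory cost of the read-everything approach, so it only addresses the ignored return value, not the buffer size.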
 
Yes, I am assuming that it will read it all. Yikes, I didn't realize that it
wouldn't always read it all, unless an exception were thrown. Under what
conditions would it not read all of the file? Performance is a big issue
here, so I want to try to gauge whether or not I need to worry about this.
The code will never need to handle anything but 3-7 meg TIFF files. Thanks
once again, Jon.

Josh
 
Skwerl said:
Yes, I am assuming that it will read it all. Yikes, I didn't realize that it
wouldn't always read it all, unless an exception were thrown. Under what
conditions would it not read all of the file?

I don't know for sure. I would imagine that some network file systems
might give data in chunks, like NetworkStreams do. It could be that
FileStreams will always read however much you ask for - but that's not
true for streams in general. Basically it's good practice not to ignore
the return value of Read :)

Skwerl said:
Performance is a big issue here, so I want to try to gauge whether or not I
need to worry about this. The code will never need to handle anything but
3-7 meg TIFF files.

Why not try increasing the buffer size of the code I gave you to, say,
1MB? That way you won't need to change the code if you ever get a huge
file, and you don't need to worry about whether or not FileStream will
always return the whole of the data. The performance difference should
be trivial at that stage.
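
For step 3 of the original list, where everything from a given index to the end of the file gets copied, here is a sketch assuming .NET Framework 4 or later (newer than the setup described in this thread): Stream.CopyTo runs the same kind of read/write loop internally and takes the buffer size as a parameter. CopyRest is a hypothetical name, and the MemoryStreams stand in for the two files.

```csharp
using System;
using System.IO;

// Hypothetical helper: copies everything from fromIndex to the end of
// source into destination, using a 1 MB buffer.
static void CopyRest(Stream source, Stream destination, long fromIndex)
{
    const int BufferSize = 1024 * 1024;
    source.Seek(fromIndex, SeekOrigin.Begin);
    source.CopyTo(destination, BufferSize);
}

MemoryStream demoIn = new MemoryStream(new byte[] { 1, 2, 3, 4, 5, 6 });
MemoryStream demoOut = new MemoryStream();
CopyRest(demoIn, demoOut, 2);
Console.WriteLine(demoOut.Length); // prints 4
```

CopyTo always runs to the end of the source, so a copy limited to a fixed length still needs a loop like the CopyBytes method earlier in the thread.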
 