FILE I/O

SpookyET

How do you delete some data from a file? The file size should shrink.
How do you insert data at a specific offset without overwriting?
Something like this (inserting "aabc" at offset 1):

before: x y        after: x a a b c y
offset: 0 1               0 1 2 3 4 5
 
SpookyET said:
How do you delete some data from a file? The file size should shrink.

You have to copy the end parts down and shrink it.
How do you insert data at a specific offset without overwriting?

You have to move the end part out, and then write in the gap.


--
Chad Z. Hower (a.k.a. Kudzu) - http://www.hower.org/Kudzu/
"Programming is an art form that fights back"

Get your ASP.NET in gear with IntraWeb!
http://www.atozed.com/IntraWeb/
 
Here is how you insert some data:

Dim fs As New FileStream("c:\News\fileio.txt", FileMode.Open, _
    FileAccess.ReadWrite)

Dim iInsert As Long = fs.Length \ 2   ' just for a beginning location
Dim iLen As Integer = sToWrite.Length
Dim temp(iLen - 1) As Byte
Dim i As Integer = 0                  ' bytes moved in the previous pass
Dim iPos As Long = fs.Length - iLen

' Shift the tail out by iLen bytes, one block at a time, working
' backwards from the end of the file so nothing is overwritten.
Do Until iPos = iInsert
    iPos -= i
    If iPos < iInsert Then iPos = iInsert
    fs.Position = iPos
    i = fs.Read(temp, 0, iLen)        ' Read advances Position by i bytes...
    fs.Write(temp, 0, i)              ' ...so this writes the block iLen later
Loop

fs.Position = iInsert
fs.Write(ASCIIEncoding.ASCII.GetBytes(sToWrite), 0, iLen)
fs.Flush()
fs.Close()



and here is how you remove some data:



Dim fs As New FileStream("c:\News\fileio.txt", FileMode.Open, _
    FileAccess.ReadWrite)

Dim iRemove As Long = fs.Length \ 2   ' just for a beginning location
Dim iLen As Integer = 3               ' number of bytes to remove
Dim temp(iLen - 1) As Byte
Dim i As Integer
Dim iPos As Long = iRemove + iLen

' Shift everything after the removed range down by iLen bytes,
' front to back.
Do While iPos < fs.Length
    fs.Position = iPos
    i = fs.Read(temp, 0, iLen)        ' the last read may be short
    fs.Position = iPos - iLen
    fs.Write(temp, 0, i)
    iPos += i
Loop

fs.SetLength(fs.Length - iLen)        ' truncate: the file size shrinks
fs.Flush()
fs.Close()


--
Eric Marvets
Principal Consultant

the bang project

<shameless self promotion>

Email (e-mail address removed) for Information on Our Architecture and
Mentoring Services

</shameless self promotion>
 
Thanks for the responses. How would you do this in an efficient way? I
have to deal with files from hundreds of mebibytes to tens of gibibytes.
 
The sample I sent you does not read and write in efficient chunks. I would
play with different read and write block sizes until you find the one that
works fastest. You don't want it too big or too small.

Without knowing anything else about what you are doing, it's hard to offer
any other advice. I can suggest you look at what others have done. For
example, garbage collection reclaims space in memory at certain intervals,
doing more work at those intervals but reducing the total amount of work
done overall. Or SQL, for example, writes in pages and does not
immediately reclaim space; instead it leaves the space open so that it can
write to it later and, if need be, split a page.
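To make the block-size point concrete, here is a rough sketch (Python, not from the thread; the function name and the default buffer size are assumptions to tune) of the same delete-by-shifting done through a large reusable buffer instead of a few bytes at a time:

```python
import io

def delete_range(f, offset, length, bufsize=1024 * 1024):
    """Remove `length` bytes at `offset` from an open binary file by
    shifting the tail down through one large reusable buffer, then
    truncating so the file shrinks."""
    buf = bytearray(bufsize)
    read_pos = offset + length          # first byte to keep
    write_pos = offset                  # where it moves to
    f.seek(0, io.SEEK_END)
    end = f.tell()
    while read_pos < end:
        f.seek(read_pos)
        n = f.readinto(memoryview(buf)[: min(bufsize, end - read_pos)])
        f.seek(write_pos)
        f.write(memoryview(buf)[:n])
        read_pos += n
        write_pos += n
    f.truncate(end - length)
```

Fewer, larger reads and writes amortize the per-call overhead; the buffer size is the knob to benchmark.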


 
SpookyET said:
Thanks for the responses. How would you do this in an efficient way. I
have to deal with files from hundreds of mebibytes to tens of gibibytes.
I assume this is for your BT client, in which case you need to either use
sparse files (which you've already suggested won't work) or perhaps a lazy
sorting system.

Considering what I know about BT, a file is split into multiple fixed-size
chunks (zero-extend the last chunk and deal with it explicitly). Because of
this, you can use this algorithm.

1. Get a chunk.
2. If that chunk would be beyond the end of the current file, write it to
   the end; otherwise, locate the place where the chunk belongs. Read the
   chunk in that spot out, write the original chunk, and repeat this step
   with the newly read chunk until you finally find one that belongs at
   the end.
3. Repeat until the file is complete.
4. In a final pass, continue sorting until all chunks are in their
   appropriate slots.

You will need to maintain a map from chunk IDs to slot IDs, but it might
result in considerably less I/O with large files than if you were to copy
the end out every time. There is no guarantee, considering that the worst
case will be as bad in either situation and the best case should be pretty
much as good in either situation. (I haven't tested this, so I don't know
what the average case will be, but it's worth a shot.) If you choose to
implement it, please let me know if it performs and scales well enough for
you.
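The lazy-sort steps above could be sketched roughly like this (Python used only for illustration; the `slot_of` map, the in-memory file object, and the fixed `chunk_size` are assumptions of the sketch):

```python
def store_chunk(f, chunk_id, data, slot_of, chunk_size):
    """Lazy sort: append a chunk whose home slot lies past the current
    end of file; otherwise displace whatever occupies its home slot and
    re-place that chunk, repeating until one belongs at the end.
    slot_of maps chunk id -> slot index currently holding it."""
    f.seek(0, 2)                          # seek to end of file
    n_slots = f.tell() // chunk_size      # slots currently allocated
    while True:
        if chunk_id >= n_slots:           # home lies beyond the end:
            f.seek(n_slots * chunk_size)  # just append for now
            f.write(data)
            slot_of[chunk_id] = n_slots
            return
        # read out whatever currently occupies this chunk's home slot
        f.seek(chunk_id * chunk_size)
        displaced = f.read(chunk_size)
        displaced_id = next(c for c, s in slot_of.items() if s == chunk_id)
        del slot_of[displaced_id]
        # write the chunk into its home slot...
        f.seek(chunk_id * chunk_size)
        f.write(data)
        slot_of[chunk_id] = chunk_id
        # ...then re-place the displaced chunk on the next pass
        chunk_id, data = displaced_id, displaced
```

Chunks parked past their home slot are what the final sorting pass would clean up.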
 
Thanks! The problem with BitTorrent is that a piece can go over file
boundaries. Some people care more about disk space usage than file
fragmentation, so I have to implement this. Sorting is horrible if you
have to go through all the files to sort pieces.

After I mentioned your algorithm, I got this in return:

[16:12] (TheSHAD0W): Heh.
[16:12] (TheSHAD0W): No.
[16:12] (TheSHAD0W): That'd work, but you'd wind up with spikes in usage.
[16:12] (TheSHAD0W): Here's how it's done, and Bram devised it...
[16:12] (TheSHAD0W): You get data for a new chunk.
[16:12] (TheSHAD0W): Allocate space on the end.
[16:12] (TheSHAD0W): If that space allocated is that chunk, write it
there. Done.
[16:13] (TheSHAD0W): If there already exists data for that piece in
previously allocated space, move it there, and its old spot is your new
open space.
[16:14] (TheSHAD0W): Then, check if the data you got the piece for is in
allocated space but something else is taking up its spot. If so, move
that data to the open spot.
[16:14] (TheSHAD0W): Finally, write your data into that spot.
[16:14] (TheSHAD0W): Results in a maximum of two moves in order to get
your data out.
 
SpookyET said:
Thanks! The problem with BitTorrent is that a piece can go over file
boundaries. Some people care more about disk space usage than file
fragmentation, so I have to implement this. Sorting is horrible if you
have to go through all the files to sort pieces.

After I mentioned your algorithm, I got this in return:

[16:12] (TheSHAD0W): Heh.
[16:12] (TheSHAD0W): No.
[16:12] (TheSHAD0W): That'd work, but you'd wind up with spikes in usage.

He's right, it has a potential for spiking.
[16:12] (TheSHAD0W): Here's how it's done, and Bram devised it...
[16:12] (TheSHAD0W): You get data for a new chunk.
[16:12] (TheSHAD0W): Allocate space on the end.
[16:12] (TheSHAD0W): If that space allocated is that chunk, write it
there. Done.
[16:13] (TheSHAD0W): If there already exists data for that piece in
previously allocated space, move it there, and its old spot is your new
open space.
[16:14] (TheSHAD0W): Then, check if the data you got the piece for is in
allocated space but something else is taking up its spot. If so, move
that data to the open spot.
[16:14] (TheSHAD0W): Finally, write your data into that spot.
[16:14] (TheSHAD0W): Results in a maximum of two moves in order to get
your data out.
However, the problem with this, IMHO, is that after writing a slot out you
have moved data that *will* be moved again, whereas with mine it only
*might* be. I think this approach will result in more overall disk I/O
than my algorithm, but it will probably be spread out further.

In this case, I would probably benchmark a number of different approaches.
Consider allocating pages (perhaps 4 chunks per page) and trying to permit
only half of a page to be used as scratch at any given time. Or perhaps
other schemes. Now I'm curious; let's see what I can think up...
 
I rewrote the algorithm that TheSHAD0W mentioned to make it clearer.

/*
 * 1) A piece is received for storage.
 * 2) Allocate space at the end of the file to hold it.
 * 3) If the newly allocated space is where the piece is supposed to be,
 *    write it there. Return!
 * 4) Otherwise, search for the piece that is supposed to be in the newly
 *    allocated space.
 * 5) If that piece is not found, search the location where the received
 *    piece is supposed to be, move the piece found there into the newly
 *    allocated space, and write the received piece in its old location.
 *    Return!
 * 6) If the piece that is supposed to be in the newly allocated space was
 *    found, move it there.
 * 7) Search the location where the received piece is supposed to be. If
 *    there is a piece there, move it into the newly emptied space, and
 *    write the received piece there. Return!
 *
 * The result is a maximum of 2 moves and 3 writes.
 * If you have 8 pieces (0-7) and you get 6, 4, 3, 5, 7, 1, 0, 2:
 * 0) You get 6; 6 goes in 0.
 * 1) You get 4; 4 goes in 1.
 * 2) You get 3; 3 goes in 2.
 * 3) You get 5; 3 goes in 3; 5 goes in 2.
 * 4) You get 7; 4 goes in 4; 7 goes in 1.
 * 5) You get 1; 5 goes in 5; 1 goes in 1; 7 goes in 2.
 * 6) You get 0; 6 goes in 6; 0 goes in 0.
 * 7) You get 2; 7 goes in 7; 2 goes in 2.
 */
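As a rough illustration of the steps above (Python, not from the thread; the two dictionaries are stand-ins for the on-disk slot layout and the piece-to-slot map a client would keep):

```python
def place_piece(piece, slot_of, piece_in):
    """Place `piece` (its id is also its home slot index) with at most
    two moves of existing data.  slot_of maps piece -> slot holding it,
    piece_in maps slot -> piece held; in a real client the dictionary
    updates below would be disk reads and writes."""
    new = len(piece_in)                  # allocate one slot at the end
    if new == piece:                     # the new slot is the home slot
        slot_of[piece] = new
        piece_in[new] = piece
        return
    open_slot = new
    if new in slot_of:                   # the piece belonging in the new
        s = slot_of.pop(new)             # slot exists elsewhere: move it
        del piece_in[s]                  # home; its old slot opens up
        slot_of[new] = new
        piece_in[new] = new
        open_slot = s
    if piece in piece_in:                # the home slot is occupied:
        other = piece_in.pop(piece)      # evict the occupant into the
        slot_of[other] = open_slot       # open slot and take the home
        piece_in[open_slot] = other
        slot_of[piece] = piece
        piece_in[piece] = piece
    else:                                # home slot unallocated (or just
        slot_of[piece] = open_slot       # vacated): use the open slot
        piece_in[open_slot] = piece
```

Replaying the 6, 4, 3, 5, 7, 1, 0, 2 arrival order reproduces the moves listed in the comment block.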

I'm not very good at writing with the fewest words possible. If you can
rewrite what I have written above more clearly and in fewer words, I would
appreciate it if you posted it.

You still have to deal with pieces that span multiple files. You have to
create a virtual file system that puts the files end to end, like a chain.
If you wish to see how it works, draw some boxes on paper that represent
pieces and play with them.
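A chain of files could be presented as one stream along these lines (a read-only Python sketch; the class name and the choice of io.RawIOBase are illustrative, and a real client would also need writes and error handling):

```python
import io

class ChainedFile(io.RawIOBase):
    """Presents several files end to end as a single seekable stream,
    so a piece that straddles a file boundary can be read in one call."""
    def __init__(self, paths):
        self.files = [open(p, "rb") for p in paths]
        self.sizes = [f.seek(0, io.SEEK_END) for f in self.files]
        self.pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:                                  # io.SEEK_END
            self.pos = sum(self.sizes) + offset
        return self.pos

    def read(self, n):
        out, pos, start = bytearray(), self.pos, 0
        for f, size in zip(self.files, self.sizes):
            if n <= 0:
                break
            if pos < start + size:             # this file covers pos
                f.seek(pos - start)
                data = f.read(min(n, start + size - pos))
                out += data
                pos += len(data)
                n -= len(data)
            start += size
        self.pos = pos
        return bytes(out)
```

With this in place, the piece-placement algorithm can work against one logical offset space and ignore the underlying file boundaries.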



SpookyET said:
Thanks! The problem with BitTorrent is that a piece can go over file
boundaries. Some people care more about disk space usage than file
fragmentation, so I have to implement this. Sorting is horrible if you
have to go through all the files to sort pieces.

After I mentioned your algorithm, I got this in return:

[16:12] (TheSHAD0W): Heh.
[16:12] (TheSHAD0W): No.
[16:12] (TheSHAD0W): That'd work, but you'd wind up with spikes in
usage.

He's right, it has a potential for spiking.
[16:12] (TheSHAD0W): Here's how it's done, and Bram devised it...
[16:12] (TheSHAD0W): You get data for a new chunk.
[16:12] (TheSHAD0W): Allocate space on the end.
[16:12] (TheSHAD0W): If that space allocated is that chunk, write it
there. Done.
[16:13] (TheSHAD0W): If there already exists data for that piece in
previously allocated space, move it there, and its old spot is your new
open space.
[16:14] (TheSHAD0W): Then, check if the data you got the piece for is in
allocated space but something else is taking up its spot. If so, move
that data to the open spot.
[16:14] (TheSHAD0W): Finally, write your data into that spot.
[16:14] (TheSHAD0W): Results in a maximum of two moves in order to get
your data out.
However, the problem with this, IMHO, is that after writing a slot out you
have moved data that *will* be moved again, whereas with mine it only
*might* be. I think this approach will result in more overall disk I/O
than my algorithm, but it will probably be spread out further.

In this case, I would probably benchmark a number of different approaches.
Consider allocating pages (perhaps 4 chunks per page) and trying to permit
only half of a page to be used as scratch at any given time. Or perhaps
other schemes. Now I'm curious; let's see what I can think up...
 
Sorry about the delay, I've been busy.
SpookyET said:
I rewrote the algorithm that TheSHAD0W mentioned to make it clearer.

/*
 * 1) A piece is received for storage.
 * 2) Allocate space at the end of the file to hold it.
 * 3) If the newly allocated space is where the piece is supposed to be,
 *    write it there. Return!
 * 4) Otherwise, search for the piece that is supposed to be in the newly
 *    allocated space.
 * 5) If that piece is not found, search the location where the received
 *    piece is supposed to be, move the piece found there into the newly
 *    allocated space, and write the received piece in its old location.
 *    Return!
 * 6) If the piece that is supposed to be in the newly allocated space was
 *    found, move it there.
 * 7) Search the location where the received piece is supposed to be. If
 *    there is a piece there, move it into the newly emptied space, and
 *    write the received piece there. Return!
 *
 * The result is a maximum of 2 moves and 3 writes.
 * If you have 8 pieces (0-7) and you get 6, 4, 3, 5, 7, 1, 0, 2:
 * 0) You get 6; 6 goes in 0.
 * 1) You get 4; 4 goes in 1.
 * 2) You get 3; 3 goes in 2.
 * 3) You get 5; 3 goes in 3; 5 goes in 2.
 * 4) You get 7; 4 goes in 4; 7 goes in 1.
 * 5) You get 1; 5 goes in 5; 1 goes in 1; 7 goes in 2.
 * 6) You get 0; 6 goes in 6; 0 goes in 0.
 * 7) You get 2; 7 goes in 7; 2 goes in 2.
 */

I'm not very good at writing with the fewest words possible. If you can
rewrite what I have written above more clearly and in fewer words, I would
appreciate it if you posted it.
Your above sounds pretty fair. All I would do is reword it (and probably
lengthen it). If you understand it, don't worry about the length.
You still have to deal with pieces that span multiple files. You have to
create a virtual file system that puts the files end to end, like a chain.
If you wish to see how it works, draw some boxes on paper that represent
pieces and play with them.

Multiple files... now there is a problem. For .NET I would probably just
create a stream that aggregates the files and does the writing to the
proper one (I assume that's what you are doing). If you do that, you can
just generalize the above algorithm into that stream and forget about the
underlying files.
 