Sort by line/sort by sentence - suggestions ?

  • Thread starter: John Fitzsimons

John Fitzsimons

I find that with text files of e.g. 300K+, many text editors
crash or hang.

(A) Does anyone have a text editor recommendation that they know
will effortlessly "sort" files at least 300K+ long ?

(B) Has anyone ever come across a text editor that will sort by
sentence ? I have never heard of one. Has anyone else ?


Regards, John.
 

I can't think of a freeware one, no. Unless you:
(A)
<OT>
Use UltraEdit (Shareware) http://www.idmcomp.com/products/features.html
</OT>

or

Turn your text file into a .csv; import into a spreadsheet and sort the
fields

(B) If the sentences are all newlines, then they can be sorted.
Otherwise, I believe you're stuffed.
 
John said:
I find that if one has text files eg. 300K+ that many text editors
crash, or hang.
(A) Does anyone have a text editor recommendation that they know will
effortless "sort" files at least 300K+ long ?

Are they sensitive, or can you stick one online?[1] If you can, I'll
d/l it and see what I can do here, and report back.

[1]Could be your Big Little Black Book <g>, could be public domain "Moby
Dick" -- can't tell from here.
 
John Fitzsimons Wrote in alt.comp.freeware, on Tue, 29 Jul 2003 10:15:11 +1000:
I find that if one has text files eg. 300K+ that many text editors
crash, or hang.
(A) Does anyone have a text editor recommendation that they know
will effortless "sort" files at least 300K+ long ?
(B) Has anyone ever come across a text editor that will sort by
sentence ? I have never heard of one. Has anyone else ?


Most if not all versions of Windows have a command-line utility called
'sort'.
Type sort /? at the prompt for more info.
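
For anyone who would rather script the line sort than depend on an editor, here is a minimal Python sketch of the same idea. The function names, the case-insensitive default and the latin-1 encoding are assumptions for illustration, not something taken from the sort utility itself. It reads the whole file into memory, which should be fine for files of a few hundred kilobytes or a few megabytes.

```python
def sort_lines(lines, ignore_case=True):
    """Return the lines in sorted order without modifying the input."""
    key = str.lower if ignore_case else None
    return sorted(lines, key=key)


def sort_file(src_path, dst_path, encoding="latin-1"):
    """Read src_path, sort its lines, and write the result to dst_path."""
    with open(src_path, encoding=encoding) as src:
        # Drop the trailing newline of each line; it is added back on write.
        lines = [line.rstrip("\n") for line in src]
    with open(dst_path, "w", encoding=encoding) as dst:
        for line in sort_lines(lines):
            dst.write(line + "\n")
```

A call such as sort_file("posts.txt", "posts.sorted.txt") (both file names are hypothetical) leaves the original file untouched and writes the sorted copy next to it.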
 
(A) Does anyone have a text editor recommendation that they know
will effortless "sort" files at least 300K+ long ?

(B) Has anyone ever come across a text editor that will sort by
sentence ? I have never heard of one. Has anyone else ?

NoteTab Light might work ...
Menu option Modify | Lines | Sort

http://www.notetab.com/
 
Blinky said:
John Fitzsimons wrote:
Are they sensitive, or can you stick one online?[1] If you can, I'll
d/l it and see what I can do here, and report back.

Clarification: that's in terms of evaluation of what might fit your
need, not to do your editing for you. :)
 
Turn your text file into a .csv; import into a spreadsheet and sort the
fields
(B) If the sentences are all newlines, then they can be sorted.
Otherwise, I believe you're stuffed.

What about replacing each period with period/newline/newline and then
sorting? It's crude, but might work.

I must say I wonder what's going on down under there John though.
 
John Fitzsimons Wrote in alt.comp.freeware, on Tue, 29 Jul 2003 10:15:11 +1000:
Most if not all versions of windows have a commandline utility called
'sort'
type sort /? at the prompt for more info.

That DOES work VERY well. :-)

As I want to do more than just sort I am still interested in any
suitable editors though.

Many thanks for your help however. :-)


Regards, John.
 
What about replacing each period with period/newline/newline and then
sorting? It's crude, but might work.

That's what I might have to do. The main drawback is that many editors
don't let me do much with large files. Even search and replace. Added
to that I need to address word wrapping. For example, I would want the
end of your first sentence to be re-formed as ;

newline and then sorting?

NOT

newline and thensorting?

I must say I wonder what's going on down under there John though.

Just trying to work out ways to sort/index news posts. I tried
importation into Keynote quite a while ago but some posts wouldn't
import. As I had no idea which ones did/didn't I couldn't manually add
the missing ones. :-(


Regards, John.
 
I haven't really tried sorting and such with larger files, but I've
never had any problems in editing them. How much ram and free disk
space do you have? And have you set Windoze to use all the disk space
it needs that is available? Or did you set a limit?

Control Panel / System / Performance / Virtual Memory (in 98SE)

If you set a limit above, or if you are low on disk space, you will
have limitations in what you can open and work with. I doubt a 300k
file would push you over the limit unless your drive is full and you
don't have much ram though.
That's what I might have to do. The main drawback is that many editors
don't let me do much with large files. Even search and replace. Added
to that I need to address word wrapping. For example, I would want the
end of your first sentence to be re-formed as ;
newline and then sorting?

NOT

newline and thensorting?

I see. I have a utility for word wrapping. It could be tweaked to suit
your needs. I'll bet existing wares will work though.

BK ReplaceEm should work on a 300k file easily if you have the
resources. You might first replace all newline characters with
nothing, leaving a single line of text. Then replace all ending
punctuation characters with character/newline/newline. This will break
the single line of text into individual sentences with a blank line
between them. Sort and then open in an editor that applies word
wrapping, or you could use the little utility I wrote to format as
desired.

(. ! ? ")
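
The steps above (join the wrapped lines, break after ending punctuation, sort, re-wrap) can be sketched in Python roughly as follows. This is only an illustration with made-up function names, not how BK ReplaceEm or the wrapping utility works. Two assumptions differ from the description: lines are joined with a space rather than nothing, so that words at line ends do not fuse, and a double quote only counts as an ending when it directly follows . ! or ?.

```python
import re
import textwrap


def split_sentences(text):
    """Break text after each run of . ! or ?, keeping a closing quote."""
    # Join wrapped lines with a single space so "then" and "sorting" stay apart.
    one_line = " ".join(text.split())
    # Naive rule: every ending punctuation mark is a break point.
    pieces = re.findall(r'[^.!?]+(?:[.!?]+"?|$)', one_line)
    return [p.strip() for p in pieces if p.strip()]


def sort_sentences(text, width=72):
    """Return the sentences sorted, re-wrapped, with a blank line between them."""
    sentences = sorted(split_sentences(text), key=str.lower)
    return "\n\n".join(
        textwrap.fill(s, width=width, break_long_words=False, break_on_hyphens=False)
        for s in sentences
    )
```

Because every full stop is treated as a break point, this naive version will also cut URLs and file names into pieces.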
Just trying to work out ways to sort/index news posts. I tried
importation into Keynote quite a while ago but some posts wouldn't
import. As I had no idea which ones did/didn't I couldn't manually add
the missing ones. :-(

Can you email me one of your files zipped? I'd like to try an older
DOS program that handles files over a gig on it and see what happens.
Remove the REM to mail.
 
John said:
On 29 Jul 2003 01:12:18 GMT, Blinky the Shark wrote:
I find that if one has text files eg. 300K+ that many text editors
crash, or hang.
(A) Does anyone have a text editor recommendation that they know will
effortless "sort" files at least 300K+ long ?
Are they sensitive, or can you stick one online?[1] If you can, I'll
d/l it and see what I can do here, and report back.
Thanks Blinky but that wouldn't help me next time I needed to try
that.
Another post explained that I meant "so I can see if I have anything
that would do what you want, and advise you of what program that was".
Not to do the work for you. :)

Rather than me send you a couple of MB of text file you could just
save a few thousand posts in your largest newsgroup and open the
result in a text editor. That should result in a file of many hundreds
of thousands of lines.

OR

you could grab my 90K line file at ;

http://members.optushome.com.au/jfweb/90klines.txt

and append it maybe a half dozen (or more) times.

Count the lines then "sort". Did it work ? Or did the text editor
hang ? Or crash ?

While you are messing with this big file you might like to see if you
can remove all ">" from that. Did that make whatever you used
hang ? Or crash ? Or is it now totally missing all ">"s ?


Regards, John.
 
I haven't really tried sorting and such with larger files, but I've
never had any problems in editing them. How much ram
765MB.

and free disk
space do you have?

Tens of Gig.
And have you set Windoze to use all the disk space
it needs that is available? Or did you set a limit?
Control Panel / System / Performance / Virtual Memory (in 98SE)

Windows sets it.
If you set a limit above, or if you are low on disk space, you will
have limitations in what you can open and work with. I doubt a 300k
file would push you over the limit unless your drive is full and you
don't have much ram though.

Well, some of the files were larger than 300K. :-)
I see. I have a utility for word wrapping. It could be tweaked to suit
your needs. I'll bet existing wares will work though.

Possibly. I will spend more time on this.
BK ReplaceEm should work on a 300k file easily if you have the
resources. You might first replace all newline characters with
nothing, leaving a single line of text. Then replace all ending
punctuation characters with character/newline/newline. This will break
the single line of text into individual sentences with a blank line
between them. Sort and then open in an editor that applies word
wrapping, or you could use the little utility I wrote to format as
desired.
(. ! ? ")

Name ? Download location ? :-)
Can you email me one of your files zipped? I'd like to try an older
DOS program that handles files over a gig on it and see what happens.
Remove the REM to mail.

It would probably be easier for you to just grab my 90K line file at ;

http://members.optushome.com.au/jfweb/90klines.txt

and append it maybe a half dozen, or more, times.

Count the lines then "unwrap/sort". Did it work ? Or did the file/text
editor hang ? Or crash ?

If you "appended" say perhaps 8 times (or more ?) then after your sort
you should end up with every line (sentence ?) unwrapped and/or
sorted.

Now, how about seeing if you can remove all ">" from that. Did that
make things hang ? Or crash ?


Regards, John.
 
John said:
On 30 Jul 2003 03:29:19 GMT, Blinky the Shark wrote:
Rather than me send you a couple of MB of text file you could just
save a few thousand posts in your largest newsgroup and open the
result in a text editor. That should result in a file of many hundreds
of thousands of lines.

you could grab my 90K line file at ;

and append it maybe a half dozen (or more) times.

I copied it

-rw-rw-r-- 1 blinky blinky 2971674 Jul 30 20:15 90klines.txt

to nine files, serially numbered:

-rw-r--r-- 1 blinky blinky 2971687 Jul 30 20:18 1
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 2
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 3
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 4
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 5
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 6
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 7
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 8
-rw-r--r-- 1 blinky blinky 2971674 Jul 30 20:17 9

I concatenated them to create one large file of about 27mb and
about 800,000 lines:

-rw-r--r-- 1 blinky blinky 26745079 Jul 30 20:25 big
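
A test file like this can also be built with a few lines of Python. This is a sketch with a hypothetical function name and hypothetical file names, not the commands used above; it simply writes the source file into the target several times in a row.

```python
import shutil


def repeat_file(src_path, dst_path, copies):
    """Write `copies` back-to-back copies of src_path into dst_path."""
    with open(dst_path, "wb") as dst:
        for _ in range(copies):
            # Reopen the source each time and stream it, so nothing large
            # has to be held in memory.
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dst)
```

For example, repeat_file("90klines.txt", "big", 9) would produce a file nine times the size of the source.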
Count the lines then "sort". Did it work ? Or did the text editor
hang ? Or crash ?

Hang? Crash? What are these things? ;)

I stuck a sorting script[1] in what, in Windows terms, would probably
be my vim[2] "ini" file.

I opened the file in vim, ran the script, and was prompted for which
column to sort on. I accepted the default column 1. This sorted
the file into a new file (to leave the original intact), named big.tmp
While you are messing with this big file you might like to see if you
can remove all ">" from that. Did that make whatever you used
hang ? Or crash ? Or is it now totally missing all ">"s ?

Hang? Crash? Explain, please?[3] ;)

Still within vim, I issued the search/replace command to remove instances
of one or more ">" characters. With the greater-thans stripped out, the
file was reduced to this size:

-rw-r--r-- 1 blinky blinky 26462452 Jul 30 21:04 big.tmp
Regards, John.

[1] http://vim.sourceforge.net/scripts/script.php?script_id=310

[2] http://www.vim.org/

[3]While it was removing the greater-thans, I was browsing. That
process took about 7 minutes (1.1 gig AMD Athlon / 448 meg RAM).
The sort had taken about 1.5 minutes.
 
You have the resources then.

That's what I thought.

Okay. Thanks.
I can see a problem though. There are many URL's in your text. This
makes it impossible to use '.' as a break point, as each portion of an
address will be broken into a 'sentence'.


url/

Not if the break point is a full stop and a space. If I break there I
should be okay. The only problem is that where a sentence wraps after
a full stop there is no full stop and then space.

I *could* perhaps get around that problem by replacing all full stops
followed by a carriage (soft ?) (hard ?) return by a full stop and
space.
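
The "full stop and then a space" rule, with a line break after a full stop treated the same as a space, could look like this in Python. The function name is made up for illustration. Dots inside a URL are followed by a letter rather than whitespace, so they are left alone. Abbreviations such as "eg. " would still be split, so treat this as a rough sketch.

```python
import re

# Break only where ending punctuation is followed by whitespace
# (a space, or the line break left by word wrapping).
BREAK = re.compile(r"(?<=[.!?])\s+")


def split_on_stop_and_space(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    pieces = BREAK.split(text)
    # Collapse the remaining line breaks inside each sentence to single spaces.
    sentences = (" ".join(p.split()) for p in pieces)
    return [s for s in sentences if s]
```

The result is a list with one unwrapped sentence per entry, ready to be sorted.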

Not sure what the syntax for that would be though in BK ReplaceEm.
I wonder if there is a difference between soft returns and hard
returns ?
There is a similar problem in mass stripping of '>'.

Yes, I thought there might be. Replacing only multiple ">" might be a
compromise solution.
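
One way around the mass-stripping problem is to remove ">" only where it acts as a quote marker at the start of a line. This is a different rule from replacing only multiple ">", and the function name is an assumption for illustration. A ">" inside a line, such as the one closing "<string>", is left untouched.

```python
import re

# One or more ">" at the start of a line, each optionally followed by
# spaces or tabs (e.g. ">", "> > ", ">>>").
QUOTE_PREFIX = re.compile(r"^(?:>[ \t]*)+", re.MULTILINE)


def strip_quote_markers(text):
    """Remove leading quote markers from every line of text."""
    return QUOTE_PREFIX.sub("", text)
```

Lines that start with whitespace before the first ">" are not handled by this sketch.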
Here are a couple of utilities I came across that you might have an
interest in:
Clippy (looks very nice!)

Yes, a great program, but it only appeared to hang when I gave it a
90K line file to process. Gave it over half an hour to "process" but
that didn't seem to help. :-(

Haven't tried that. I doubt that it will do any better than Clippy but
will give it a go.

Regards, John.
 
John Fitzsimons wrote:
That's what it is? Full posts, including headers, all appended?
Yep.
Okay, I'll grab that. That's easy.
No reformatting? Every line is an equal? As found?

Well, I haven't worked out how best to re-format things yet. The
question however is purely academic if a text editor hangs, or
crashes, every time I try to do something to the file !
Will get back to you... [watch this space <g>].

Okay. Thanks. :-)


Regards, John.
 
For those who prefer testimonials about programs, I think this one is
worthy. John was having problems editing/search and replace/sorting
large text files. Many editor authors refer to a 2 meg file as HUGE
when they say their editor can handle huge files. This particular one
is 30 megs and most everything that I've tried so far has failed
miserably. Bk ReplaceEm can hang, but editors have been a problem.

The only editor I've found so far for Windoze:

EditPad Lite

http://www.editpadpro.com/editpadlite.html

Consider this an early nomination. Give it a spin and see what you
think. It just became my default editor. There is an option to store
settings in an .ini file and to leave the registry clean. Great
program!

The test file (if you don't have one) is the text at this link, copied
to itself ten times (~30 meg text file, 28+ million characters):

http://members.optushome.com.au/jfweb/90klines.txt

======================================================
Not if the break point is a full stop and a space. If I break there I
should be okay. The only problem is that where a sentence wraps after
a full stop there is no full stop and then space.

Or "newline/>"
I *could* perhaps get around that problem by replacing all full stops
followed by a carriage (soft ?) (hard ?) return by a full stop and
space.

I just pasted in a newline/> and the one I'm trying now works fine.

EditPad Lite

http://www.editpadpro.com/editpadlite.html

A keeper for sure!

The file you listed was just under 3 megs (roughly 3000k). I copied it
to itself ten times and the final file was just short of 30 megs. It
has search and replace and sorting as well. Press the Wordwrap button
and the text automatically extends. Do your search and replace and it
should be ready for sorting.

There are things like ascii art that are going to be thrown in:
(pretty nice actually)

--- - - - - - - - - - - - -
/ \ __ / - - - - - -
/ / \ ( ) / - - - - -
/ / / / / / / \/ \ - - - -
/ / / / / / / : : - - -
/ / / / / ' ' - -
/ / / / .\ \
=====UU==UU=====
' / / /||\ \ \ '
''''

There are typos also.Like where there is no space after the
punctuation. ^

There are paragraphs like the one below that have a skewed format:

"DMF and 1.68 MB formats are the same physical format of 80 tracks and
21
sectors per trackThe 1.68 MB format has 224 entries in the root
directory,
and
a cluster size is 512 bytesDMF format has only 16 entries in the root
directory
(you need create a subdirectory to copy more than 16 files), and the
cluster
size
is 1024 (DMF 1024) or 2048 bytes (DMF 2048)You can check the cluster
size
in the Image Information..."
Not sure what the syntax for that would be though in Bk ReplaceEm
I wonder if there is a difference between soft returns and hard
returns ?

I dunno. Try EditPad Lite though and paste the returns. I just looked
up and AVG was running in the background while I was searching and
replacing! Wow. This is THE best editor I've ever used.

From Blinky's trial:

"[3]While it was removing the greater-thans, I was browsing. That
process took about 7 minutes (1.1 gig AMD Athlon / 448 meg RAM).
The sort had taken about 1.5 minutes."

Mine:

500mhz Intel / 256 megs of RAM / Win98SE:

Very heavy background processing:
Open file: 45 seconds
Removing all greater-thans (quote marks): 1 minute 15 seconds.

I really thought Linux might be superior. It seems to me that it is
more important to have a well written app and most of what I've tried
so far are poorly written for large text files. Two thumbs up!

There is no sort function unfortunately. The DOS sort might be a good
place to start. I think that you can do everything except sort with
this program though. The heavy background processing was my utility
working on re-wrapping the file. It ran for about 10 minutes. I was
able to open the new file with the old file still open, so there is 60
megs opened with no problem.
 
For those who prefer testimonials about programs, I think this one is
worthy. John was having problems editing/search and replace/sorting
large text files. Many editor authors refer to a 2 meg file as HUGE
when they say their editor can handle huge files. This particular one
is 30 megs and most everything that I've tried so far has failed
miserably. Bk ReplaceEm can hang, but editors have been a problem.

I did post results of both the sort and the search/replace tests he
specified, with vim. No problems with either, with a 27mb text file
supplied[1] by John.

[1]it was 90,000 lines, and I appended it to itself to make one
810,000-line file of the above size, as he requested.

[moment later] As I was deleting the rest of the post, I saw this start
to fly by said:
I dunno, Try EditPad Lite though and paste the returns. I just looked
up and AVG was running in the background while I was searching and
replacing! Wow. This is THE best editor I've ever used.

That was my favorite Win text editor, as well. I never gave it *this*
(my, above) kind of workout, though!
From Blinky's trial:
"[3]While it was removing the greater-thans, I was browsing. That
process took about 7 minutes (1.1 gig AMD Athlon / 448 meg RAM).
The sort had taken about 1.5 minutes."

500mhz Intel / 256 megs of RAM / Win98SE:
Very heavy background processing:
Open file: 45 seconds
Removing all greater-thans (quote marks): 1 minute 15 seconds.

That's fast!

You took out groups of them, too? In pretesting with a small file, I
accidentally only removed "stand-alone" (single) greater-thans. Did you
use the starter file John linked, so that our files were close to the
same in terms of content? There were bazillions of g-t's in his file,
because it was a bunch of news posts head to tail, and was thus very
 
Blinky the Shark wrote:

[referring to quote marks:]
You took out groups of them, too? In pretesting with a small file, I
accidentally only removed "stand-alone" (single) greater-thans. Did you
use the starter file John linked, so that our files were close to the
same in terms of content? There were bazillions of g-t's in his file,
because it was a bunch of news posts head to tail, and was thus very
g-t-heavy, between quotes and "<string>" formations.

BK ReplaceEm proved superior for removing the quote marks.

This file was John's file 40 times over. It is 118,226,241 bytes
(after removing the quote marks). There were 892,320 changes. I used
9 filters, although more might be necessary for other similar files:

--------
(newline & >)
|
--------

That took all quote marks from that file in 4 minutes 40 seconds with
BK ReplaceEm.

EditPad Lite did not run out of memory with RamIdle loaded. It seems
to have loaded the file into memory and written it to the swap file
though. The ~118 meg file opened in 8 minutes and 40 seconds. When
serving a single filter I got tired of waiting after 10 minutes and
cranked up BK; the best replace tool.
 