Word frequency analysis

Edgar De Blieck

I have a hundred short stories of varying length which I would like to
subject to word frequency analysis.

Is there a freeware programme which allows me to:

a) list the words in each story in alphabetical order

b) list the words in each story by word frequency (as a percentage of the
total number of words)

c) automatically compare the frequencies of recurring words

d) append tags to words, so that I can do a statistical breakdown of the
grammatical styles of the stories (eg count the numbers of
adjectives/adverbs/nouns etc and their proximities relative to each other)

EDEB.
 
Thanks Susan, I'll have a go with the ones marked STAT and see whether they
will do what I need!

Much obliged,

EDEB.
 
Edgar said:
Thanks Susan, I'll have a go with the ones marked STAT and see whether they
will do what I need!

You're welcome. I spotted one stand-alone app *but* - it's not for
Windows. :( There are a few other apps that might work. I can check
further if none of the text editors do the job.

Susan
 
I was reading through a website mentioned by MightyKitten (thread: Retrro
2/28/2004), and after coming upon the items listed below, I remembered
reading your request for word count programs. Don't know if these will suit
your needs, but they look interesting.

http://members.cox.net/dos/txtms02.htm

1. WCNT- Count and analyze word frequency in text and HTML documents.
2. WC- Simple word count program.
3. wc24- Word counter also counts sentences, calculates readability index.
4. TI- ("Text Information") Comprehensive text file statistics generator.
 
Much appreciated, thanks! I'm still downloading and trying out the apps!

EDEB
 
Many thanks, I'll give it a go!

EDEB.
 
Hmmmm. I can't seem to make this one work. How is it supposed to function?

EDEB

Hey, EDEB.

I'm not sure what it is you can't get to work:

1) Was it the link I provided? It works here (I doubt that is what you
meant, though);
2) Was it one of the programs shown on the link provided? If so which one?

In regard to (2) above, since more than one text analysis program was
listed on the page, I tried each, and the most comprehensive by far is:

*** WCNT- Count and analyze word frequency in text and HTML documents.

WCNT is a command-line DOS tool. Such tools have certain advantages over
Windows graphical interface programs if you know how to work with them. If
you are unfamiliar with command-line programs, I'll try to help a little
with this post.

To begin with, a significant issue with this program, irrespective of its
being a command-line tool, is that if passed a wildcard spec (e.g., *.txt)
instead of an exact file name (e.g., shortstory.txt), it performs its
analysis on all the text in all the files matching the spec. This is either
good or bad depending on what you are trying to achieve; in your case, where
you want to analyze many text files individually, I think it is a drawback.
It can be overcome, though, in any number of ways, from manually inputting
each file for analysis to writing a script (batch file or VBScript) to
automate the process. The latter is ideal if you know how to use VBScript:
you could write code that loops through each file in a folder, analyzes the
file by calling WCNT (wc.exe), prints the output to a text file, and repeats
until every file has been analyzed.
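
For anyone who would rather not use batch or VBScript, the loop described
above might look roughly like this in Python. Treat it as a sketch under
assumptions, not tested against WCNT itself: the folder names are made up,
the only WCNT syntax used is the "wc.exe inputfile /x" form, and it assumes
your version of Windows can still launch a DOS program such as wc.exe
(which may also want short file names).

```python
import subprocess
from pathlib import Path

# Hypothetical locations; change them for your own PC.
WC_EXE = r"C:\Program Files\WordCount\WC.EXE"
STORY_DIR = Path(r"C:\My Documents\Stories")
OUT_DIR = Path(r"C:\My Documents\Results")


def build_jobs(story_dir, out_dir, wc_exe=WC_EXE, switch="/lf"):
    """Return one (command, output_path) pair per .txt file in story_dir."""
    jobs = []
    for story in sorted(Path(story_dir).glob("*.txt")):
        # One exact file name per call, so each story is analyzed on its own.
        command = [wc_exe, str(story), switch]
        out_path = Path(out_dir) / (story.stem + "_results.txt")
        jobs.append((command, out_path))
    return jobs


def run_jobs(jobs):
    """Run each command, sending its console output to its own file."""
    for command, out_path in jobs:
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w") as out_file:
            # Writing stdout to a file plays the role of ">" in a batch file.
            subprocess.run(command, stdout=out_file, check=True)
```

Calling run_jobs(build_jobs(STORY_DIR, OUT_DIR)) would then leave one
results file per story, e.g. story1_results.txt for story1.txt.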

I don't have time right now to write a generic VBScript, but here is the
sample code from a simple batch file that shows some of the functionality
of WCNT (paste it into an empty file, change any of the path\file locations
for your particular PC, change the option switches, if desired, as shown in
the WCNT documentation, save as somefile.bat, and execute):

REM Begin Batch File **

REM Switch: /l - Prints a sorted plain list of all used words.
"C:\Program Files\WordCount\WC.EXE" "C:\My Documents\Test.txt" /l >"C:\My Documents\Desktop\TestResults.txt"

REM Switch: /lf - Prints a sorted list of all used words with their corresponding frequencies.
"C:\Program Files\WordCount\WC.EXE" "C:\My Documents\Test.txt" /lf >>"C:\My Documents\Desktop\TestResults.txt"

REM Switch: /h - Shows histogram.
"C:\Program Files\WordCount\WC.EXE" "C:\My Documents\Test.txt" /h >>"C:\My Documents\Desktop\TestResults.txt"

REM Switch: /hd - Shows histogram for distinct words.
"C:\Program Files\WordCount\WC.EXE" "C:\My Documents\Test.txt" /hd >>"C:\My Documents\Desktop\TestResults.txt"

REM End Batch File **

Note that the syntax is basically:

wc.exe inputfile /x >outputfile

">" = DOS redirect symbol, which sends normal console (screen) output to
the specified file. If the file already exists, its contents are
overwritten.
">>" = DOS redirect symbol as above, BUT it appends the output to the
existing file.
If neither ">" nor ">>" is used, output is sent to the console (screen)
only. The ">" and ">>" symbols are not covered in the WCNT documentation,
because they belong to DOS rather than to WCNT. WCNT also has its own way
of printing to a file (as opposed to the console) with "@" and "@@" (both
of which are documented), but these cannot be used with all options (for
instance, the "/h" histogram). When using wildcard specs in the inputfile
name, use the "@" or "@@" symbols and use only short file names for the
outputfile name (otherwise you end up with a fileshare violation, and WCNT
terminates).

"/x" = any of the switches available in WCNT (all of them are described in
a Word document supplied with the program).

Read the documentation for WCNT and play around with it. It seems to do
most of what you need. Additional analysis can be done by opening /
entering the results from WCNT in Excel, Access, or equivalents.
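
As an alternative to a spreadsheet, the /lf list can be turned into the
"percentage of the total number of words" figures from the original
question with a few lines of Python. This is only a sketch: it assumes each
entry in the results file is a word followed by its count on one line (as
in the TestResults.txt listing further down), and the function names are
invented for the example.

```python
def parse_frequency_list(text):
    """Collect {word: count} from lines that look like 'WORD 3'."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only two-field lines whose second field is a whole number;
        # this skips the "WC 1.20" banner and the summary lines.
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def as_percentages(counts):
    """Return (word, count, percent) rows, most frequent word first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count, round(100.0 * count / total, 2))
            for word, count in ranked]


if __name__ == "__main__":
    # Hypothetical file name; point it at your own WCNT results file.
    with open("TestResults.txt") as results:
        for word, count, percent in as_percentages(
                parse_frequency_list(results.read())):
            print(word, count, percent)
```

With the full list from my test file, WORDS (6 occurrences out of 43 words)
would come out at about 13.95 %.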

That's it. I either helped, or made things immensely more difficult than
they really are. I hope the former, rather than the latter, is true.
===========================

Following are the contents of the input file (Test.txt) and the output file
(TestResults.txt) I used in my testing of WCNT. All analyses were appended
to one file as per my coding of the sample batch file, but this can be
changed (see discussion of ">", ">>", "@", and "@@" above):
_________________________

*** Text from Test.txt that was analyzed with WCNT ***

Features
========

- Count of lines, characters, non-whitespace characters, words, distinct
words and unique words.

- Average length of words, distinct words and unique words.

- Sorted word lists with frequencies.

- Word length distribution histograms.

- Configurable word sets.

- DOS code page awareness.

- Multiple filespecs with wildcards.

_______________________

*** Analysis done by WCNT saved to TestResults.txt ****

WC 1.20
Processing TEST.TXT

Lines : 18
Characters : 334 ( non-whitespace : 292 )
Words : 43 ( avg. 6.14 letters )
Distinct words : 29 ( avg. 6.62 letters, 67.44 % )
Unique words : 20 ( avg. 7.20 letters, 46.51 % )

AND
AVERAGE
AWARENESS
CHARACTERS
CODE
CONFIGURABLE
COUNT
DISTINCT
DISTRIBUTION
DOS
FEATURES
FILESPECS
FREQUENCIES
HISTOGRAMS
LENGTH
LINES
LISTS
MULTIPLE
NON
OF
PAGE
SETS
SORTED
UNIQUE
WHITESPACE
WILDCARDS
WITH
WORD
WORDS

WC 1.20
Processing TEST.TXT

Lines : 18
Characters : 334 ( non-whitespace : 292 )
Words : 43 ( avg. 6.14 letters )
Distinct words : 29 ( avg. 6.62 letters, 67.44 % )
Unique words : 20 ( avg. 7.20 letters, 46.51 % )

AND 2
AVERAGE 1
AWARENESS 1
CHARACTERS 2
CODE 1
CONFIGURABLE 1
COUNT 1
DISTINCT 2
DISTRIBUTION 1
DOS 1
FEATURES 1
FILESPECS 1
FREQUENCIES 1
HISTOGRAMS 1
LENGTH 2
LINES 1
LISTS 1
MULTIPLE 1
NON 1
OF 2
PAGE 1
SETS 1
SORTED 1
UNIQUE 2
WHITESPACE 1
WILDCARDS 1
WITH 2
WORD 3
WORDS 6

WC 1.20
Processing TEST.TXT

Lines : 18
Characters : 334 ( non-whitespace : 292 )
Words : 43 ( avg. 6.14 letters )
Distinct words : 29 ( avg. 6.62 letters, 67.44 % )
Unique words : 20 ( avg. 7.20 letters, 46.51 % )

2 letters : 2 ( 4.65 %) ========-
3 letters : 4 ( 9.30 %) =================-
4 letters : 8 ( 18.60 %) ===================================-
5 letters : 9 ( 20.93 %) ========================================
6 letters : 5 ( 11.63 %) ======================
7 letters : 1 ( 2.33 %) ====
8 letters : 4 ( 9.30 %) =================-
9 letters : 3 ( 6.98 %) =============
10 letters : 4 ( 9.30 %) =================-
11 letters : 1 ( 2.33 %) ====
12 letters : 2 ( 4.65 %) ========-

WC 1.20
Processing TEST.TXT

Lines : 18
Characters : 334 ( non-whitespace : 292 )
Words : 43 ( avg. 6.14 letters )
Distinct words : 29 ( avg. 6.62 letters, 67.44 % )
Unique words : 20 ( avg. 7.20 letters, 46.51 % )

2 letters : 1 ( 3.45 %) ========
3 letters : 3 ( 10.34 %) ========================
4 letters : 5 ( 17.24 %) ========================================
5 letters : 4 ( 13.79 %) ================================
6 letters : 3 ( 10.34 %) ========================
7 letters : 1 ( 3.45 %) ========
8 letters : 3 ( 10.34 %) ========================
9 letters : 3 ( 10.34 %) ========================
10 letters : 3 ( 10.34 %) ========================
11 letters : 1 ( 3.45 %) ========
12 letters : 2 ( 6.90 %) ================
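
Nothing in the output above compares one story with another, which was item
(c) in the original request. A rough way to do that outside WCNT is sketched
below in Python. It works from the story text directly, not from WCNT's
results, and its idea of a "word" (a run of letters, with optional internal
apostrophes, case ignored) is an assumption that may not match WCNT's
configurable word sets. All names here are made up for the example.

```python
import re
from collections import Counter


def word_counts(text):
    """Count words, ignoring case; apostrophes inside a word are kept."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


def compare_recurring(stories):
    """stories maps a story name to its text.

    Returns {word: {story_name: percent}} for every word that occurs in
    at least two stories, where percent is the word's share of that
    story's total word count.
    """
    counts = {name: word_counts(text) for name, text in stories.items()}
    totals = {name: sum(c.values()) for name, c in counts.items()}
    table = {}
    for word in sorted(set().union(*counts.values())):
        present = [name for name in counts if counts[name][word] > 0]
        if len(present) >= 2:
            table[word] = {
                name: round(100.0 * counts[name][word] / totals[name], 2)
                for name in present
            }
    return table
```

Feeding it a dictionary such as {"story1": text1, "story2": text2} gives one
row per recurring word, which can then be printed or pasted into Excel for
further comparison.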
 
Wurdz will do at least some of this - I haven't looked at it too closely
though - http://adwt.com/pc/wurdz.htm

--
MinMin

"Why do we use answering machines to screen calls and then have call
waiting so we won't miss a call from someone we didn't want to talk to in
the first place?"
 