Sorting posers

  • Thread starter: David Trimboli

David Trimboli

I've got a couple of sorting challenges. Maybe they're easy, maybe they're
well-documented, but I haven't seen the answers yet. I'm looking for
solutions that involve no third-party products, and avoiding the use of
temporary files is good, but not necessary. Single-command answers are
best, but batch scripts are acceptable.

I've got a list of names in a text file. The names are not sorted in any
way. They all consist of at least two space-delimited tokens, but may have
more. For instance:

John Doe
Jane Doe
William C. Smith
Joe de la Someone
Frank Alpha
James T. Kirk
[...]

PROBLEM 1: Assuming this is a long list, what command can be used to
identify any duplicate names in the list? (Duplicate names will always be
totally identical. John Doe and John B. Doe are not duplicate names.)

PROBLEM 2: What command will sort this list by last name?

David
Stardate 4598.3

P.S.: This isn't important, but it will satisfy questions I have about
sorting text files in Windows. It reflects a real problem I encountered
yesterday.
 
David Trimboli said:
> I've got a couple of sorting challenges. Maybe they're easy, maybe
> they're well-documented, but I haven't seen the answers yet. I'm
> looking for solutions that involve no third-party products, and
> avoiding the use of temporary files is good, but not necessary.
> Single-command answers are best, but batch scripts are acceptable.

Hi David,
IMO, without third-party tools a batch *is* required (sed might allow a
one-liner).

> I've got a list of names in a text file. The names are not sorted in
> any way. They all consist of at least two space-delimited tokens, but
> may have more. For instance:
>
> John Doe
> Jane Doe
> William C. Smith
> Joe de la Someone
> Frank Alpha
> James T. Kirk
> [...]
>
> PROBLEM 1: Assuming this is a long list, what command can be used to
> identify any duplicate names in the list? (Duplicate names will
> always be totally identical. John Doe and John B. Doe are not
> duplicate names.)

I'd presort the list and keep only the lines that differ from the
previous one.
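
To illustrate the presort-and-compare idea outside of batch, here is a minimal Python sketch. It is not a no-third-party answer for Windows, only a way to see the logic; the function name `find_duplicates` and the sample list are made up for this example.

```python
def find_duplicates(names):
    """Return each name that occurs more than once, reported once, in sorted order."""
    duplicates = []
    previous = None
    for name in sorted(names):
        # After sorting, identical names sit next to each other,
        # so comparing with the previous line is enough.
        if name == previous and (not duplicates or duplicates[-1] != name):
            duplicates.append(name)
        previous = name
    return duplicates


names = ["John Doe", "Jane Doe", "Frank Alpha", "John Doe", "James T. Kirk", "John Doe"]
print(find_duplicates(names))  # ['John Doe']
```

The same loop yields a deduplicated list instead if you collect the lines where `name != previous`.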
> PROBLEM 2: What command will sort this list by last name?

I see a possible problem in distinguishing between last names like
"Smith" and "de la Someone". Otherwise, the last word of a line can be
obtained by shifting in a subroutine.

> P.S.: This isn't important, but it will satisfy questions I have about
> sorting text files in Windows. It reflects a real problem I
> encountered yesterday.

==screen=copy========================================================
C:\test>type test.txt
John Doe
Jane Doe
William C. Smith
Joe de la Someone
Frank Alpha
James T. Kirk
John Doe
Jane Doe
William C. Smith
Joe de la Someone
Frank Alpha
James T. Kirk

C:\test>DavidT001.cmd
Frank Alpha
Jane Doe
John Doe
James T. Kirk
William C. Smith
Joe de la Someone

C:\test>
==screen=copy========================================================

::DavidT001.cmd::::::::::::::::::::::::::::::::::::::::::::::::::::::
@echo off
setlocal enabledelayedexpansion
set "last="
set ifile=c:\test\test.txt
if "%~1" NEQ "" set ifile=%~1
set tfile=%temp%\%random%_%random%.txt
if exist "%tfile%" del /Q "%tfile%" >nul

for /F "delims=" %%A in ('sort "%ifile%"') do (set line=%%A
IF "%%A" NEQ "!last!" call :sub %%A)
for /F "tokens=1,*" %%A in ('sort "%tfile%"') do echo/%%B %%A

goto :eof
:sub
if "%~1" NEQ "" set "name=%~1"&shift&goto :sub
set "last=%line%"
echo/%name% !last: %name%=!>>"%tfile%"
::DavidT001.cmd::::::::::::::::::::::::::::::::::::::::::::::::::::::


HTH
 
David Trimboli said:
> I've got a couple of sorting challenges. Maybe they're easy, maybe they're
> well-documented, but I haven't seen the answers yet. I'm looking for
> solutions that involve no third-party products, and avoiding the use of
> temporary files is good, but not necessary. Single-command answers are
> best, but batch scripts are acceptable.
>
> I've got a list of names in a text file. The names are not sorted in any
> way. They all consist of at least two space-delimited tokens, but may have
> more. For instance:
>
> John Doe
> Jane Doe
> William C. Smith
> Joe de la Someone
> Frank Alpha
> James T. Kirk
> [...]
>
> PROBLEM 1: Assuming this is a long list, what command can be used to
> identify any duplicate names in the list? (Duplicate names will always be
> totally identical. John Doe and John B. Doe are not duplicate names.)

Matthias' suggestion of sorting and then comparing each line with its
predecessor is probably the best. Unfortunately, this can be a bit of a
chore in batch for a newbie (if that is what you are).

> PROBLEM 2: What command will sort this list by last name?

SORT.EXE is a very rudimentary utility. The default sorting field is the
entire line, and the only option is to start that field at a fixed offset
from the beginning of the line. Your surnames are identified as the last
token in the line, and not specifically at a fixed offset.

Two suggestions:

a) use vbscript instead. OK, this is simply my preference, because vbscript
is a better tool for this type of character manipulation (imho, anyway).

b) write a batch script that reads in the file line-by-line. For each line
it identifies the x number of tokens, then writes them out to a second file
with the last token preceding the first. SORT.EXE will then sort this second
file in alphabetical order as is done in the phone book.

If you need to display the sorted data in the original name order (first
middle last), that will be a bit tricky, but is possible.
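
As a rough illustration of suggestion (b), including the step of getting back to the original name order, here is a minimal Python sketch. It assumes the surname is always the last space-delimited token, which is exactly the assumption that breaks for names like "de la Someone"; the function name `sort_by_last_token` is invented for this example.

```python
def sort_by_last_token(names):
    """Sort names by their last token, then by the rest of the line."""
    decorated = []
    for name in names:
        tokens = name.split()
        if not tokens:
            continue  # skip blank lines
        # Phone-book form: last token first, e.g. ("Kirk", "James T.").
        # The original line rides along so it can be restored afterwards.
        decorated.append((tokens[-1], " ".join(tokens[:-1]), name))
    decorated.sort()
    return [original for _last, _rest, original in decorated]


sample = ["John Doe", "Jane Doe", "William C. Smith",
          "Joe de la Someone", "Frank Alpha", "James T. Kirk"]
for line in sort_by_last_token(sample):
    print(line)
```

Carrying the original line through the sort is what makes the "bit tricky" restore step trivial here; in batch the equivalent is splitting the sorted line back into its first token and the remainder.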

/Al
 
Try as I might, I couldn't get it down to one command <G>
The rules are:
The first string is the beginning of the first name.
The next string is added to the first name if it ends in a period (.).
The last name starts with the first string, from the second string onward,
that doesn't end in a period.

I tested this and couldn't get it to fail.

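@echo off
rem Usage: Dups Filename
rem WARNING: the input file is deleted and rewritten, sorted by last name
rem and with duplicates removed. Names that cannot be parsed are reported
rem on screen and dropped from the file.
if {%1}=={} @echo Syntax: Dups Filename&goto :EOF
if not exist %1 @echo Dups - %1 NOT found.&goto :EOF
setlocal
set file=%1
if exist "%TEMP%\Dups1.tmp" del /q "%TEMP%\Dups1.tmp"
rem Pass 1: split each line, up to 8 tokens, and write it as "last,first"
for /f "Tokens=1-8" %%a in ('type %file%') do (
call :name %%a %%b %%c %%d %%e %%f %%g %%h
)
rem pfmn and pln hold the previous first and last name
set pfmn=#
set pln=#
del /q %file%
sort "%TEMP%\Dups1.tmp" /O "%TEMP%\Dups2.tmp"
rem Pass 2: read the sorted list, report duplicates, rewrite the input file
for /f "Tokens=1* Delims=," %%a in ('type "%TEMP%\Dups2.tmp"') do (
set fmn=%%b
set ln=%%a
call :dup
)
del /q "%TEMP%\Dups1.tmp"
del /q "%TEMP%\Dups2.tmp"
endlocal
goto :EOF
rem :name starts fmn with the first token and appends every following
rem token that contains a period
:name
set fmn=%1
set ln=
:name1
shift
if {%1}=={} goto err
set work1=%1
set work2=%work1:.=%
if "%work1%" EQU "%work2%" goto last1
set fmn=%fmn% %1
goto name1
rem The first token without a period starts the last name; all remaining
rem tokens are appended to it
:last1
set ln=%1
:last2
shift
if {%1}=={} goto out1
set ln=%ln% %1
goto last2
:out1
@echo %ln%,%fmn%>>"%TEMP%\Dups1.tmp"
goto :EOF
rem :dup compares the current name with the previous one
:dup
if "%fmn%" NEQ "%pfmn%" goto out2
if "%ln%" NEQ "%pln%" goto out2
@echo %fmn% %ln% is duplicate.
goto :EOF
:out2
set pfmn=%fmn%
set pln=%ln%
@echo %fmn% %ln%>>%file%
goto :EOF
:err
@echo Cannot parse name - %fmn% %ln% %1 %2 %3 %4 %5 %6 %7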

Jerold Schulman
Windows: General MVP
JSI, Inc.
http://www.jsiinc.com
 
The script will fail if you don't follow the rules:

'Jerry lee schulman' will fail.
'Jerry l. schulman' will work, as will 'Jerry lee. schulman'
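
For anyone who finds the batch hard to follow, the parsing rules and the failure case above can be sketched in a few lines of Python. This is only an illustration of the rules as stated (a token "ends in a period"), not a translation of the script; the function name `split_name` is made up.

```python
def split_name(name):
    """Split a name into (first+middle, last) using the period rule."""
    tokens = name.split()
    if not tokens:
        raise ValueError("empty name")
    first = [tokens[0]]
    i = 1
    # Tokens ending in a period (initials) still belong to the first name.
    while i < len(tokens) and tokens[i].endswith("."):
        first.append(tokens[i])
        i += 1
    if i == len(tokens):
        raise ValueError("cannot parse name: " + name)
    # The first token without a period starts the last name.
    return " ".join(first), " ".join(tokens[i:])


print(split_name("Jerry l. schulman"))   # ('Jerry l.', 'schulman')
print(split_name("Jerry lee schulman"))  # ('Jerry', 'lee schulman')
```

The second call shows why an unabbreviated middle name fails: "lee" has no period, so it is taken as the start of the last name and the entry sorts under L instead of S.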

Jerold Schulman
Windows: General MVP
JSI, Inc.
http://www.jsiinc.com
 
My thanks to everyone who responded. Matthias' script is very good for
duplication-checking. Jerold's script is a bit confusing, but he correctly
identifies the rules for parsing the text file.

I think the ultimate answer is to put my foot down and rewrite the text file
first. The first token will be the given name. Everything after that will
be considered the surname, and middle initials and names will be dumped.
Sorry, James T. Kirk, but for sorting purposes you're just James Kirk. It's
a simple matter to find all the lines with more than two tokens; just
operate on those.

With that out of the way, it should be a (relatively) simple matter to sort
by last name. Just use a "for /f ...%A" to write a new text file in the
format "%* %A", sort this file, and if I'm really feeling adventurous I'd
find a way to reformat the list into First Last again.
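
A minimal Python sketch of that plan, for illustration only (it is not a batch answer): the first token is the given name and everything after it is the surname, which corresponds to writing "%* %A" and sorting. The function name `sort_given_rest` is invented, and the list is assumed to have had middle names removed already.

```python
def sort_given_rest(names):
    """Sort by everything after the first token, then by the first token."""
    def key(name):
        given, _sep, surname = name.partition(" ")
        # casefold() makes the comparison case-insensitive. That is a choice
        # made for this sketch so that "de la Someone" files under D.
        return (surname.casefold(), given.casefold())
    return sorted(names, key=key)


cleaned = ["John Doe", "Joe de la Someone", "James Kirk", "Jane Doe", "Frank Alpha"]
for line in sort_given_rest(cleaned):
    print(line)
```

Because the sort key is computed from each line rather than written into it, the output is already in First Last order and no reformatting pass is needed.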

Thanks!

David
Stardate 4599.1
 
Hi David.

> My thanks to everyone who responded. Matthias' script is very good for
> duplication-checking. Jerold's script is a bit confusing, but he
> correctly identifies the rules for parsing the text file.
>
> I think the ultimate answer is to put my foot down and rewrite the
> text file first. The first token will be the given name. Everything
> after that will be considered the surname, and middle initials and
> names will be dumped. Sorry, James T. Kirk, but for sorting purposes
> you're just James Kirk. It's a simple matter to find all the lines
> with more than two tokens; just operate on those.
>
> With that out of the way, it should be a (relatively) simple matter to
> sort by last name. Just use a "for /f ...%A" to write a new text file
> in the format "%* %A", sort this file, and if I'm really feeling
> adventurous I'd find a way to reformat the list into First Last again.

My batch lacks a bit of documentation, but it should do *all* the
mentioned tasks. It removes dups, takes the last name (the last
space-separated element), puts it in front, sorts by this changed line,
and outputs the names in the original order again, but sorted
alphabetically by last name.

Just a small flaw: I forgot to delete the temp file.

For better readability I renamed last to previous and name to lastname.
For illustration I output %ifile% before the two for loops and %tfile%
between them. The working logic isn't changed at all.
Even a columnar arrangement isn't affected by the process.

==screen=copy==========================================================
C:\test>DavidT002.cmd
John A. Doe
Jane X Doe
William C. Smith
Joe de la Someone
Frank Alpha
James T. Kirk
John B. Doe
Jane Y. Doe
William C. Smith
Joe de la Someone
Frank Alpha
James T. Kirk
========================
Alpha Frank
Kirk James T.
Doe Jane X
Doe Jane Y.
Someone Joe de la
Doe John A.
Doe John B.
Smith William C.
========================
Frank Alpha
Jane X Doe
Jane Y. Doe
John A. Doe
John B. Doe
James T. Kirk
William C. Smith
Joe de la Someone

C:\test>
==screen=copy==========================================================

::DavidT002.cmd::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: Dedupes lines and sorts ascending by last word, then by rest of line
@echo off
setlocal enabledelayedexpansion
set "previous="
set ifile=c:\test\test.txt
if "%~1" NEQ "" set ifile=%~1
set tfile=%temp%\%random%_%random%.tmp
if exist "%tfile%" del /Q "%tfile%" >nul
more "%ifile%"&echo=========================
:: Iterate through the sorted infile; call sub only if the line differs
:: from the previous one
for /F "delims=" %%A in ('sort "%ifile%"') do (set line=%%A
IF "%%A" NEQ "!previous!" call :sub %%A)
more "%tfile%"&echo=========================
:: %tfile% now contains: last first middle middle ...
:: Read the sorted last name and the rest, and output in the original order
:: If you want "last, first middle..", use: do echo/%%A, %%B
for /F "tokens=1,*" %%A in ('sort "%tfile%"') do echo/%%B %%A

del /q "%tfile%" >NUL
goto :eof
goto :eof

:: sub is called with all name parts in %1 %2 %3......
:sub
:: set lastname to %1 and shift until last
if "%~1" NEQ "" set "lastname=%~1"&shift&goto :sub
:: remember previous
set "previous=%line%"
:: output lastname and the whole line without lastname to the temp file
echo/%lastname% !line: %lastname%=!>>"%tfile%"
::DavidT002.cmd::::::::::::::::::::::::::::::::::::::::::::::::::::::

Next time I'll try to comment better on the first run ...
 