E
Esmond Hart
I am using ADO.NET to create a gazetteer from use by a
team of botanists requiring to map species distributions
world wide. The intention is to place the common Atlas
database under MSDE 2000 on a server which will initially
be shared by a group on a LAN. We are using VB.NET 2003
and MSDE from SP3 for SQL 2000.
The gazetteer is required because 60% of the specimens to
be mapped have a location description but no latitude and
longitude, so a lot of work involves finding out where the
specimen came from before plotting it. The gazetteer
contains a GeoName table of 7.1M names downloaded from the
public domain GeoNet Name Server and Geographical Name
Information System. There is also a concordance table of
13M words. For example the Atlas contains 13 names
containing 'Cape Town' You can find this out in just 0.7
seconds even though there are 8099 names with the
word 'Town' and 5045 with the word 'Cape'. The concordance
is used to lookup just the Id's of the GEONames which are
read into a Memory Stream by an SQLDataReader. The join or
intersection of the two sets of Ids takes place on the
client. The GeoName information is aquired in 20 row
chunks only when needed. The result is that the load
placed on the server by the individual users at any time
is small; MSDE and ADO.NET seem a fine fit for this
problem.
Another thing the users need from the concordance is what
we term _stem_ searches. The problem is that spellings
have changed over the years and sometimes herbarium
speciment labels are illegible. So we let users hunt for
places like 'cay*'. (This is converted internally to
LIKE 'cay%'in the stored procedure.) There are 5870 places
with a word starting 'cay' in the world atlas (found in
about 0.05 seconds) and 570 starting with 'caz' taking
less time than this machine will measure. The first (ie
the furthest North and West is)is Čazekætnjar'ga in Norway.
Note that single letter stems are disallowed for
performance reasons - we don't want to scan 20% of the
table for e%. Also, because you have to have a two or more
letter stem and that's your lot we avoid all the
complexities of the LIKE operator which can lead to the
server ignoring indexing and scanning the whole table.
Yesterday, before I implemented the horrible hack
described later, the search for words beginning with 'caz'
took over 30 seconds at which time the problem command was
terminated. Experimentation showed that searching for any
word starting with a stem ending in z had the same effect.
There had been a lot of testing of stem searching but no
one had used a stem ending in z.
The concordance table is created initially by this SQL:
CREATE TABLE [Concordance] (
[Word] [char] (8) COLLATE Latin1_General_CI_AS NOT
NULL ,
[GID] [int] NOT NULL ,
[X] [int] NOT NULL ,
[Y] [int] NOT NULL
The table is then loaded from file using BULK INSERT.
Finally the index is created with
CREATE UNIQUE CLUSTERED INDEX IX_Prime
ON [Concordance] (WORD,Y DESC, X,GID)
WITH FILLFACTOR = 100
The table is read only. The 'Words' it contains are
between 1 and 8 characters which are either lower case a-z
or the # sign. The GID field is the ID of the GeoName and
X and Y are longitude and latitude in seconds respectively.
The relevant fragment of the SQL used for a stem search is:
SELECT GID
FROM Concordances{0}
WHERE (Y <= @Y0) AND (Y >= @Y1) -- Always true in World
AND (X >= @X0) AND(X <= @x1) -- searches
AND (Word Like @Word + '%')
ORDER BY Y DESC, X, GID
To fix the problem I changed the collation sequence from
the database default, as above, to Latin1_General_CS_AS,
then Latin_General_BIN. Also I tried using upper case
letters instead of lower case letters in the search words.
In all cases the result was just the same. Stem searches
for sets of letters ending in z were disastrous; all the
rest worked fine.
Every time I made a change it cost me an unproductive hour
whilst my development machine glowed cherry red with the
effort of rebuilding 1.18GB of gazetteer. Having slept on
the problem I introduced a very nasty fix! It involved
modifying two lines of code - and of course rebuilding the
gazetteer :-(
Once change replaced the letter z with the @ sign whenever
a new concordance record was created prior to bulk insert.
The second change was at run time, where the z was
replaced by @ in the @Word parameter value. The change is
completely invisible to the end user; all she notices is
that stem searches ending in z are OK now. This only works
because the collating order of the concordance isn't
really important to our software, all we want to do is
search for words that are either equal to @Word or like
@Word +'%'.
So, is this a well known problem in which case my
apologies for wasting your time but please cite references
so I can fix it in the approved way. Or is it what a
previous boss used to refer to as a 'lacunae' - better
known to a less literate generation as a bug?
Thanks for your help.
team of botanists requiring to map species distributions
world wide. The intention is to place the common Atlas
database under MSDE 2000 on a server which will initially
be shared by a group on a LAN. We are using VB.NET 2003
and MSDE from SP3 for SQL 2000.
The gazetteer is required because 60% of the specimens to
be mapped have a location description but no latitude and
longitude, so a lot of work involves finding out where the
specimen came from before plotting it. The gazetteer
contains a GeoName table of 7.1M names downloaded from the
public domain GeoNet Name Server and Geographical Name
Information System. There is also a concordance table of
13M words. For example the Atlas contains 13 names
containing 'Cape Town' You can find this out in just 0.7
seconds even though there are 8099 names with the
word 'Town' and 5045 with the word 'Cape'. The concordance
is used to lookup just the Id's of the GEONames which are
read into a Memory Stream by an SQLDataReader. The join or
intersection of the two sets of Ids takes place on the
client. The GeoName information is aquired in 20 row
chunks only when needed. The result is that the load
placed on the server by the individual users at any time
is small; MSDE and ADO.NET seem a fine fit for this
problem.
Another thing the users need from the concordance is what
we term _stem_ searches. The problem is that spellings
have changed over the years and sometimes herbarium
speciment labels are illegible. So we let users hunt for
places like 'cay*'. (This is converted internally to
LIKE 'cay%'in the stored procedure.) There are 5870 places
with a word starting 'cay' in the world atlas (found in
about 0.05 seconds) and 570 starting with 'caz' taking
less time than this machine will measure. The first (ie
the furthest North and West is)is Čazekætnjar'ga in Norway.
Note that single letter stems are disallowed for
performance reasons - we don't want to scan 20% of the
table for e%. Also, because you have to have a two or more
letter stem and that's your lot we avoid all the
complexities of the LIKE operator which can lead to the
server ignoring indexing and scanning the whole table.
Yesterday, before I implemented the horrible hack
described later, the search for words beginning with 'caz'
took over 30 seconds at which time the problem command was
terminated. Experimentation showed that searching for any
word starting with a stem ending in z had the same effect.
There had been a lot of testing of stem searching but no
one had used a stem ending in z.
The concordance table is created initially by this SQL:
CREATE TABLE [Concordance] (
[Word] [char] (8) COLLATE Latin1_General_CI_AS NOT
NULL ,
[GID] [int] NOT NULL ,
[X] [int] NOT NULL ,
[Y] [int] NOT NULL
The table is then loaded from file using BULK INSERT.
Finally the index is created with
CREATE UNIQUE CLUSTERED INDEX IX_Prime
ON [Concordance] (WORD,Y DESC, X,GID)
WITH FILLFACTOR = 100
The table is read only. The 'Words' it contains are
between 1 and 8 characters which are either lower case a-z
or the # sign. The GID field is the ID of the GeoName and
X and Y are longitude and latitude in seconds respectively.
The relevant fragment of the SQL used for a stem search is:
SELECT GID
FROM Concordances{0}
WHERE (Y <= @Y0) AND (Y >= @Y1) -- Always true in World
AND (X >= @X0) AND(X <= @x1) -- searches
AND (Word Like @Word + '%')
ORDER BY Y DESC, X, GID
To fix the problem I changed the collation sequence from
the database default, as above, to Latin1_General_CS_AS,
then Latin_General_BIN. Also I tried using upper case
letters instead of lower case letters in the search words.
In all cases the result was just the same. Stem searches
for sets of letters ending in z were disastrous; all the
rest worked fine.
Every time I made a change it cost me an unproductive hour
whilst my development machine glowed cherry red with the
effort of rebuilding 1.18GB of gazetteer. Having slept on
the problem I introduced a very nasty fix! It involved
modifying two lines of code - and of course rebuilding the
gazetteer :-(
Once change replaced the letter z with the @ sign whenever
a new concordance record was created prior to bulk insert.
The second change was at run time, where the z was
replaced by @ in the @Word parameter value. The change is
completely invisible to the end user; all she notices is
that stem searches ending in z are OK now. This only works
because the collating order of the concordance isn't
really important to our software, all we want to do is
search for words that are either equal to @Word or like
@Word +'%'.
So, is this a well known problem in which case my
apologies for wasting your time but please cite references
so I can fix it in the approved way. Or is it what a
previous boss used to refer to as a 'lacunae' - better
known to a less literate generation as a bug?
Thanks for your help.