Data Warehousing

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

Ok, so I'm semi-new to .NET, having done everything manually with SQL code
back in VB6. So before I program this up completely manually again, I thought
I'd ask for better ways to think through this problem.

We have several client machines, and a central data warehousing server. Each
machine may contain hundreds of surveys, and they all are sent to the central
server. Only they can never be networked together, forcing us to use files. I
currently use an XML file to control what happens on each end (deletions,
confirmations of actions taken, configurations, and survey transfers). Only
the survey transfer part has me.

Our database model uses GUIDs for all unique keys, since each survey can
have several tens of thousands of individual data points. The problem with
this is that if I merely serialize the dataset for each survey into files, we
wind up with GUID collisions at the data warehouse, since each machine may
have hundreds of thousands of GUIDs. This forces me to loop through the
surveys being transferred, add each manually with a fresh GUID, and then add
each subcomponent of the survey in turn, all with freshly generated GUIDs.
This turns into a huge process that eats up CPU time and may take upwards of
fifteen minutes for a single larger survey. Needless to say, most people want
our old file-based DOS system back because it was "instant".

Is there a better way to perform the task at hand? I didn't think
replication or subscription models would work because we have to use files as
an intermediary. I've only ever done basic SQL operations, so I'm not sure if
there's something readily available that can basically automate the process.
And I'm new to C# and .NET framework in general, so I'm not sure if there is
something that can help me out there. Would appreciate any tips on how to
best accomplish this.

Thanks,
David
 
If these machines aren't networked together, how do they send the files to
the central data warehousing server? Unless the file is put onto a thumb
drive or some other portable medium and physically carried to the server,
they are networked together. In addition, if these machines are connected to
the Internet, they are networked together (The Internet is a vast TCP/IP
network).

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.
 
Sorry for omitting that information, but the machines are normally located in
secure facilities, so have no access to either the internet or a network of
any kind. Many times they even must have any networking cards removed before
they enter the facility. We will be using USB thumb drives to transfer the
files from the machines to the server, and have a small download/upload app
at each side.

Thanks,
David
 
I do not assume to have all the facts in your environment. The first
question I have is how do you get the survey from the client machine to the
database without a network? Do you use SneakerNet? I assume that you do
have a network inplace, or you wouldn't be worried about GUIDs.
Additionally, I assume you are using SQL server or you wouldn't be worried
about GUID's.

A couple of ideas come to mind immediately as I read your post:

After thinking on it, I would consider a smart client connecting to web
services. The smart client allows the user to work offline and selectively
push/pull data from the host server. The web services allow you to cache
active activity for further processing prior to pushing the data to the
database.

As for the data model, if GUID's are working for you, stick with 'em. You
resolution is simpler than you might think.

Example data model

Surveys
pk, SurveyID GUID not null
SurveyName varchar not null
SurveyDate datetime not null (getDate())

SurveyedPeople
pk, PersonID GUID not Null
Firstname varchar not null
Lastname varchar not null

SurveyElements
pk, ElementID GUID not null
fk, SurveyID GUID not null
ElementName varchar not null
ElementResponseType int not null

SurveyResponses
fk, ElementID GUID
fk, PersonID GUID
Response varchar

1. create a dataset that represents this model, create the relationships.
2. have your SQL adapter use stored procs to process updates, deletes,
inserts.
3. send your changes from the smart client to the webservice

sample insert script

declare @NewIdentity uniqueidentifier
insert into surveys (SurveyID, SurveyName, SurveyDate) Values(NewID(),
@NewSurveyName, SurveyDate)
select @newidentity

3. As you perform inserts, the adapter will update the associated datarow
with the new guid, this will cascade down to the child tables
4. return the updated dataset from the web service
5. merge changes
6. accecpt changes


sample code

client side

sub save(ds as dataset)
dim ws as mywebservice
dim dsChanges as dataset
... connection logic....
dsData = ds.getchanges(Inserted, Updated, Deleted)
ws.save(dsData) 'Serialzation is automatic
ds.merge(dsdata)
ds.acceptchanges
end sub

webservice
<webmethod> _
sub save(ds as dataset)
surveys.update(ds.surveys) 'call to SQLDataAdapter
people.update(ds.people)
responses.update(ds.responses)
end sub

Take a look at Issue Vision from www.windowsforms.com
 
<Snip>
Sorry for omitting that information, but the machines are normally located
in
secure facilities, so have no access to either the internet or a network of
any kind. Many times they even must have any networking cards removed before
they enter the facility. We will be using USB thumb drives to transfer the
files from the machines to the server, and have a small download/upload app
at each side.
</Snip>

This solution still applies, only the transport is SneakerNet, you would
simply serialize the datasets to the hard drive. Pick them up with your
thumb drive, process them, push them back to the thumb drive, then copy them
back to the client machine. GUID or Autonumber contention is not an issue
here, especially since this process seems more serial and sequential than
random updates.

I could make an argument for parallel port, USB, or serial port
communication similar to the old laplink days, and using it to push update
in batch once per day. Certainly, that would be better than manually
touching every computer and inserting a "floppy disk"
 
Yeah, as said in my other reply above, we have no access to any networking,
between machines, or to the server. No internet, and not even an ad hoc
connection. Herein lies the problem. It'd be easy if the databases could
speak directly. :)

For future reference, these are radiation surveys, so you'll understand if I
start using weird terms that don't make sense in a regular personal survey
context.

This led to the idea of serializing each survey (located on each machine's
individual SQL server) into xml (each file would likely be huge though), and
have an xml control file that tells the server which files it needs to
upload, and the server sends a control file back that tells the client to
delete a survey (after it is processed) or update configuration information.
It also has a section for confirmations of all actions. Each side has a few
database tables keeping track of what the other (client or server) should be
doing, and the control file is generated from that.

The problem with serializing a survey was the GUID collision issue.
Currently I solve this by loading the serialized survey into a dataset on the
server, and manually navigating the dataset, reinserting the data manually
for each record, generating new GUIDs along the way. The problem with this is
that we're dealing with up to 100 strips of data per survey, each with
hundreds of data acquisitions (one per second of survey time), each
acquisition with thousands of data points, so it becomes rather tedious to do
new inserts for every record, taking a ton of time. With probably 20 surveys
per day from each machine, up to ten or so now, that winds up adding up big
time. It would take the entire day to upload all those surveys like that.

So, I don't know what the best route would be. It appears your solution is
implemented similar to my current solution, just over a webservice, which
unfortunately we don't have the ability to use. Unless I'm misinterpreting
your comments...

Appreciate the help and helping me think through this; it's a rather weird
problem since we have to use files as an intermediary.

David
 
You say that each machine has its own SQL Server? The solution is simple.
Copy the database file to the thumb drive, and then bring the database file
to the server. Attach the database file to the server, and then import the
data using a query or Stored Procedure. Don't import the GUIDs, but have the
server database use a column, such as an Identity column, that automatically
assigns a unique ID to the record. As the data is in its original binary
format, it should be very fast.

See the SQL Server Books Online for details.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Complex things are made up of
lots of simple things.
 
Unfortunately, a secondary problem is that our server is not located
physically near the machines, it must be in a separate building. Radioactive
environments are usually dusty and noisy, so our server and processing
station is located outside the buildings being surveyed.

A tertiary problem is that this is sensitive information and must 1) stay on
the machine until it has been processed and finalized, 2) must stay in the
data warehouse until all cleanup efforts have completed, and 3) must have the
ability to put back on the client machines even after deletion. The reason
this is a problem is that if we only had to keep the survey up long enough to
process it, there would be no GUID duplicates. And it must remain on the
machine also, until the survey is complete, processed, and marked as ok (no
contamination).

Not sure what you mean about there being no GUID contention. Currently, the
lowest level identity is an Acquisition, a member of Strip, which is a member
of Survey. There may be tens of thousands Acquisitions in a survey, sometimes
more. For a few seqential surveys on the same machine, this is a non-issue.
However, when you take into account there are ten machines, each with a
hundred surveys, each with 10000 acquisitions, your chances of duplicating
one of those Acquisition GUIDs increase by a large enough margin to matter.
So just forcing the whole dataset over without redoing all the GUIDs is not
an option. It doesn't collide on just a few surveys, but the average seems
tobe eight surveys from various machines, before one Acquisition GUID is
duplicated.

Hence my current solution of serializing them as-is, and replacing all the
GUIDs on the server side to new ones with no collision. I guess the only way
to speed that up would be first checking if any of the GUIDs actually
collide, and only go through the replacement if they do, otherwise just pop
the serialized dataset into the server's database. Only when we get to about
a thousand surveys in that data warehouse (not uncommon, our largest project
had a few hundred thousand surveys), those collisions will happen more and
more often.

Unless I'm missing the point somewhere?

Thanks again for the help,
David
 
Kevin Spencer's response is more eloquent than mine in explaining the guid
contention issue.

Essentially, the guid's used on the client side are just place holders so
that work can be completed. guid's on the server side are assigned as the
stored procedures are processed. Since the server is assigning the guid's,
the likely chance of a duplicate guid in the same table is like 2 billion to
1.

The paradigm that I proposed treats each survey as "unprocessed" work and it
remains unprocessed until the server side logic processes the data and the
data is then sent back to the client. Essentially, it is a round trip, just
a very slow round trip.

possible flow

1. create a dataset for a new survey
2. modify the data (insert, update, delete)
3. save the dataset to the disk
4. copy the file(s) to the thumb drive
5. load the file(s) from the thumb drive into the server control program
6. process the data
7. commit the data to the data warehouse
8a. save the processed data back the thumb drive
8b. save lookup and reference data to disk
9. copy the data back onto the client drive from the thumb drive
10. load the file(s) back into the client application
11. commit the data to the local client

The client code will manage the initial guid assignment on the client side.
The dataset itself, along with the stored procedures will manage the guid
assignment on the server side, ignoring the initial guids assigned by the
client.

Since the data was not initially committed to a local data store (SQL or
Access), when you store the data in the dataset on the local server, the
guids will be assigned the values the server had given them. There is no
contention on guid assignment, since the data warehouse is assigning the
guid's.

I do not think that log shipping is a viable solution here
I do not think that replicated databases are a viable solution here
I do not think that hitting the local SQL server, prior to commiting data to
the warehouse is a feesible solution

If the XML files become to large to handle via the thumb drive, there is
always csv and binary formats to work with. By using either the XML, CSV,
and binary solutions, you are only moving affected data and not the entire
database.

I hope that makes it more intelligable.
 
Actually, that may just work perfectly. I think I'll do a simple
backup/restore on each end and do what I want from there. Should be plenty
fast using stored procs. I can probably even drop that xml control file and
just store that control information in a table.

Either that, or my failsafe idea is just to store it in SQL on the clients,
and then keep it in an XML serialized files on the server, opening up one at
a time to process, transferring to DB, and then releasing when done. We'll
never need to asynchronously process files so that's a final option, but a
slightly uglier solution than we wanted. The backup/restore should work
perfectly though. Simple solution, but it was hard to see given constraints.
:) Thanks much guys.

David
 
Back
Top