Design / best practices question


trullock

Hi,

Can anyone suggest the best way to go about the following...

I'm tracking clicks (mouse down x,y coordinates) on a web page by
using some JavaScript to create an XHR which sends the coordinates to
a recording service, such as:

/Record.ashx?X=123&Y=456

In record.ashx, I'm wondering about the best way to record all these
values to the database.
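Roughly speaking, the handler is nothing more than this so far (just a sketch;
RecordHandler and the 204 response are illustrative, and the recording part is
the open question):

using System.Web;

// Rough shape of the handler -- the persistence step is what I'm asking about.
public class RecordHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        int x, y;
        if (int.TryParse(context.Request.QueryString["X"], out x) &&
            int.TryParse(context.Request.QueryString["Y"], out y))
        {
            // record (x, y) here -- insert per request? buffer? stored proc?
        }
        context.Response.StatusCode = 204; // nothing to send back
    }
}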

In the end, when selecting the data from the database, all I need to
know is X, Y, and a count. So, for example:

123,456,10 would mean there were 10 clicks at 123,456.

I'm figuring that since this service is going to be under quite high
load, making a SqlConnection inside Record.ashx for every
request would be a bad idea?

I'm wondering if I should make a static array in the record service,
which then performs a dump at regular intervals.

I'm also trying to learn LINQ and I'm wondering if there's some trickery
I can employ to make this as lightweight as possible.

As an aside, do you think it's best to try and maintain an X,Y,Count
table from the outset, or to just dump clicks into a raw table, which
gets periodically parsed into the grouped table?

Thanks

Trull
 
> Hi,
>
> Can anyone suggest the best way to go about the following...
>
> I'm tracking clicks (mouse down x,y coordinates) on a web page by
> using some JavaScript to create an XHR which sends the coordinates to
> a recording service, such as:
>
> /Record.ashx?X=123&Y=456
>
> In Record.ashx, I'm wondering about the best way to record all these
> values to the database.
>
> In the end, when selecting the data from the database, all I need to
> know is X, Y, and a count. So, for example:
>
> 123,456,10 would mean there were 10 clicks at 123,456.
>
> I'm figuring that since this service is going to be under quite high
> load, making a SqlConnection inside Record.ashx for every
> request would be a bad idea?

First, what is "quite high load"? Are you expecting this service to be a
Google?

Second, what do you think the overhead is to instantiate a SQL connection?

What I am challenging here is whether the creation of the SQL connection is
really a problem. Sure, instantiating a SQL connection is an expensive
endeavor, but you pull from the pool anyway. Under load, the pool will
retain all of the connection objects, so the overhead, per call, is in the
millisecond range at most and more likely microseconds. What I am
getting at is that there are other ways to scale than avoiding putting data
in the database. You can scale vertically by getting beefier machines. You
can spread the web work onto multiple machines via a cluster. You can move
the database to its own server.

Back to the overhead to instantiate a connection issue. In ADO.NET, you do
see a lot of overhead when you have a system that only takes a hit once
every couple of minutes. This is due to the fact that the connection pool is
cleared after 30 seconds (default). On heavily used apps, ADO.NET is
constantly pulling from the connection pool, so the instantiation time is
reduced tremendously.
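
To put a number on how little code is involved, a per-click insert is roughly
this (a sketch only; ClickStore, the "ClickDb" connection string name, and the
Clicks table are placeholders, not something from your project):

using System.Configuration;
using System.Data.SqlClient;

public static class ClickStore
{
    // "ClickDb" and the Clicks table are placeholders for your own names.
    private static readonly string connectionString =
        ConfigurationManager.ConnectionStrings["ClickDb"].ConnectionString;

    public static void RecordClick(int x, int y)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO Clicks (X, Y) VALUES (@x, @y)", conn))
        {
            cmd.Parameters.AddWithValue("@x", x);
            cmd.Parameters.AddWithValue("@y", y);
            conn.Open();            // served from the connection pool under load
            cmd.ExecuteNonQuery();
        }
    }
}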
> I'm wondering if I should make a static array in the record service,
> which then performs a dump at regular intervals.

How important is an accurate count? - a machine failure here means you are
way off, as the array simply dies. At least with a persistent method, you
will be close, even with sudden heart failure on your system.
How are you going to do the dump? - copy the array, clear the original and
then fire another thread to save? Create a new array to service calls, and
fire another thread to save? Use a multi-threaded round robin? All possible,
but add a lot of complexity to your solution. More moving parts means more
places for potential failure.

What you end up doing in situations like this is you just delay your problem
by trading overhead per click (very small) for large overhead later on. There
are certainly times to do this, but you need to ask yourself whether the
extra programming is warranted, weighed against the likelihood of a sudden
heart failure on your machine(s).

In short, know your risks from both directions. Trading a bit of perf for a
lack of safety? Might be worth it to you; I cannot answer that.
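
If you do go the buffering route, the shape is something like this. Strictly a
sketch to show the moving parts I mentioned (swap the buffer under a lock, flush
from a timer thread); ClickBuffer, the 10-second interval, and the pair list are
all illustrative:

using System;
using System.Collections.Generic;
using System.Threading;

public static class ClickBuffer
{
    private static readonly object sync = new object();
    private static List<KeyValuePair<int, int>> buffer =
        new List<KeyValuePair<int, int>>();

    // Flush every 10 seconds; the interval is arbitrary here.
    private static readonly Timer flushTimer =
        new Timer(delegate { Flush(); }, null, 10000, 10000);

    public static void Add(int x, int y)
    {
        lock (sync)
        {
            buffer.Add(new KeyValuePair<int, int>(x, y));
        }
    }

    private static void Flush()
    {
        List<KeyValuePair<int, int>> snapshot;
        lock (sync)
        {
            snapshot = buffer;                          // swap in a fresh buffer
            buffer = new List<KeyValuePair<int, int>>();
        }
        // Write 'snapshot' to the database here (one batched insert).
        // Anything still in memory is lost if the process dies before this runs.
    }
}

Even this minimal version has more failure modes than the straight insert.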
> I'm also trying to learn LINQ and I'm wondering if there's some trickery
> I can employ to make this as lightweight as possible.

LINQ is not a data access method, per se. Nor is it the best option when you
need to disconnect your data, as the data context wants an open connection
all the time.

> As an aside, do you think it's best to try and maintain an X,Y,Count
> table from the outset, or to just dump clicks into a raw table, which
> gets periodically parsed into the grouped table?

I assume you are working with a static image (meaning you are collecting
clicks on a single image of fixed size). If so, I see no reason to use a raw
table and parse. Consider this:

CREATE PROCEDURE TestProc
(
@x int
, @y int
, @clicks int
)
AS

-- [Table] is a placeholder; the row for (@x, @y) is assumed to exist already
UPDATE [Table]
SET clicks = clicks + @clicks
WHERE x = @x
AND y = @y

This would allow you to do either of the ways you are talking about (count
click by click or offline for later).

--
Gregory A. Beamer
MVP, MCP: +I, SE, SD, DBA

Subscribe to my blog
http://feeds.feedburner.com/GregoryBeamer#

or just read it:
http://feeds.feedburner.com/GregoryBeamer

********************************************
| Think outside the box! |
********************************************
 

Hi Gregory,

thanks for your feedback.

You made several relevant points, about vertical scalability and count
accuracy.

What I'm doing is currently just a proof of concept to see how things
go. If load is getting too high, getting better machines may be an
option, along with your other suggestions. Initially, however, I just
wanted to make sure I'd taken the best route possible in the first
place.

If you don't see a problem with using a connection and command to do a
single insert per click, then that's cool. I'll give it a whirl and see
how it holds up :)

The image isn't a fixed size, which means I either need to:

a) always insert, then: select x, y, count(1) group by x, y
b) always insert into a holding table, then batch update a table with
a count field periodically, and always select x, y, clicks from the
aggregated table
c) check if it exists, if so, update, else insert, then: select
x, y, clicks (sketched below)

(Also, there will be multiple images running at the same time, so all
the data has an imageid associated with it.)
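
For (c) I'm picturing a single round trip along these lines (table and column
names here are just what I'm sketching with, and I realise it probably needs a
transaction or a unique key on (ImageId, X, Y) to be safe under concurrency):

// Sketch of option (c): update, and insert only when the coordinate is new.
static class ClickSql
{
    public const string Upsert =
        "UPDATE Clicks SET ClickCount = ClickCount + 1 " +
        " WHERE ImageId = @imageId AND X = @x AND Y = @y; " +
        "IF @@ROWCOUNT = 0 " +
        "  INSERT INTO Clicks (ImageId, X, Y, ClickCount) " +
        "  VALUES (@imageId, @x, @y, 1); ";
}

That would be executed with an ordinary parameterised SqlCommand, the same as a
plain insert.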

I'm just trying to learn where the performance will be won or lost :)

Thanks for your advice :)

Trull
http://www.muonlab.com
 
> What I'm doing is currently just a proof of concept to see how things
> go.

Welcome to my world! :-)

> If load is getting too high, getting better machines may be an
> option, along with your other suggestions. Initially, however, I just
> wanted to make sure I'd taken the best route possible in the first
> place.

Best route, at first, is to code the simplest way you can, which is to
insert on each click. You then create a click simulator and pound the
backend and see where the system breaks down. I believe you will find it is
much higher than you think. The next test is to add stress on the web at the
same level, while you hammer the back end. It does not necessarily have to be
clicks through the UI, though you can do it that way. The idea here is just to see
where the single box breaks, with the added load of pages being served. This
will be lower than the first shot.

Most likely, you will run these tests on your dev box, which sucks compared
to the server(s) (hopefully plural) you will put the solution on, but it
will give you a worst-case benchmark, which is good to have. If even that
line is above what your final solution needs, you are golden.
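
A throwaway simulator does not need to be fancy; something along these lines is
enough (a sketch only: the localhost URL, click counts, thread count, and
coordinate ranges are all placeholders):

using System;
using System.Diagnostics;
using System.Net;
using System.Threading;

class ClickSimulator
{
    static void Main()
    {
        const int totalClicks = 10000;
        const int threadCount = 20;

        // Without this, HTTP requests are throttled to 2 connections per host.
        ServicePointManager.DefaultConnectionLimit = threadCount;

        Random seeds = new Random();
        Stopwatch watch = Stopwatch.StartNew();
        Thread[] workers = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            Random rand = new Random(seeds.Next());
            workers[t] = new Thread(delegate()
            {
                using (WebClient client = new WebClient())
                {
                    for (int i = 0; i < totalClicks / threadCount; i++)
                    {
                        client.DownloadString(string.Format(
                            "http://localhost/Record.ashx?X={0}&Y={1}",
                            rand.Next(0, 1024), rand.Next(0, 768)));
                    }
                }
            });
            workers[t].Start();
        }

        foreach (Thread worker in workers)
        {
            worker.Join();
        }

        Console.WriteLine("{0} clicks in {1} ms",
            totalClicks, watch.ElapsedMilliseconds);
    }
}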
> If you don't see a problem with using a connection and command to do a
> single insert per click, then that's cool. I'll give it a whirl and see
> how it holds up :)

If the simplest solution does not perform to where you feel you need it,
then you have multiple options.

1. Move to beefier boxes
2. Employ a multi-threaded solution
3. Add a service that handles the clicks offline (safer than merely holding
them in a list)
etc.

> The image isn't a fixed size, which means I either need to:
> a) always insert, then: select x, y, count(1) group by x, y
> b) always insert into a holding table, then batch update a table with
> a count field periodically, and always select x, y, clicks from the
> aggregated table
> c) check if it exists, if so, update, else insert, then: select
> x, y, clicks

There are ways to handle this. One is to gather metadata from images you
will serve when they are loaded into the application. This will give you the
parameters.

Just out of curiosity, are you working with maps? It does not matter, as the
algorithm is essentially the same, but I do a lot with maps.

> (Also, there will be multiple images running at the same time, so all
> the data has an imageid associated with it.)

Then the metadata route makes a lot of sense.
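
Reading the dimensions once, when an image is added, is about a one-liner with
System.Drawing (assuming that is available to you; the class and method names
here are just a sketch):

using System.Drawing;

public static class ImageMetadata
{
    // Capture width/height once, when the image is loaded into the app,
    // and store them alongside the imageid.
    public static Size GetSize(string path)
    {
        using (Image image = Image.FromFile(path))
        {
            return new Size(image.Width, image.Height);
        }
    }
}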

> I'm just trying to learn where the performance will be won or lost :)

You are probably going to hammer your SQL Server box first. But putting all
of the work on the web server, where companies generally skimp, does not
work in most cases. It is better to cluster the SQL boxes and/or move the
actual platters to a high speed NAS or SAN (which is fairly standard for
clustering). The end result is the SQL box hums.

--
Gregory A. Beamer
MVP, MCP: +I, SE, SD, DBA

Subscribe to my blog
http://feeds.feedburner.com/GregoryBeamer#

or just read it:
http://feeds.feedburner.com/GregoryBeamer

********************************************
| Think outside the box! |
********************************************
 