web pages caching

  • Thread starter: Sunny

Sunny

Hi,
I need to download some web pages from a web server (most of them are
not static HTML, but .asp pages).
The content (i.e. the pure HTML) of the pages does not change very
often, so I'm looking for a way to skip downloading pages that have
not changed since the last download.
Can someone point me to a starting point in that direction? Any examples,
etc.?

Thanks
Sunny
 
Hi Sunny,

Thanks for posting in this group.
Can you tell me how you download the web pages from the web server?
Do you use FTP, or just set up links to these pages for the client to download?
If it is possible, I think source control is a good way to do version control.

Thanks

Best regards,
Jeffrey Tan
Microsoft Online Partner Support
Get Secure! - www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.

--------------------
| From: Sunny <[email protected]>
| Subject: web pages cacheing
| Date: Thu, 6 Nov 2003 14:15:38 -0600
| Newsgroups: microsoft.public.dotnet.languages.csharp
 
Hi Jeffrey,

I'm implementing a client for disconnected devices (notebooks, PDAs,
etc.).
This client (when connected) has to download some pages for off-line
viewing. The client connects to a server which, among other things, gets
the necessary pages, zips them and sends them to the client. So I'm
looking for a way to get only the changed pages.
The client can send the server some info about the pages it already has,
so the server can tell whether a page has changed and needs to be fetched.
What I'm missing is exactly what info the client has to send, and how the
server can check whether pages have changed before fully downloading and
processing them.
My current implementation computes MD5 checksums of the pages; the server
then downloads all the pages and sends the client only the changed ones.
But I want to avoid the unnecessary downloading altogether.
I do not know how proxy caches work, so I need some info to implement
something similar.
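
For reference, the checksum comparison described above could look roughly
like the sketch below. The names PageDigest, Md5Hex and ChangedPages, and
the idea of keying both dictionaries by URL, are assumptions made for this
illustration, not the actual implementation. Note that the server still
needs the full bytes of every page before it can compare anything, which is
exactly the cost in question.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: names and data layout are illustrative only.
public static class PageDigest
{
    // Returns the MD5 hash of the content as a lowercase hex string.
    public static string Md5Hex(byte[] content)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(content);
            StringBuilder sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }

    // clientDigests: URL -> digest the client reported for its copy.
    // serverPages:   URL -> bytes the server has just downloaded.
    // Returns the URLs that are new to the client or whose digest differs.
    public static List<string> ChangedPages(
        IDictionary<string, string> clientDigests,
        IDictionary<string, byte[]> serverPages)
    {
        List<string> changed = new List<string>();
        foreach (KeyValuePair<string, byte[]> page in serverPages)
        {
            string known;
            if (!clientDigests.TryGetValue(page.Key, out known)
                || known != Md5Hex(page.Value))
            {
                changed.Add(page.Key);
            }
        }
        return changed;
    }
}
```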

Thanks
Sunny
 
Hi Sunny,

Thanks for your feedback.
Based on my understanding, your problem is how to determine, after the
server has got all the pages, which (modified) pages should be sent to the
client.
I think on the client side you can use the File.GetLastWriteTimeUtc()
method to get each file's last modified time, and then send this
information to the server.
The server side can compare it with the last modified time of its pages to
determine which pages to zip and send to the client.

If I have misunderstood you, please feel free to let me know.

Best regards,
Jeffrey Tan

--------------------
| From: Sunny <[email protected]>
| Subject: RE: web pages cacheing
| Date: Fri, 7 Nov 2003 08:29:49 -0600
 
Hi Jeffrey,
Thanks for the post.
Yes, you understand me correctly, and I do something similar now. But I
wanted something more elegant, i.e. not to download the pages on the
server either :). It seems that I have to read about HTTP headers, etc.,
and how proxy cache servers do this.

Currently I'm using the WebClient class, but in order to get the response
headers I have to read the whole page. At least I couldn't find a
WebClient method that just sends a request and receives only the headers.

Is there any other framework class I can use for that?
Or do I have to implement my own connection class?

Thanks
Sunny
 
Hi again,
I've just found something which confuses me even more:

In MSDN, under the HttpWebRequest.IfModifiedSince property, there is an
example:

// Create a new 'Uri' object with the mentioned string.
Uri myUri = new Uri("http://www.contoso.com");
// Create a new 'HttpWebRequest' object with the above 'Uri' object.
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
// Create a new 'DateTime' object.
DateTime today = DateTime.Now;
if (DateTime.Compare(today, myHttpWebRequest.IfModifiedSince) == 0)
.....

Does it mean that the WebRequest.Create method opens a connection to the
server to get the headers (they are checking the header before GetResponse)?
I cannot see anything like this in the docs for WebRequest.Create.

So, what does that example mean? How do all these classes work: WebClient,
WebRequest, HttpWebRequest?

As far as I know, an HTTP connection has 2 passes: the client connects to
the server, the server responds that it is ready, and after that the client
sends CONTINUE to get the page. So, does WebRequest.Create do only that
first pass? I cannot find any docs about it.

That's what I need: to get the response headers without retrieving the page.
As I am not an expert in HTTP, I may be wrong, so please correct me.

Thanks
Sunny
 
Hi Sunny,

I have tested that sample.
The IfModifiedSince property does seem to return a last modified time.
But I still cannot be sure whether this will actually fetch the content of
the page.
I will do some more research and reply to you ASAP.
Thanks for your understanding.

Best regards,
Jeffrey Tan

--------------------
| From: Sunny <[email protected]>
| Subject: RE: web pages cacheing
| Date: Tue, 11 Nov 2003 17:07:10 -0600
 
Hi Sunny,

Sorry for keeping you waiting so long.
The "IfModifiedSince" property is used to set the "If-Modified-Since"
header of the HTTP request. When the HTTP server receives a request with
the "If-Modified-Since" header set, and the requested file has not been
modified since the time specified in this field, the server will not
return the file. Instead, it returns an HTTP 304 (Not Modified) status
code. If the file has been modified since the time specified, the file is
returned.

So you can set the IfModifiedSince property to the date your client last
fetched the file, and it will behave the way you want.
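
A minimal sketch of such a conditional request is shown here, assuming the
caller keeps the time of its last successful download. The class and method
names (ConditionalFetch, FetchIfModified, IsNotModified) are made up for
this illustration. Depending on the runtime, a 304 may arrive either as a
normal response or as a WebException, so the sketch checks both paths.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper: names are illustrative, not part of the framework.
public static class ConditionalFetch
{
    // True when the server reports that the cached copy is still current.
    public static bool IsNotModified(HttpStatusCode status)
    {
        return status == HttpStatusCode.NotModified; // 304
    }

    // Returns the page body, or null if the page has not changed
    // since lastFetchUtc.
    public static string FetchIfModified(string url, DateTime lastFetchUtc)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.IfModifiedSince = lastFetchUtc; // sent as If-Modified-Since

        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                if (IsNotModified(response.StatusCode))
                    return null;
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                    return reader.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // Some runtimes surface 304 as an exception rather than a response.
            HttpWebResponse failed = ex.Response as HttpWebResponse;
            if (failed != null && IsNotModified(failed.StatusCode))
            {
                failed.Close();
                return null;
            }
            throw;
        }
    }
}
```

A null result means the cached copy can be kept. Whether a dynamic page
such as an .asp page honours If-Modified-Since at all depends on how the
server and the page are written.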

Also, I think you can set the HttpWebRequest.Method property to "HEAD";
then the server will not send the body of the document.
If you need to see what is really going over the wire, you can use NetMon
to monitor this.
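
The HEAD variant could be sketched as follows, assuming the server sends a
Last-Modified header in the usual HTTP date format. HeadProbe,
GetLastModifiedUtc and ParseHttpDate are hypothetical names used only for
this example. Dynamic pages may omit the header, in which case the sketch
returns null.

```csharp
using System;
using System.Globalization;
using System.Net;

// Hypothetical helper: names are illustrative, not part of the framework.
public static class HeadProbe
{
    // Sends a HEAD request and returns the Last-Modified time in UTC,
    // or null when the server does not send a usable header.
    public static DateTime? GetLastModifiedUtc(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD"; // headers only, no response body

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            return ParseHttpDate(response.Headers["Last-Modified"]);
        }
    }

    // Parses an RFC 1123 date such as "Wed, 12 Nov 2003 12:20:09 GMT".
    public static DateTime? ParseHttpDate(string raw)
    {
        DateTime parsed;
        if (!string.IsNullOrEmpty(raw)
            && DateTime.TryParseExact(
                   raw, "r", CultureInfo.InvariantCulture,
                   DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
                   out parsed))
        {
            return parsed;
        }
        return null;
    }
}
```

The returned time can then be compared with the time stored for the cached
copy before deciding whether to issue a full GET.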

Hope this helps,

Best regards,
Jeffrey Tan

--------------------
| From: Sunny <[email protected]>
| Subject: RE: web pages cacheing
| Date: Wed, 12 Nov 2003 12:20:09 -0600
|
| Thanks Jeffrey,
| I'll wait.
|
| Sunny
 
Hi Jeffrey,
Thanks for clarifying this. I'll try HEAD and NetMon and see.

So, following your explanation, it seems that the MSDN example for
If-Modified-Since is wrong. I.e., what are they checking in the "if"
statement when the connection is not established and they haven't set
the property yet?

Thanks
Sunny
 
Hi Sunny,

If you do not set the HttpWebRequest.IfModifiedSince property, .NET
reports the current time as its value by default (actually, the getter
just returns the current time), and .NET will not send this value to the
server. So by default, the page content will be returned.

If you set HttpWebRequest.IfModifiedSince to a time between the page's
last modified time and now, your application will get a 304 (Not
Modified) exception, which indicates that the page has not been modified.

Btw: you should not set this property to a future time, or the server
will return the content of the page. (You can confirm this by setting
HttpWebRequest.IfModifiedSince to 2004,1,1.)

Hope this helps,
Best regards,
Jeffrey Tan

--------------------
| From: Sunny <[email protected]>
| Subject: RE: web pages cacheing
| Date: Thu, 13 Nov 2003 12:01:05 -0600
 
Hi Jeffrey,

Thanks for the post. I understand that clearly. My last question was
about the example. What's the purpose of the "if" statement there? What
are they checking? It doesn't look right to me.
Actually this is not important, I was just curious. And if there is a
mistake, it would be good to have it corrected :)

Thanks
Sunny
 
Hi Sunny,

Thanks for your feedback.
I am glad my reply helped.
I think the sample does not set the IfModifiedSince property, so
IfModifiedSince will always return DateTime.Now.
So the condition in the if statement will almost always be true, unless
you step through the sample in a debugger (rather than running it in one
go): then some time passes between the "DateTime today = DateTime.Now"
statement and the read of myHttpWebRequest.IfModifiedSince, while the
"today" variable does not change, so the condition becomes false.

I think the if statement is just meant to show the good habit of checking
this property. It does not actually have much meaning here.

Hope this helps,

Best regards,
Jeffrey Tan

--------------------
| From: Sunny <[email protected]>
| Subject: RE: web pages cacheing
| Date: Fri, 14 Nov 2003 08:37:20 -0600
 