Hector Santos
We have a high-end client/server RPC system and, on rare occasions, we get
customer reports of RPC errors.
It has been our support experience that when customers report RPC-related
errors, it usually points to some hardware and/or networking issue.
Most RPC experts I've talked to directly agree with this.
Although it is a rare report, when it does come in, it is always a
frustrating one because it is not something we are able to solve right away,
and it becomes difficult to convince the customer that the application
reporting RPC errors is basically revealing much deeper issues going on with
the machine, NIC, hardware or network connection.
As you can imagine, most people, including myself, do not like to hear they
need to fix or look at hardware to solve a "software issue."
Nonetheless, in the end, nearly 100% of the time, after the customer
finally has no choice but to take our advice and investigate the system, the
problem turns out to be one of the following:
- Windows needing service packs,
- software-based system performance issues caused by heavy I/O,
- a bad NIC,
- a bad cluster on the hard drive,
- network interruptions, router,
- LAN topology, i.e., daisy chain vs. hub,
- bad motherboards,
- or the hardware/machine needing a revamp, etc.
The above is pretty much the order we recommend customers follow when
analyzing RPC-related errors, before any consideration of a system
upgrade/revamp, especially for those who have older setups.
In any case, it's happening again with a very large customer of ours who
previously experienced the rare RPC 1722/1726 errors a few times a month
or so. After we gave them the above list, they finally opted to upgrade
the system from NT to a high-end DELL server running Windows 2000 Advanced
Server.
Now it is happening 3-5 times a day, and it has become mission critical
to get this resolved ASAP.
An MSDN KB lookup for "RPC 1722 1726 Windows 2000" turned up this KB article:
"The Cluster Service Detects RPC Errors 1726 and 1722"
http://support.microsoft.com/default.aspx?kbid=326330
Unless I am reading the KB wrong, it seems to indicate the problem is fixed
in Windows 2000 Service Pack 4.
The customer is already using SP4.
However, it also talks about a hotfix that is available by contacting
Microsoft.
How do you read the above KB?
What else can you think of that could cause the 1722 and 1726 RPC errors?
Anyway, I have exhausted all the possible causes I can think of, so I am
trying to see what I can do within our client/server software to help
address the issue once and for all, like possibly using (or writing) an
independent RPC testing tool. Is there such a thing available now?
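One shape such an independent tool could take is a plain TCP probe that runs
next to the application and logs whether the server's port can be reached at
all, so an RPC error can be matched against a raw connectivity failure at the
same moment. This is only a minimal sketch in Python, assuming the RPC server
is reached over TCP; the names probe_once and probe_loop, and the timeout and
interval values, are illustrative and not taken from any existing tool:

```python
import socket
import time


def probe_once(host, port, timeout=1.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # refused, unreachable, timed out, ...
        return False, time.monotonic() - start, str(exc)


def probe_loop(host, port, count, interval=2.0, log=print):
    """Probe `count` times, log each result, and return the failure count."""
    failures = 0
    for i in range(count):
        ok, elapsed, err = probe_once(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if ok:
            log(f"{stamp} ok   {elapsed * 1000:.1f} ms")
        else:
            failures += 1
            log(f"{stamp} FAIL {elapsed * 1000:.1f} ms {err}")
        if i < count - 1:
            time.sleep(interval)
    return failures
```

Something like probe_loop("server-name", 135, count=30) would probe the RPC
endpoint mapper port every 2 seconds for a minute; the host name is a
placeholder, and the right port depends on how the server is configured. A
probe that fails at the same time as the application would point at the
network rather than at the RPC layer.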
In addition, is there any RPC technique that I could use to work around
this?
For example:
Our RPC client (a modem/internet hosting server) issues a heartbeat function
call to the RPC server every 2 seconds. This server RPC call returns
statistical information to display on the host monitor window.
It is during this heartbeat that the function call returns an RPC error and
the host displays a popup saying "Critical RPC Communication:
Continue or Stop?"
A major reduction of customer RPC error support calls was achieved by
changing the heartbeat to count three (3) consecutive RPC errors before
issuing the popup message.
We did this because, early on, we discovered the majority of the RPC errors
were the result of some network interruption, a "blip" in the network
communications.
For example, for customers who had a legacy daisy-chained LAN, if someone
temporarily pulled the cable off one PC on the LAN, our host running
anywhere on the LAN would instantly detect an RPC error.
Of course, for these customers, we recommended upgrading the LAN by using a
better network (a hub/router) to solve this problem.
But we also saw other instances where a "network blip" can occur for other
reasons, such as heavy disk I/O, network congestion, etc.
In short, if the communications slowed down, an RPC error was detected.
So adding the error count drastically reduced the support calls related to
RPC errors. When we do get calls, it is normally something related to what
I described above.
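The consecutive-error count described above can be sketched roughly as
follows. This is not our actual code, just a minimal Python illustration;
HeartbeatMonitor, call and on_critical are hypothetical names, the RPC
failure is modeled as an OSError, and resetting the counter after the alarm
is an assumption about the desired behavior:

```python
class HeartbeatMonitor:
    """Raise the alarm only after `threshold` consecutive failed heartbeats."""

    def __init__(self, call, on_critical, threshold=3):
        self.call = call                # hypothetical stand-in for the RPC call
        self.on_critical = on_critical  # e.g. show the "Continue or Stop?" popup
        self.threshold = threshold
        self.consecutive_errors = 0

    def beat(self):
        """Run one heartbeat; return the stats on success, None on failure."""
        try:
            stats = self.call()
        except OSError as exc:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.threshold:
                self.on_critical(exc)
                # Assumption: start a new streak after alerting.
                self.consecutive_errors = 0
            return None
        self.consecutive_errors = 0     # any success clears the streak
        return stats
```

A single failed heartbeat followed by a good one never reaches the popup,
which is what filters out the short network blips.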
Anyway, what other available RPC (or non-RPC) technique can I add to help
see if there is a "real" RPC issue, or to get around this error? Maybe see
whether a re-bind works?
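To make the re-bind idea concrete, here is a minimal sketch in Python of what
I have in mind: on a 1722 (RPC_S_SERVER_UNAVAILABLE) or 1726
(RPC_S_CALL_FAILED), throw away the binding, create a fresh one, and retry.
RpcError, make_binding and call are hypothetical stand-ins; in the real
client they would presumably wrap something like RpcBindingFromStringBinding
and RpcBindingFree around the generated stub call:

```python
RPC_S_SERVER_UNAVAILABLE = 1722
RPC_S_CALL_FAILED = 1726
RETRYABLE = {RPC_S_SERVER_UNAVAILABLE, RPC_S_CALL_FAILED}


class RpcError(Exception):
    """Hypothetical exception carrying the RPC status code."""

    def __init__(self, status):
        super().__init__(f"RPC error {status}")
        self.status = status


def call_with_rebind(make_binding, call, max_rebinds=1):
    """Call once; on 1722/1726 drop the binding, make a fresh one, retry.

    Returns (result, rebinds_used). Any other status, or running out of
    rebinds, re-raises the original error.
    """
    binding = make_binding()
    rebinds = 0
    while True:
        try:
            return call(binding), rebinds
        except RpcError as exc:
            if exc.status not in RETRYABLE or rebinds >= max_rebinds:
                raise
            rebinds += 1
            binding = make_binding()  # fresh binding for the retry
```

If the retry succeeds, the server was reachable and only the old
binding/connection had gone stale, which would be worth logging. Since a call
that fails with 1726 may already have run on the server, this only seems safe
for calls that can be repeated, such as the read-only statistics heartbeat.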
I appreciate any input on this matter, including if you just agree that in
your experience these RPC errors are normally based on some
hardware/performance/network communications issue with the customer setup.
Thanks