Kaspersky wins yet another comparison

  • Thread starter: Jari Lehtonen
??? I did.

excuse me, apparently my rhetorical question was not clear... i was
explaining to FRT how using scanners to decide what goes in a test will
leave out samples that should otherwise be in a test and how that can
corrupt the results... *you* then brought up categories, you're correct
about that, but it was a red-herring... the samples that the
scanner-filter method leaves out aren't going to magically all belong
to some uninteresting category...
I was talking about looking at categories of malware that
several good scanners test well in (according to quality tests) that
products to be tested by my method do not do well in. Or, conversely,
I also included that you could also see when a vendor suddenly decided
to drop detection in one of those same categories. It would be obvious
using my method over time when a vendor dropped detection of old DOS
viruses, for example.

that is the ideal scenario, however you cannot blindly hope that
reality will turn out ideally... you have to enumerate the ways in
which things can go wrong - something i tend to be good at...
I don't understand that sentence.

ok, i'll try again - why can't i be talking about samples that are from
all categories...
But in order for me to defend my
method which you attacked as being worthless, I would hope that you
would stick to that topic and not wander off onto something else.

i am still talking about your methodology, don't worry... i'm just
talking about one of the problems it has...
In the case of checking on a scanner with weak Trojan detection, for
example, that scanner is not used in building up the test bed.

yes, i would assume you don't actually require agreement between all
the scanners - that's why i said "imagine"...
I see
no problem. And a scanner used in building up the bed of old DOS
viruses can be tested later for a significant drop in detection in
this category.

i would steer clear of testing for such drops... significant reductions
*could* be a drop in detection of real viruses, or it could be a drop
in detection of crud... without a better means of determining viability
of samples it's impossible to be sure...
It would be better if you requested clarification before you rejected
my idea outright.

i didn't say you were unclear, you were quite clear... there's a
difference between being unclear and being over general... had you been
unclear then i would have been confused and i would have said to myself
'i think there's something wrong here'... instead i found you making
what i thought was a far reaching general statement and since i can't
read your mind i have no way to know when you intend to make a general
statement and when you don't...
You turned off any interest I had in further
discussion or clarification by pontificating and "instructing" me and
insulting me by referring me to a treatise on logic. That pissed me
off.

i'm sorry you feel that way... personally i find that reference (and a
similar one i also have bookmarked) to be quite helpful in getting a
deeper understanding of what can go wrong in a logical argument (both
my own and other people's)...
Omitted population segments??

segments of the population of malware... your methodology will omit a
bunch of viruses, a bunch of worms, a bunch of trojans, etc. from the
final testbed... i'm sorry if statistical jargon terms like
'population' caught you off guard...
Improvement trends get missed?

your stated position is that you can use 'unscientific' tests to
discover trends - trends that presumably indicate the improvement or
deprecation of a scanner over time... trends that are less likely to
reveal themselves when you use scanners to select the samples that you
later test scanners on...
What in
the hell are you talking about?

things that can go wrong with what i currently understand of your
hypothetical unscientific test methodology...
 
excuse me, apparently my rhetorical question was not clear... i was
explaining to FRT how using scanners to decide what goes in a test will
leave out samples that should otherwise be in a test and how that can
corrupt the results... *you* then brought up categories, you're correct
about that, but it was a red-herring... the samples that the
scanner-filter method leaves out aren't going to magically all belong
to some uninteresting category...
Ok.


that is the ideal scenario, however you cannot blindly hope that
reality will turn out ideally... you have to enumerate the ways in
which things can go wrong - something i tend to be good at...

Me too. As an engineer, looking at worst case scenarios occupied a
good deal of my time over many decades.
ok, i'll try again - why can't i be talking about samples that are from
all categories...

Because I'm talking about samples from specific categories. I've only
mentioned two very broad ones that I've chosen. I'm not talking about
all categories of malware. I haven't considered others and I'm not
interested in others right now.
i am still talking about your methodology, don't worry... i'm just
talking about one of the problems it has...

yes, i would assume you don't actually require agreement between all
the scanners - that's why i said "imagine"...


i would steer clear of testing for such drops... significant reductions
*could* be a drop in detection of real viruses, or it could be a drop
in detection of crud... without a better means of determining viability
of samples it's impossible to be sure...

I agree that strictly speaking, all you could say is that one or the
other has occurred. I disagree with "staying clear" for a couple of
reasons. First, it's not my purpose in this to draw peer review
quality conclusions. My purpose is to use the far more easily formed
test beds to look for major trends or shifts in emphasis. It's
informal. It's a screening test. The idea is to be alerted by
relatively large changes. Second, it seems far more likely to me that
some vendor _might_ in the near future drop detection of old DOS
viruses than it is that they would suddenly fix their engines so as to
not detect crud :) I don't believe that crud detection is entirely on
purpose for the sake of playing the testing game. At the current
state of the art, detection of crud is unavoidable to some extent. If
it was avoidable, we could use scanners to tell us that a sample is
viable :) So I think a reasonably good conclusion would be that the
vendor has dropped detection of old DOS viruses and not crud. Good
enough to openly pursue the question with the vendor and raise
questions on the virus newsgroups.
i'm sorry you feel that way... personally i find that reference (and a
similar one i also have bookmarked) to be quite helpful in getting a
deeper understanding of what can go wrong in a logical argument (both
my own and other people's)...

No harm done. I got over it :) Such is life in newsgroups.


Art
http://www.epix.net/~artnpeg
 
On Mon, 01 Mar 2004 18:28:18 -0500, kurt wismer wrote: [snip]
why can't i be talking about
specimens that belong in categories that several good scanners *do*
handle but for whatever reason are not themselves handled yet?

I don't understand that sentence.

ok, i'll try again - why can't i be talking about samples that are from
all categories...


Because I'm talking about samples from specific categories. I've only
mentioned two very broad ones that I've chosen. I'm not talking about
all categories of malware. I haven't considered others and I'm not
interested in others right now.

and i'm talking about the samples that will get excluded from the test
because of the method of sample selection... some will belong in
categories you're not interested in, but not all... and since they
won't be included, any improvement or problem detecting those
particular samples will go unnoticed...
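
A rough sketch of the selection effect described above. It is a toy model,
not anyone's actual test procedure: every scanner is assumed to detect each
sample independently with the same probability, and a sample is assumed to
enter the test bed only if at least `min_agree` scanners flag it. The
function name and all parameters are made up for illustration.

```python
import random

def scanner_filter_demo(n_samples=10000, n_scanners=3, min_agree=2,
                        p_detect=0.8, seed=1):
    """Toy model: build a test bed from samples that enough scanners agree on,
    then compare scanner 0's rate on the whole population with its rate on
    the filtered test bed. Independence between scanners is an assumption."""
    rng = random.Random(seed)
    total_hits = 0   # scanner 0's detections over every sample
    kept = 0         # samples that make it into the test bed
    kept_hits = 0    # scanner 0's detections within the test bed
    for _ in range(n_samples):
        detections = [rng.random() < p_detect for _ in range(n_scanners)]
        total_hits += detections[0]
        if sum(detections) >= min_agree:
            kept += 1
            kept_hits += detections[0]
    return {
        "excluded": n_samples - kept,
        "true_rate": total_hits / n_samples,
        "measured_rate": kept_hits / kept,
    }

result = scanner_filter_demo()
print(result)
```

In this model the excluded samples are never scanned in the later test, so
any change in how a product handles them cannot show up, and the rate
measured on the filtered bed comes out higher than the rate over the whole
population.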

[snip]
I agree that strictly speaking, all you could say is that one or the
other has occurred. I disagree with "staying clear" for a couple of
reasons.

well, it's just a statement of what i would do... the concern being
drawing conclusions that don't follow from the premises...
First, it's not my purpose in this to draw peer review
quality conclusions. My purpose is to use the far more easily formed
test beds to look for major trends or shifts in emphasis. It's
informal. It's a screening test. The idea is to be alerted by
relatively large changes.

ok, and you can do that, but you can't necessarily conclude what kinds
of changes those are... if a vendor rewrites their scanning engine with
the express purpose of performing more exact identification and thereby
cutting down on false alarms i would expect their crud detection to
change significantly...
Second, it seems far more likely to me that
some vendor _might_ in the near future drop detection of old DOS
viruses than it is that they would suddenly fix their engines so as to
not detect crud :) I don't believe that crud detection is entirely on
purpose for the sake of playing the testing game. At the current
state of the art, detection of crud is unavoidable to some extent. If
it was avoidable, we could use scanners to tell us that a sample is
viable :) So I think a reasonably good conclusion would be that the
vendor has dropped detection of old DOS viruses and not crud. Good
enough to openly pursue the question with the vendor and raise
questions on the virus newsgroups.

this is exactly why i would have steered clear of testing for drops in
detection rates - it's too easy to jump to conclusions about what kind
of changes are actually going on... technically all the test would
really tell us is that the detection of *something* changed
significantly, be that something that was supposed to be detected or
something that wasn't...
 
(e-mail address removed) wrote:
On Mon, 01 Mar 2004 18:28:18 -0500, kurt wismer wrote: [snip]
why can't i be talking about
specimens that belong in categories that several good scanners *do*
handle but for whatever reason are not themselves handled yet?

I don't understand that sentence.

ok, i'll try again - why can't i be talking about samples that are from
all categories...


Because I'm talking about samples from specific categories. I've only
mentioned two very broad ones that I've chosen. I'm not talking about
all categories of malware. I haven't considered others and I'm not
interested in others right now.

and i'm talking about the samples that will get excluded from the test
because of the method of sample selection... some will belong in
categories you're not interested in, but not all... and since they
won't be included, any improvement or problem detecting those
particular samples will go unnoticed...

True but hardly significant for my purposes.
[snip]
I agree that strictly speaking, all you could say is that one or the
other has occurred. I disagree with "staying clear" for a couple of
reasons.

well, it's just a statement of what i would do... the concern being
drawing conclusions that don't follow from the premises...
First, it's not my purpose in this to draw peer review
quality conclusions. My purpose is to use the far more easily formed
test beds to look for major trends or shifts in emphasis. It's
informal. It's a screening test. The idea is to be alerted by
relatively large changes.

ok, and you can do that, but you can't necessarily conclude what kinds
of changes those are... if a vendor rewrites their scanning engine with
the express purpose of performing more exact identification and thereby
cutting down on false alarms i would expect their crud detection to
change significantly...

Another factor that I didn't mention is an assumption that crud only
accounts for maybe 10% (to use Nick's number) of the samples. I dunno
if he meant raw from vxer sites without any culling at all or not, but
the implication in the context was that he meant raw. It's fairly easy
to cut that raw percentage drastically using F-Prot /collect and TBAV
(for old DOS viruses) as I've actually done and as someone at Virus
Bulletin wrote a paper on that I saw not long ago. Right off the bat,
a significant pile of crud files can be eliminated with ease. Now, I
have no measure, of course, of the percentage of crud I wound up with
but there is a "hidden assumption" in my mind that it's fairly small
... on the order of maybe just 1% to 3%. Anyway, this is another
reason why I don't believe a sudden drop in crud detection would
affect my conclusions.
this is exactly why i would have steered clear of testing for drops in
detection rates - it's too easy to jump to conclusions about what kind
of changes are actually going on... technically all the test would
really tell us is that the detection of *something* changed
significantly, be that something that was supposed to be detected or
something that wasn't...

You're just too damn much into the picky picky to see the forest for
the trees and the significances for the insignificances :)


Art
http://www.epix.net/~artnpeg
 
(e-mail address removed) wrote: [snip]
Because I'm talking about samples from specific categories. I've only
mentioned two very broad ones that I've chosen. I'm not talking about
all categories of malware. I haven't considered others and I'm not
interested in others right now.

and i'm talking about the samples that will get excluded from the test
because of the method of sample selection... some will belong in
categories you're not interested in, but not all... and since they
won't be included, any improvement or problem detecting those
particular samples will go unnoticed...

True but hardly significant for my purposes.

unfortunately we really only have your say so, your intuition that
they'd be insignificant..

[snip]
Another factor that I didn't mention is an assumption that crud only
accounts for maybe 10% (to use Nick's number) of the samples. I dunno
if he meant raw from vxer sites without any culling at all or not, but
the implication in the context was that he meant raw. It's fairly easy
to cut that raw percentage drastically using F-Prot /collect and TBAV
(for old DOS viruses) as I've actually done and as someone at Virus
Bulletin wrote a paper on that I saw not long ago. Right off the bat,
a significant pile of crud files can be eliminated with ease. Now, I
have no measure, of course, of the percentage of crud I wound up with
but there is a "hidden assumption" in my mind that it's fairly small
... on the order of maybe just 1% to 3%. Anyway, this is another
reason why I don't believe a sudden drop in crud detection would
affect my conclusions.

i don't know that the context in which that 10% figure applies would
necessarily be generalizable to your scenario... it seems as though
this system is a breeding ground for uncertainty... i would only feel
comfortable with the significant change conclusion if it was *very*
significant...
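
One way to put rough numbers on "very significant", as a sketch only. It
assumes detection rates are expressed as fractions of the whole test bed and
that crud makes up at most `crud_fraction` of it. The 1% to 3% and 10%
figures are the guesses quoted above, not measurements, and the function
name is hypothetical.

```python
def min_real_drop(rate_before, rate_after, crud_fraction):
    """Lower bound on the share of the test bed that must be genuine samples
    the scanner stopped detecting.

    At most crud_fraction of the bed is crud, so crud alone can explain a
    drop of at most crud_fraction. Anything beyond that has to come from
    real samples. A result of 0.0 means the drop could be all crud."""
    drop = rate_before - rate_after
    return max(0.0, drop - crud_fraction)

# Detection falls from 98% to 90% of the bed.
print(min_real_drop(0.98, 0.90, 0.03))  # crud guessed at 3%
print(min_real_drop(0.98, 0.90, 0.10))  # crud guessed at 10%
```

With crud guessed at 3%, an 8 point drop leaves roughly 5% of the bed that
cannot be crud. With crud guessed at 10%, the same drop gives a bound of
zero and the test by itself says nothing about which kind of detection
changed.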

[snip]
You're just too damn much into the picky picky to see the forest for
the trees and the significances for the insignificances :)

when it comes to statistical exercises (which is essentially what
detection tests are) i tend to think that's a good thing...
 
when it comes to statistical exercises (which is essentially what
detection tests are) i tend to think that's a good thing...

Just to further annoy you (not really ;)) I happened to think of
another good use I actually found for my "unscientific" test bed. This
goes back to the days of the 16 bit AVPLITE for DOS. KAV introduced
AVPDOS32, and I noticed that it was alerting on some script viruses (I
think it was) that AVPLITE was not. It was peculiar since AVPLITE did
alert on some but failed to alert on a majority in the category that
the new AVPDOS32 did alert on.
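
The kind of comparison described above can be sketched as a set difference
between two scan results. This is only an illustration: the file names, the
category labels and the idea that each scanner's log has already been
reduced to a set of flagged file names are all assumptions, not the actual
log format of AVPLITE or AVPDOS32.

```python
from collections import Counter

def newly_detected_by_category(detected_old, detected_new, category_of):
    """Count, per category, the samples the new scanner flags and the old
    one misses. category_of maps a sample name to its category label."""
    only_new = set(detected_new) - set(detected_old)
    return Counter(category_of[name] for name in only_new)

# Hypothetical test bed with made-up names and categories.
category_of = {"a.vbs": "script", "b.vbs": "script",
               "c.js": "script", "d.com": "dos"}
old_hits = {"a.vbs", "d.com"}
new_hits = {"a.vbs", "b.vbs", "c.js", "d.com"}
print(newly_detected_by_category(old_hits, new_hits, category_of))
```

A lopsided count in one category is the sort of signal that prompts a query
to the vendor. It does not say why the older version misses those samples.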

When I queried KAV, I did get a brief response to the effect that I
should be using the new 32 bit version (which I already knew) since
there would be this sort of problem with the 16 bit version. They
never did announce publicly that the old 16 bit version was
discontinued or was no longer as fully effective as the new 32 bit
version. I think they continued to have it available for download from
the Russian site for quite a long time after this point in time.
Meanwhile, people were still actively using it, judging by many posts
on acv. I did mention what I had seen and heard but I doubt that had
much effect on the AVPLITE for DOS enthusiasts. After all, it was free
and AVPDOS32 was not.

BTW, AVPDOS32 is still available from the Swiss site, and I had noticed
a detection flaw with it some time ago, compared to a later build
called KAVDOS32 build 135. I don't recall off hand what it is, nor
have I mentioned it anywhere until now.

Sometimes you're on your own, as it were. If it hadn't been for my
informal test bed, I wouldn't have had a clue. Certainly, I rely
primarily on quality independent tests. But I wouldn't do without my
useful informal collection either.


Art
http://www.epix.net/~artnpeg
 
BTW, AVPDOS32 is still available from the Swiss site, and I had noticed
a detection flaw with it some time ago, compared to a later build
called KAVDOS32 build 135. I don't recall off hand what it is, nor
have I mentioned it anywhere until now.

There was a Sobig sample that wasn't detected by build 133. I think
build 134 was able to detect it.
 
kurt wismer said:
FromTheRafters wrote:
[snip]
they wish to use. When a popular test organization has the
AVs jumping through hoops that have less than real world
significance, it causes the AVs to change their program so
that they can look better in the comparison tests.

and which hoops would those be, precisely? as far as i know the only
constraint placed on the scanners is that they detect what they're
supposed to detect and that they are able to save their output to a log
file...

That seems entirely reasonable, as long as what they're *supposed*
to detect isn't malware nested six layers deep in archives or within
container files that are several steps away from becoming a threat.
Users should be capable of getting the malware up to the point of it
becoming a threat and scanning it then.
 
kurt wismer said:
FromTheRafters wrote:

i'm saying that in the absence of any controls on the testbed's
integrity, multiple instances of the same piece of malware will a) be
present, and b) be counted as separate things... you cannot avoid
counting them as separate things if you don't know they are duplicates
and if you did know they were duplicates you wouldn't allow them to be
there in the first place...

I had the impression that the normal method was to take the original
sample from the 'collection', infect various files with it, and cull out
from this population the ones that fail to become grandparents, and
use the remaining (some number) of samples in the testbed. Children,
parents, and great-grandparents are not used - but the population of
grandparents (or a subset thereof) is used and is likely to contain a
'duplicate' virus (although probably not an exact copy) within it.

[snip]
 
FromTheRafters said:
kurt wismer said:
FromTheRafters wrote:
[snip]
they wish to use. When a popular test organization has the
AVs jumping through hoops that have less than real world
significance, it causes the AVs to change their program so
that they can look better in the comparison tests.

and which hoops would those be, precisely? as far as i know the only
constraint placed on the scanners is that they detect what they're
supposed to detect and that they are able to save their output to a log
file...

That seems entirely reasonable, as long as what they're *supposed*
to detect isn't malware nested six layers deep in archives or within
container files that are several steps away from becoming a threat.
Users should be capable of getting the malware up to the point of it
becoming a threat and scanning it then.

so then my question would be what "popular test organization" makes
this a requirement in their core testing? i could see testing that if
one was evaluating value added features but it's not a part of virus
detection per se...
 
Just to further annoy you (not really ;)) I happened to think of
another good use I actually found for my "unscientific" test bed.

there are all sorts of uses one can come up with if one can think
outside the box...

[snip]
Sometimes you're on your own, as it were. If it hadn't been for my
informal test bed, I wouldn't have had a clue.

so it played the role of the canary in the coal mine, more or less...
 
FromTheRafters said:
I had the impression that the normal method was to take the original
sample from the 'collection', infect various files with it, and cull out
from this population the ones that fail to become grandparents, and
use the remaining (some number) of samples in the testbed. Children,
parents, and great-grandparents are not used - but the population of
grandparents (or a subset thereof) is used and is likely to contain a
'duplicate' virus (although probably not an exact copy) within it.

first of all, that process represents a type of control (specifically
it's a generic viability control) on the testbed integrity...

second, if the virus was the type that was in any way polymorphic you
would probably want multiple copies to ensure complete detection of
that variant, but that's an exception to the rule... generally
duplicates make calculating the results more difficult because you have
to make sure you don't count them as separate distinct viruses...
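
A minimal sketch of the simplest form of duplicate control, assuming the
samples are available as raw bytes. Grouping by SHA-256 only catches
byte-identical copies. It will not recognise the same virus in different
host files, or polymorphic variants, so it is a floor on the problem rather
than a solution. The function and sample names are hypothetical.

```python
import hashlib

def group_exact_duplicates(samples):
    """samples: dict mapping a sample name to its raw bytes.
    Returns a dict mapping each SHA-256 digest to the names that share it."""
    groups = {}
    for name, data in samples.items():
        digest = hashlib.sha256(data).hexdigest()
        groups.setdefault(digest, []).append(name)
    return groups

# Made-up test bed: two of the three files are byte-identical.
bed = {"one.com": b"\x90\x90\xcd\x20",
       "copy_of_one.com": b"\x90\x90\xcd\x20",
       "two.com": b"\xeb\xfe"}
groups = group_exact_duplicates(bed)
print(len(bed), "files,", len(groups), "distinct by hash")
```

Counting groups instead of files keeps exact copies from being scored as
separate viruses. Anything subtler still needs the kind of analysis
discussed above.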
 