Kaspersky wins yet another comparison

  • Thread starter: Jari Lehtonen
(e-mail address removed) wrote: [snip]
Lacking quality and recent broad based tests, how can we determine the
current trends?

wait until another good quality test is performed...

That's your opinion. It's not mine.

it's not an opinion, it's an instruction... you asked how, i told you
how...
Mistaken impression???


You're off on a tangent that I wasn't even talking about.

correction, it may not be what you meant, but it most definitely is
what you were talking about... you said only "unscientific tests", you
didn't specify what kind of unscientific test... clearly my example is
an unscientific test...
Bull. If I have a large test bed which includes many Trojans and
product X failed to alert on 95% of them and a year later product X
alerts on 70% of them I'm justified in drawing the conclusion that
product X has been addressing their lack of Trojan detection, no?

no... i've just presented an example which shows how extreme
differences in results can be caused by the detection or non-detection
of a single piece of malware that has multiple instances in the testbed...
It doesn't even matter for this purpose that 10% of my samples aren't
viable. I'm just looking for a major _change_ in the detection
characteristics of a product.

viability of the samples is not the only concern when it comes to
testbed integrity... uniqueness is also a concern, a big concern, and
it's even harder to address than viability...
Similarly, if product Y alerted on 95% of the old DOS viruses in my
collection last year and today it alerts on 10% of them, I'm justified
in concluding that a major change in direction has been made by the
producers of product Y.

if you think so then you are clearly operating under unstated
assumptions about the nature of that collection...

as such i would point you to
http://www.infidels.org/news/atheism/logic.html#alterapars

we can't get very far in a discussion if we don't start out on the same
page... this latest real life test has underlined for me the importance
of being aware of the assumptions we make about test beds so that we
can judge whether or not they're appropriate... uniqueness is something
that is easily overlooked... if one hasn't taken the time to weed out
the non-viable samples you can bet the duplicates are there too...
 
Robert de Brus wrote:

....
99%? So basically what this means is that one can still get infected?

Obviously it's rubbish!

Nobody & no thing is perfect.

Statistically, even an event with probability Zero _may_ happen, and
at once.

AV scanners are necessarily not up-to-the-very-hour. (Look at their
updating frequency.)

Don't forget Safe Hex.

So what is rubbish?

Roy
 
(e-mail address removed) wrote: [snip]
Lacking quality and recent broad based tests, how can we determine the
current trends?

wait until another good quality test is performed...

That's your opinion. It's not mine.

it's not an opinion, it's an instruction... you asked how, i told you
how...

Wrong. It's an opinion and a wrong one. There's no need to wait for
the next quality test to detect trends. And I've explained why. End of
subject.


Art
http://www.epix.net/~artnpeg
 
[snip]
let's perform a thought experiment, shall we? let's consider a test that
has 3 samples... the first sample is virus A and the second 2 samples
are virus B... the test, being unscientific, counts all 3 as separate
viruses.... scanner X misses virus A, scanner Y misses virus B - the
results *should* be 50% for both but because of the improper
methodology it turns out to be 66% for scanner X and 33% for scanner Y...
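
The arithmetic in this thought experiment can be checked with a short sketch. The sample names and the `detects` table are made up for illustration; only the three-sample layout and the two misses come from the post above.

```python
# Hypothetical testbed from the thought experiment: (sample file, virus it contains).
testbed = [("s1", "A"), ("s2", "B"), ("s3", "B")]

# Hypothetical scanner behaviour: scanner X misses virus A, scanner Y misses virus B.
detects = {"X": {"B"}, "Y": {"A"}}

def rate_per_sample(scanner):
    """Detection rate when every file counts as a separate virus."""
    hits = sum(1 for _, virus in testbed if virus in detects[scanner])
    return hits / len(testbed)

def rate_per_virus(scanner):
    """Detection rate when each distinct virus counts once."""
    viruses = {virus for _, virus in testbed}
    return len(viruses & detects[scanner]) / len(viruses)

for name in ("X", "Y"):
    print(name, round(rate_per_sample(name), 2), round(rate_per_virus(name), 2))
```

Counting per file gives about 0.67 for X and 0.33 for Y, while counting per distinct virus gives 0.5 for both.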

You're off on a tangent that I wasn't even talking about.

I, too, see apples and oranges here. Kurt is talking about the overall
"test" methodology I think, and not just the tester's data set maintenance
methodology.

I think that you are right in assuming that a flawed data set can
still be good for a time-difference comparison between each
vendor's versions. The *same* flawed data set and the *same*
flawed test methodology can indeed be called a "comparison"
test between versions.

A nearly ideal test would have a data set more representative of what
is likely to be seen in the field. Once you force restrictions on the test
set it starts to become skewed. It is not practical for a tester to have
a representative cross section of all existing programs so concessions
must be made. After putting all of the necessary restrictions on the data
set and the test method, you are making the AV strive to attain a less
than optimum goal. They will strive to be the "best at test" rather than
to be the best at their real world function.

The *good* tests are merely the least harmful ones.
 
Robert de Brus said:
X-No-Archive: Yes

In Jari Lehtonen <[email protected]> typed
|| Tested by AV-Comparatives organization., the Kaspersky Antivirus gets
|| the best on-demand results with 99.85% of malware detected, McAfee
|| seconds with 95.41%.

99%? So basically what this means is that one can still get infected?

Obviously it's rubbish!

Yeah, there ought to be a law against anything less than 100% effective
telling you that you are protected. ;o)
 
[snip]
let's perform a thought experiment, shall we? let's consider a test that
has 3 samples... the first sample is virus A and the second 2 samples
are virus B... the test, being unscientific, counts all 3 as separate
viruses.... scanner X misses virus A, scanner Y misses virus B - the
results *should* be 50% for both but because of the improper
methodology it turns out to be 66% for scanner X and 33% for scanner Y...

You're off on a tangent that I wasn't even talking about.

I, too, see apples and oranges here. Kurt is talking about the overall
"test" methodology I think, and not just the tester's data set maintenance
methodology.

I think that you are right in assuming that a flawed data set can
still be good for a time-difference comparison between each
vendor's versions. The *same* flawed data set and the *same*
flawed test methodology can indeed be called a "comparison"
test between versions.

Yes, essentially that's the idea. Simply doing comparisons over time
of what scanner X reports on file Y involves no assumptions other than
file Y in my collection hasn't changed :) And also I see minimal
problems for this purpose in using what the scanners report to
categorize the files. If several good scanners identify file Z as the
POOP Trojan (or an alias name), then file Z goes into my Trojan test
bed ... providing there are no other files in that bed that the same
scanners identified as the POOP Trojan. You strive for zero duplicates
of course in order to have a real variety in each category of
interest.
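
A minimal sketch of the selection rule described above, assuming each scanner's report is already available as a name (or None for no identification). The function name `build_testbed`, the alias table, and the threshold of three agreeing scanners are illustrative assumptions, not details given in the post.

```python
from collections import Counter

def build_testbed(reports, aliases=None, min_agree=3):
    """Pick at most one file per malware name agreed on by several scanners.

    reports: {file: [name reported by each scanner, or None]}
    aliases: {alias name: canonical name} (hypothetical mapping)
    """
    aliases = aliases or {}
    bed = {}  # canonical name -> first file accepted under that name
    for file, names in reports.items():
        canon = [aliases.get(n, n) for n in names if n is not None]
        if not canon:
            continue  # no scanner identified this file
        name, votes = Counter(canon).most_common(1)[0]
        if votes >= min_agree and name not in bed:
            bed[name] = file  # later files with the same name are treated as duplicates
    return bed

reports = {
    "z.exe": ["POOP", "POOP", "Poop.A"],
    "z2.exe": ["POOP", "POOP", "POOP"],
    "q.exe": ["FOO", None, None],
}
print(build_testbed(reports, aliases={"Poop.A": "POOP"}))
```

Here only z.exe enters the bed: z2.exe is rejected as a duplicate of POOP and q.exe lacks agreement.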

Having created the categorized test beds, and having found that
scanner A only alerted on 10% of the Trojans in the past but it now
alerts on 70% of them, I see it as obvious that the scanner A vendor
has been doing some work in this area. And that's the only kind of
trend I'm talking about here ... not trends of which scanners score
the highest in various categories of detection.
A nearly ideal test would have a data set more representative of what
is likely to be seen in the field. Once you force restrictions on the test
set it starts to become skewed. It is not practical for a tester to have
a representative cross section of all existing programs so concessions
must be made. After putting all of the necessary restrictions on the data
set and the test method, you are making the AV strive to attain a less
than optimum goal. They will strive to be the "best at test" rather than
to be the best at their real world function.

The *good* tests are merely the least harmful ones.

Which specific problem(s) do you have in mind here?


Art
http://www.epix.net/~artnpeg
 
On Sat, 28 Feb 2004 15:31:21 -0500, kurt wismer <[email protected]>

(e-mail address removed) wrote:
[snip]

Lacking quality and recent broad based tests, how can we determine the
current trends?

wait until another good quality test is performed...

That's your opinion. It's not mine.

it's not an opinion, it's an instruction... you asked how, i told you
how...

Wrong. It's an opinion and a wrong one.

not in the dialect of english i happen to speak, it ain't... up where
i'm from it's an instruction, just like "hold your horses" or "go fly a
kite"... maybe where you're from it's an opinion, but frankly, that's
just weird...
There's no need to wait for
the next quality test to detect trends. And I've explained why. End of
subject.

what you've explained is your justification for accepting anecdotal
evidence... what i've tried to show you is that without controls on the
quality of the testbed, non-uniqueness of samples can invalidate any
conclusion you hope to draw from such a test...
 
FromTheRafters said:
I, too, see apples and oranges here. Kurt is talking about the overall
"test" methodology I think, and not just the tester's data set maintenance
methodology.

in fact, i did not go in depth about test methodology at all... i only
dealt with the issue of testbed integrity...
I think that you are right in assuming that a flawed data set can
still be good for a time-difference comparison between each
vendor's versions. The *same* flawed data set and the *same*
flawed test methodology can indeed be called a "comparison"
test between versions.

except that the presence of duplicates magnifies the appearance of what
might otherwise be insignificant changes...
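
A rough illustration of that magnification, using invented numbers: a 100-file testbed in which 40 files are copies of one Trojan, and a scanner that adds detection for just that one Trojan between two runs.

```python
# Hypothetical testbed: 40 copies of Trojan "T1" plus 60 distinct samples U0..U59.
testbed = ["T1"] * 40 + [f"U{i}" for i in range(60)]

# Hypothetical detections: last year 10 distinct samples, this year the same 10 plus T1.
before = {f"U{i}" for i in range(10)}
after = before | {"T1"}

def rate(detected):
    """Detection rate counting every file, duplicates included."""
    return sum(sample in detected for sample in testbed) / len(testbed)

def unique_rate(detected):
    """Detection rate counting each distinct sample once."""
    unique = set(testbed)
    return len(unique & detected) / len(unique)

print(rate(before), rate(after))
print(round(unique_rate(before), 3), round(unique_rate(after), 3))
```

Counted per file the scanner appears to go from 10% to 50%; counted per distinct sample it moves from roughly 16% to 18%.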
 
Yes, essentially that's the idea. Simply doing comparisons over time
of what scanner X reports on file Y involves no assumptions other than
file Y in my collection hasn't changed :) And also I see minimal
problems for this purpose in using what the scanners report to
categorize the files. If several good scanners identify file Z as the
POOP Trojan (or an alias name), then file Z goes into my Trojan test
bed ... providing there are no other files in that bed that the same
scanners identified as the POOP Trojan. You strive for zero duplicates
of course in order to have a real variety in each category of
interest.

and so it comes out, there *were* unstated assumptions about the nature
of the testbed used in your hypothetical 'unscientific tests'....
 
Roy said:
Statistically, even an event with probability Zero _may_ happen, and at
once.

ummm no it can't... if someone says event X has a zero probability and
event X happens, then that someone was wrong and the probability wasn't
actually zero...
 
On Sat, 28 Feb 2004 21:03:32 -0500, "FromTheRafters"


Which specific problem(s) do you have in mind here?

Nothing really specific Art, just that people want to have
comparison tests to reference when deciding on which AV
they wish to use. When a popular test organization has the
AVs jumping through hoops that have less than real world
significance, it causes the AVs to change their program so
that they can look better in the comparison tests.
 
FromTheRafters wrote:
[snip]
they wish to use. When a popular test organization has the
AVs jumping through hoops that have less than real world
significance, it causes the AVs to change their program so
that they can look better in the comparison tests.

and which hoops would those be, precisely? as far as i know the only
constraint placed on the scanners is that they detect what they're
supposed to detect and that they are able to save their output to a log
file...
 
kurt wismer said:
in fact, i did not go in depth about test methodology at all... i only
dealt with the issue of testbed integrity...

Your statement about the "test" counting two instances of
virus B as two viruses made me think that the test method
was in question. Are you saying that the count is done by
the dataset maintenance's method and not by the test? I
would think that a test would want to have many instances
of virus B (polymorphic?) and count misses as misses. That
is to say that the AV being tested missed a virus B not all
virus Bs.
except that the presence of duplicates magnifies the appearance of what
might otherwise be insignificant changes...

True, but I didn't assume that Art was talking about quantitative
measurements - only trends. You are right though, a lot would
depend on how mucked up the testbed and test was to begin
with.
 
Nothing really specific Art, just that people want to have
comparison tests to reference when deciding on which AV
they wish to use. When a popular test organization has the
AVs jumping through hoops that have less than real world
significance, it causes the AVs to change their program so
that they can look better in the comparison tests.

Well, it seems to me that "real world significance" is like beauty.
It's in the eyes of the beholder :)


Art
http://www.epix.net/~artnpeg
 
FromTheRafters said:
kurt wismer said:
FromTheRafters said:
[snip]

let's perform a thought experiment, shall we? let's consider a test that
has 3 samples... the first sample is virus A and the second 2 samples
are virus B... the test, being unscientific, counts all 3 as separate
viruses.... scanner X misses virus A, scanner Y misses virus B - the
results *should* be 50% for both but because of the improper
methodology it turns out to be 66% for scanner X and 33% for scanner Y...

You're off on a tangent that I wasn't even talking about.

I, too, see apples and oranges here. Kurt is talking about the overall
"test" methodology I think, and not just the tester's data set maintenance
methodology.

in fact, i did not go in depth about test methodology at all... i only
dealt with the issue of testbed integrity...


Your statement about the "test" counting two instances of
virus B as two viruses made me think that the test method
was in question. Are you saying that the count is done by
the dataset maintenance's method and not by the test?

i'm saying that in the absence of any controls on the testbed's
integrity, multiple instances of the same piece of malware will a) be
present, and b) be counted as separate things... you cannot avoid
counting them as separate things if you don't know they are duplicates
and if you did know they were duplicates you wouldn't allow them to be
there in the first place...

also, while viability is something that can be tested for generically,
uniqueness is not... art's suggested method of letting the scanners do
the classification is a kludge and assumes that all the samples you're
using are detected and *identified* by at least one scanner (heuristic
detection may seem like a reasonable kludge for classification when
your concern is viability, but obviously does not help to establish
uniqueness)...

of course the sampling bias generated by this method means that if a
product significantly improves its detection for samples that none of
the products could originally detect, you won't be able to see it...
that means such a test can only detect improvements and cannot detect
the lack of improvement... the conclusions one can draw from such a
test are quite limited...
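
The sampling bias can be sketched with invented numbers: a population of 100 samples of which the reference scanners identify 60, and a product that later adds detection for the 40 they never identified. All names and counts here are assumptions for illustration.

```python
# Hypothetical population: sample -> True if the reference scanners identify it.
population = {f"m{i}": (i < 60) for i in range(100)}

# Only identified samples can be classified, so only they enter the testbed.
testbed = {s for s, identified in population.items() if identified}

# Hypothetical product under test: detects m0..m29 last year,
# and this year also the 40 samples nobody identified originally.
before = {f"m{i}" for i in range(30)}
after = before | {f"m{i}" for i in range(60, 100)}

def rate(detected, samples):
    """Fraction of `samples` that the product detects."""
    samples = set(samples)
    return len(detected & samples) / len(samples)

print(rate(before, testbed), rate(after, testbed))
print(rate(before, population), rate(after, population))
```

On the testbed the product scores 0.5 both times, although over the whole population it went from 0.3 to 0.7.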
I would think that a test would want to have many instances
of virus B (polymorphic?) and count misses as misses. That
is to say that the AV being tested missed a virus B not all
virus Bs.

true, but detection in that case should be all or nothing... all
instances of polymorphic virus A should count as 1 and if you don't
detect them all you score a 0...
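
One way to read that all-or-nothing rule, as a sketch: group the instances by virus and give a point only when every instance of that virus is detected. The function name and the sample ids are hypothetical.

```python
def family_score(instances, detected):
    """Score each virus 1 only if every one of its instances is detected.

    instances: {virus name: [sample ids]}
    detected:  set of sample ids the scanner alerted on
    """
    scores = {
        virus: int(all(sample in detected for sample in samples))
        for virus, samples in instances.items()
    }
    return scores, sum(scores.values()) / len(scores)

# Hypothetical run: one of three instances of polymorphic virus A is missed.
instances = {"A": ["a1", "a2", "a3"], "B": ["b1"]}
scores, overall = family_score(instances, detected={"a1", "a2", "b1"})
print(scores, overall)
```

Missing a single instance of A costs the whole point for A, so the overall score here is 0.5 rather than 3 out of 4.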
True, but I didn't assume that Art was talking about quantitative
measurements - only trends.

even for trends you'd be looking for 'significant' improvements - but
without quality control on the testbed, such determinations of
significance are specious...
 
i'm saying that in the absence of any controls on the testbed's
integrity,

A test bed that isn't "scientific" isn't necessarily uncontrolled.
When I use the term "scientific" in this context I'm using it as I
think knowledgeable people here use it. As a bare minimum, all samples
in a scientific collection have been tested for viability. Not meeting
that bare minimum requirement makes a collection "unscientific" right
off the bat. There would be quite a number of other factors as well,
of course.
multiple instances of the same piece of malware will a) be
present,

Several good scanners identify a sample as the POOP Trojan and no
other samples are allowed in the Trojan category bed identified as
POOP or its alias names. What do we have here? There's the remote
possibility that several good scanners have all misidentified POOP.
But we're not interested in using just one sample. We're interested in
using at least several hundred ... say 1,000 all chosen in the same
way. Now, you have to assign some unknown but reasonable probability
figure that several scanners will all misidentify ... and then compute
from this unknown figure a probable number of duplicates. Further, you
would have to be concerned that that number is significant when using
the test bed to look for increases in detection of Trojans from 100 to
700 (10% to 70%). I say you're calculating "smoke" as we used to say
when some engineer was worried about some minute and insignificant
effect. And you're talking about "smoke".
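
The size of the effect can be bounded with a little arithmetic. The sketch below assumes a figure the post deliberately leaves unknown: that 20 of the 1,000 files are undetected duplicates. That number is purely hypothetical, and the bound only applies when the scanner-agreement screening is in place.

```python
def true_rate_bounds(hits, total, duplicates):
    """Bounds on the detection rate over unique samples.

    Assumes `duplicates` of the `total` files are extra copies of samples
    already in the bed, without knowing which ones they are.
    """
    unique = total - duplicates
    low = max(hits - duplicates, 0) / unique  # every duplicate was among the hits
    high = min(hits, unique) / unique         # no duplicate was among the hits
    return low, high

# Hypothetical: 20 hidden duplicates in a 1,000-file Trojan bed.
before = true_rate_bounds(100, 1000, 20)
after = true_rate_bounds(700, 1000, 20)
print(before, after)
print(after[0] - before[1])  # smallest possible true improvement
```

Under that assumption the true improvement is still at least about 59 percentage points, so the observed jump from 10% to 70% survives.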
and b) be counted as separate things... you cannot avoid
counting them as separate things if you don't know they are duplicates
and if you did know they were duplicates you wouldn't allow them to be
there in the first place...

also, while viability is something that can be tested for generically,
uniqueness is not... art's suggested method of letting the scanners do
the classification is a kludge and assumes that all the samples you're
using are detected and *identified* by at least one scanner

Wrong. Read what I wrote. I require that _several_ scanners all agree
before a sample is included.
(heuristic
detection may seem like a reasonable kludge for classification when
your concern is viability, but obviously does not help to establish
uniqueness)...

So turn off the scanner heuristics then. That's what I'd do.
of course the sampling bias generated by this method means that if a
product significantly improves its detection for samples that none of
the products could originally detect, you won't be able to see it...

Not interested in categories that several good scanners aren't already
quite proficient in handling. In fact, it's pure nonsense to even bring
it up.
that means such a test can only detect improvements and cannot detect
the lack of improvement... the conclusions one can draw from such a
test are quite limited...

It means nothing at all. You're inventing straw arguments again.


Art
http://www.epix.net/~artnpeg
 
A test bed that isn't "scientific" isn't necessarily uncontrolled.

agreed... however you did not initially give any additional
specifications on what you meant beyond "unscientific test" and i can't
read your mind... my thought experiment used the uncontrolled type of
testbed...
When I use the term "scientific" in this context I'm using it as I
think knowledgeable people here use it. As a bare minimum, all samples
in a scientific collection have been tested for viability.

great, but you were talking about unscientific tests - constraining
what you mean by 'scientific test' still leaves 'unscientific test's
fairly wide open...

i now know that you are referring to a test that uses a testbed where
non-viable and duplicate samples are weeded out by a sort of 'majority
vote' by a set of scanners you trust... i only know this because
FromTheRafters managed to coax these details out of you, however...
Not meeting
that bare minimum requirement makes a collection "unscientific" right
off the bat. There would be quite a number of other factors as well,
of course.

of course...
Several good scanners identify a sample as the POOP Trojan and no
other samples are allowed in the Trojan category bed identified as
POOP or its alias names. What do we have here? There's the remote
possibility that several good scanners have all misidentified POOP.
But we're not interested in using just one sample. We're interested in
using at least several hundred ... say 1,000 all chosen in the same
way. Now, you have to assign some unknown but reasonable probability
figure that several scanners will all misidentify ... and then compute
from this unknown figure a probable number of duplicates. Further, you
would have to be concerned that that number is significant when using
the test bed to look for increases in detection of Trojans from 100 to
700 (10% to 70%). I say you're calculating "smoke" as we used to say
when some engineer was worried about some minute and insignificant
effect. And you're talking about "smoke".

not so... i was talking about a situation where there is no quality
control on the testbed (since you originally made no specifications on
what, if any, kinds of controls would be present)... that's very
different from the situation where the quality control fails...
Wrong. Read what I wrote. I require that _several_ scanners all agree
before a sample is included.

art, "several scanners" happens to satisfy the "at least one scanner"
constraint...

on rereading the quote i think i may have misspoken in the previous
article... 'implies' rather than 'assumes'... it implies that all the
samples you're using in the test are detected and identified...
So turn off the scanner heuristics then. That's what I'd do.

?? perhaps you want to re-read that section - i don't need to turn off
the heuristics, i just can't use those particular types of results...
it's not a problem, it's just the reason why scanner based
classification requires identification rather than just detection...
Not interested in categories that several good scanners aren't already
quite proficient in handling. In fact, it's pure nonsense to even bring
it up.

who said anything about categories? why can't i be talking about
specimens that belong in categories that several good scanners *do*
handle but for whatever reason are not themselves handled yet?

and since you require agreement between several good scanners for
inclusion in your hypothetical unscientific test you're actually
increasing the potential size of the set of malware where improvements
will go unnoticed... imagine if you required agreement between all
scanners, then there'd be no room for improvement...
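
A small sketch of that limiting case, with a made-up detection table for three scanners A, B and C. Raising the number of scanners that must agree shrinks the bed, and at full agreement every scanner necessarily scores 100% on it.

```python
# Hypothetical table: sample -> set of scanners that identify it.
detections = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B"},
    "s3": {"A"},
    "s4": set(),
}

def bed(min_agree):
    """Samples admitted when at least `min_agree` scanners identify them."""
    return {s for s, who in detections.items() if len(who) >= min_agree}

def rate(scanner, samples):
    """Fraction of `samples` that `scanner` identifies."""
    return sum(scanner in detections[s] for s in samples) / len(samples)

for k in (1, 2, 3):
    samples = bed(k)
    omitted = len(detections) - len(samples)
    print(k, sorted(samples), omitted, round(rate("C", samples), 2))
```

Scanner C scores 0.33, 0.5 and 1.0 as the threshold rises, while the number of omitted samples grows from 1 to 3.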
It means nothing at all. You're inventing straw arguments again.

you mean a 'straw man'... perhaps i am, but really, it would be much
easier to avoid misrepresenting your position if you'd fully specify
your position in the first place, or further specify it when it becomes
clear that you've been too general...

so now i know we're talking about a testbed that's been classified by
several scanners in order to weed out duplicates and probable
non-viable samples... so we've hopefully eliminated the possibility of
unpredictable 'improvement' scaling factors but we've introduced the
problem of omitted population segments discussed previously... the
improvement trends you hope to discover may get missed due to the
self-selected sample bias...
 
who said anything about categories?

??? I did. I was talking about looking at categories of malware that
several good scanners test well in (according to quality tests) that
products to be tested by my method do not do well in. Or, conversely,
I also included that you could also see when a vendor suddenly decided
to drop detection in one of those same categories. It would be obvious
using my method over time when a vendor dropped detection of old DOS
viruses, for example.
why can't i be talking about
specimens that belong in categories that several good scanners *do*
handle but for whatever reason are not themselves handled yet?

I don't understand that sentence. But in order for me to defend my
method which you attacked as being worthless, I would hope that you
would stick to that topic and not wander off onto something else.
and since you require agreement between several good scanners for
inclusion in your hypothetical unscientific test you're actually
increasing the potential size of the set of malware where improvements
will go unnoticed... imagine if you required agreement between all
scanners, then there'd be no room for improvement...

In the case of checking on a scanner with weak Trojan detection, for
example, that scanner is not used in building up the test bed. I see
no problem. And a scanner used in building up the bed of old DOS
viruses can be tested later for a significant drop in detection in
this category.
you mean a 'straw man'... perhaps i am, but really, it would be much
easier to avoid misrepresenting your position if you'd fully specify
your position in the first place, or further specify it when it becomes
clear that you've been too general...

It would be better if you requested clarification before you rejected
my idea outright. You turned off any interest I had in further
discussion or clarification by pontificating and "instructing" me and
insulting me by referring me to a treatise on logic. That pissed me
off.
so now i know we're talking about a testbed that's been classified by
several scanners in order to weed out duplicates and probable
non-viable samples... so we've hopefully eliminated the possibility of
unpredictable 'improvement' scaling factors but we've introduced the
problem of omitted population segments discussed previously... the
improvement trends you hope to discover may get missed due to the
self-selected sample bias...

Omitted population segments?? Improvement trends get missed? What in
the hell are you talking about?


Art
http://www.epix.net/~artnpeg
 