Mixed numerical and textual data

  • Thread starter Thread starter jimbo
  • Start date Start date
J

jimbo

When 50% of a data set is a text value, and the rest is numeric,
typically the average is calculated by either making the text value a
numeric, then calculating the mean, or, ignoring the text value and
average the numeric values only.

For instance, the lowest detectable limit for serum aluminum may be
<5, which is a text value. The remaining data ranges between 5-100.
Making all <5s = 5 will overestimate the mean because the <5 values
range between 0 and 4.9, not 5. By ignoring the <5 is worse because
if the distribution is gaussian, the mean should be close to 47, where
the other approach has a mean around 20.

But there should be a way to take the slope of the distribution as it
approaches the lower reportable limit and the number of text values
then determine the numeric equivalent of the values <5 and then, with
the numeric values, determine the mean of the data set as well as
other statistical measurements.

Does anyone know of such a calculation, a similar, or equivalent
calculation?
 
When 50% of a data set is a text value, and the rest is numeric,
typically the average is calculated by either making the text value a
numeric, then calculating the mean, or, ignoring the text value and
average the numeric values only.

For instance, the lowest detectable limit for serum aluminum may be
<5, which is a text value. The remaining data ranges between 5-100.
Making all <5s = 5 will overestimate the mean because the <5 values
range between 0 and 4.9, not 5.  By ignoring the <5 is worse because
if the distribution is gaussian, the mean should be close to 47, where
the other approach has a mean around 20.

But there should be a way to take the slope of the distribution as it
approaches the lower reportable limit and the number of text values
then determine the numeric equivalent of the values <5 and then, with
the numeric values, determine the mean of the data set as well as
other statistical measurements.

Does anyone know of such a calculation, a similar, or equivalent
calculation?

Replace "<5" with 2.45.

Or some other value if you know the distribution of the values between
0 & 4.9

Alan Lloyd
 
jimbo said:
For instance, the lowest detectable limit for
serum aluminum may be <5, which is a text value.
The remaining data ranges between 5-100. [....]
But there should be a way to take the slope of
the distribution as it approaches the lower
reportable limit and the number of text values
then determine the numeric equivalent of the
values <5 and then, with the numeric values,
determine the mean of the data set as well as
other statistical measurements.

Does anyone know of such a calculation, a similar,
or equivalent calculation?

I think what you have in mind is:

1. Prorate the average of the ungrouped numerical amounts (5 to 100)
to the grouped text amounts ("<5", which I assume is 0 to 5).

2. Compute the weighted average of the two.

Suppose you have 1000 data points in A1:A1000, some numerical and the rest
being "<5".

AVERAGE(A1:A1000) is the average of the ungrouped numerical amounts because
AVERAGE ignores empty cells and cells with text.

Similarly, COUNT(A1:A1000) and MAX(A1:A1000) are the count and max of the
ungrouped numerical amounts.

COUNTA(A1:A1000) is the count of both ungrouped numerical and grouped text
amounts, ignoring any empty cells.

The prorated average of grouped amounts ("<5") is:

=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)

And the weighted average of the two groups (numericals and "<5") is:

=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)

Of course, you can simplify and optimize by calculating AVERAGE(A1:A1000),
COUNT(A1:A1000) and COUNTA(A1:A1000) one time in helper cells.

Note: Instead of "5*...", you might want to substitute the smallest
measurable amount less than 5. For example, if your measuring instrument
has a precision of 0.01, use "4.99*..." instead of "5*...". However, keep
"...-5" as is because that is the lowest measurable value of the ungrouped
numerical amounts.
 
Errata....
The prorated average of grouped amounts ("<5") is:

=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)

And the weighted average of the two groups (numericals and "<5") is:

=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)

I tried to write the formula so that there was no assumption of an upper
bound for the ungrouped numerical amounts, even though you said it is 1000.
In some situations, we might have know the upper bound.

However, obviously that formula does not work correctly if all the numerical
amounts are the same, notably 5 or even 5.01.

So I guess we must always know some upper bound; that makes sense for
mapping the distribution. And MAX(A1:A1000)-5 should be 1000-5 = 95.
 
jimbo said:
For instance, the lowest detectable limit for
serum aluminum may be <5, which is a text value.
The remaining data ranges between 5-100. [....]
But there should be a way to take the slope of
the distribution as it approaches the lower
reportable limit and the number of text values
then determine the numeric equivalent of the
values <5 and then, with the numeric values,
determine the mean of the data set as well as
other statistical measurements.
Does anyone know of such a calculation, a similar,
or equivalent calculation?

I think what you have in mind is:

1. Prorate the average of the ungrouped numerical amounts (5 to 100)
   to the grouped text amounts ("<5", which I assume is 0 to 5).

2. Compute the weighted average of the two.

Suppose you have 1000 data points in A1:A1000, some numerical and the rest
being "<5".

AVERAGE(A1:A1000) is the average of the ungrouped numerical amounts because
AVERAGE ignores empty cells and cells with text.

Similarly, COUNT(A1:A1000) and MAX(A1:A1000) are the count and max of the
ungrouped numerical amounts.

COUNTA(A1:A1000) is the count of both ungrouped numerical and grouped text
amounts, ignoring any empty cells.

The prorated average of grouped amounts ("<5") is:

=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)

And the weighted average of the two groups (numericals and "<5") is:

=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)

Of course, you can simplify and optimize by calculating AVERAGE(A1:A1000),
COUNT(A1:A1000) and COUNTA(A1:A1000) one time in helper cells.

Note:  Instead of "5*...", you might want to substitute the smallest
measurable amount less than 5.  For example, if your measuring instrument
has a precision of 0.01, use "4.99*..." instead of "5*...".  However, keep
"...-5" as is because that is the lowest measurable value of the ungrouped
numerical amounts.

What you have here is "censored" data - that is you know that the data
exists (and where) but not the actual value. If the data has a
recognizable distribution such as the log-normal (ie if your data
ibecomes normal if you log it) then there may well be a way of
estimating the mean. This seems to be the sort of think that you are
hinting at.
As a first step,you could try logging the data and drawing a histogram
of it. If the logged data looks normal with stuff on the left side
missing then you have censored log normal data. xt
 
Back
Top