Sampling

  • Thread starter Thread starter Rod Urand
  • Start date Start date
R

Rod Urand

Hi all,

I have a list of 1500+ employee id's. Each record contains the
employee ID, their gender and their ethnicity.

I need a random sampling of 500 records that is representative of the
total counts. For example, if there is an 85% ratio of females, then
85% of the random sample must also be female.

This is a three column worksheet - EmplID, Ethnicity, Gender

What are my steps? I tried looking at the Analysis toolpak but i am
not sure which tool to use.
 
in a parallel range, you can enter =RAND() and fill down, then sort by this
column, then look at the first 500 records.
 
Bob,

Thanks for that, not sure it will work. I need to ensure the sample
has precisely the right ratios according to the existing data.

For example, in the list there are 1500 people. 56% are women, 44% are
men.

Of the 500 sample records, 56% or 280 must be women, 44% or 220 must
be men.
 
You will need to "subdivide" the categories and take the first .56 * 500
from the women and the first .44 * 500 men.
 
Rod Urand said:
Thanks for that, not sure it will work. I need to
ensure the sample has precisely the right ratios
according to the existing data.

First, in general, you can never guarantee "precisely the right ratios" due
to quantization limitations. For example, if 25% of 100 people are female,
you cannot randomly select 47 people such that "precisely" 25% are female.

Second, a truly random sampling cannot "guarantee" even close to the same
ratios of all characteristics (you mentioned ethnicity as well as gender).
A truly random sampling can only guarantee certain ratios with some
probability.

If you were interested in guaranteeing the same ratio (within quantization
limitations) of only two categories, e.g. male and female, you could split
the population into male and female, and then take a random sampling of
INT(N*56%) from the male subgroup and a random sampling of N - INT(N*56%)
from the female subgroup. (Note: Do not try to take ROUND(N*56%,0) and
ROUND(N*44%,0). That might not total N.)

Likewise, if you want the same ratio of male/female and white/non-white,
say, you could make 4 subgroups based on the pairwise characteristics. If
43% of males are white, and 47% of females are white, you could take the
following random samplings from each subgroup: INT(M*43%) from white males
and M - INT(M*43%) from nonwhite males, and INT(F*47%) from white females
and F - INT(F*47%) from nonwhite females, where M = INT(N*56%) and F = N -
INT(N*56%).

But as you can imagine, that strategy breaks down very quickly as the number
of pairwise (or "morewise") characteristics increases. Furthermore, the
quantization error increases as the size of subgroups decreases.

At that point, it is best to take truly random samplings of a "sufficient
size" from the entire population, and let the chips fall where they may.

What constitutes a "sufficient size" depends on the standard deviation and
the population size. There are ways to calculate it in simple cases. But
it may be difficult to calculate when there are many characteristics to
consider.

Actually, it is best to take multiple truly random samplings of the same
size, and average the results. But that is not always feasible.


----- original message -----

Rod Urand said:
Bob,

Thanks for that, not sure it will work. I need to ensure the sample
has precisely the right ratios according to the existing data.

For example, in the list there are 1500 people. 56% are women, 44% are
men.

Of the 500 sample records, 56% or 280 must be women, 44% or 220 must
be men.


----- previous message -----
 
Rod -

It sounds like you should sort initially by the Ethnicity and Gender. Then
within each grouping, you can use Bob's suggestion of =RAND(), sort by this
column within each ethnicity/gender group, and take the top 1/3 of each group.

It is not true random sampling if you are required to make the same
breakdown, but it sounds like that is what you want.
 
Back
Top