Sunday, July 27, 2008

The Census Bureau's Random Groups Standard Error Estimator

Rather than a simple random sample, the Census Bureau relied upon a systematic sampling method to select sample units (where a sample unit was defined as either a household or a person living in group quarters) for the 1990 Census sample. This made things easier for census takers--instead of trying to develop a random sample of households in their area, they would administer the long form questionnaire to every sixth household (or every eighth household, or every other household, depending on what part of the country the households were in). This procedure should (hopefully) result in a sample that was more reflective of the population as a whole than a more haphazard scheme would have produced.
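To make the "every sixth household" idea concrete, here is a minimal Python sketch of systematic selection. The function name, the list of household IDs, and the random starting point are my own assumptions for illustration; the description above only says that every sixth (or eighth, or second) household received the long form.

```python
import random

def systematic_sample(units, interval, start=None, rng=None):
    """Return every `interval`-th unit, beginning at index `start`.

    If `start` is not given, a random start in [0, interval) is drawn.
    The random start is an assumption made for this sketch.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start is None:
        rng = rng or random.Random()
        start = rng.randrange(interval)
    return units[start::interval]

# 60 hypothetical household IDs, sampled one-in-six from the first household
households = list(range(1, 61))
sample = systematic_sample(households, interval=6, start=0)
print(sample)       # households 1, 7, 13, ..., 55
print(len(sample))  # 10 of the 60 households
```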

The formulas for the standard error in my earlier posts were for a simple random sample. One thing that survey statisticians like to do is to come up with different formulas when the sampling method is other than simple random sampling. Generally, your target is a smaller standard error, which makes your confidence intervals narrower and your survey estimates appear more precise.

For estimation for the 1990 Census, the US was divided into just over 60,000 distinct weighting areas, or areas for which sample weights were derived. A "sample weight" is the number of units in the larger population which a sample unit represents (for weighting purposes). For example, if the sampling ratio for households is one-in-six, then the initial guess at how many population households a sampled household represents is six (itself plus five others). As mentioned previously, everyone answered certain questions (on the Short Form), and a systematic sample was selected for the Long Form. Information from questions which everyone answered was used to adjust the sample weights. For example, since everyone answered the question on "race", the sample weights were adjusted to make certain that the sample estimates of persons by "race" equaled the population count of people by race. This hopefully made the estimates from the sample data more reflective of the population.
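A small Python sketch may help with the weighting idea. It starts every sampled person at a weight of six and then scales the weights within each race group so that they add up to the 100-percent count. The function name, the five-person sample, and the control totals are hypothetical. The post gives race as just one example, so the real adjustment presumably involved more than this single step.

```python
def ratio_adjust(weights, groups, control_totals):
    """Scale weights within each group so they sum to that group's control total."""
    weighted_sums = {}
    for w, g in zip(weights, groups):
        weighted_sums[g] = weighted_sums.get(g, 0.0) + w
    return [w * control_totals[g] / weighted_sums[g]
            for w, g in zip(weights, groups)]

# Hypothetical weighting area: 5 sampled persons, one-in-six sampling
groups = ["White", "White", "White", "Black", "Black"]
weights = [6.0] * len(groups)             # initial weight: itself plus five others
control = {"White": 21, "Black": 9}       # assumed 100-percent counts by race

adjusted = ratio_adjust(weights, groups, control)
print(adjusted)  # White weights rise to 7.0, Black weights fall to 4.5
```

After the adjustment, the weighted count of Whites is 21 and of Blacks is 9, matching the control totals exactly.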

For each of the 60,000 weighting areas, a standard error estimate, using the random-groups method, was computed for each of 1,804 sample data items.

Within each weighting area, sample units (a sample unit being either a housing unit or a person residing in a group quarter) were assigned systematically among 25 random groups.

For each of the 25 random groups, a separate estimate of the total for each of 1,804 sample data items was computed by multiplying the weighted count for the sample data item within the random group by 25. For each data item for which the total number of people with a particular characteristic was estimated from the sample data, the random-groups standard error estimate was then computed from the 25 different estimates of the total from the random groups.
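As a rough illustration, here is the textbook form of a random-groups standard error in Python: each group's weighted count is blown up by the number of groups, and the standard error is taken from the spread of those group-level totals around their mean. This is an assumption on my part about the general shape of the calculation; the exact formula the Census Bureau used may differ in its details. The weighted counts below are invented.

```python
import math

def random_groups_se(group_weighted_counts):
    """Textbook random-groups standard error of an estimated total.

    Each group's weighted count is multiplied by the number of groups k to
    give k separate estimates of the total. The variance of their mean is
    estimated as sum((t_g - mean)^2) / (k * (k - 1)).
    """
    k = len(group_weighted_counts)
    if k < 2:
        raise ValueError("need at least two random groups")
    group_totals = [k * c for c in group_weighted_counts]
    mean_total = sum(group_totals) / k
    sum_sq = sum((t - mean_total) ** 2 for t in group_totals)
    return math.sqrt(sum_sq / (k * (k - 1)))

# Hypothetical weighted counts for one data item in 25 random groups
counts = [40 + (g % 5) for g in range(25)]
print(sum(counts))               # overall weighted count: 1050
print(random_groups_se(counts))  # about 7.2
```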

This is hard to describe without formulas. You had better look this up in my paper:

For each data item within a weighting area, a design effect was calculated as the ratio of the random groups standard error estimator to the simple random sampling standard error for a one-in-six random sample (mentioned in my previous post).

For a state report of sample data, the design effects for each data item were averaged across the weighting areas in the state. Then, a generalized design effect for each data item type (for example, all data items that dealt with occupation) was computed. The generalized design effect was weighted in favor of data items that had higher population estimates.
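The two averaging steps can be sketched in Python as follows. The first function is a plain average across weighting areas. For the second, I am assuming that "weighted in favor of data items that had higher population estimates" means weights proportional to the estimates themselves; the actual weighting scheme may have been different. All numbers are invented.

```python
def average_design_effect(area_effects):
    """Plain average of one data item's design effects across weighting areas."""
    return sum(area_effects) / len(area_effects)

def generalized_design_effect(item_effects, item_estimates):
    """Average of item-level design effects, weighted by population estimate.

    Weighting in proportion to the estimates is an assumption of this sketch.
    """
    total = sum(item_estimates)
    return sum(d * y for d, y in zip(item_effects, item_estimates)) / total

# Hypothetical item type with two data items
item_effects = [
    average_design_effect([2.1, 1.9, 2.0]),  # item with a large estimate
    average_design_effect([0.5, 0.3, 0.4]),  # item with a small estimate
]
item_estimates = [900, 100]

print(generalized_design_effect(item_effects, item_estimates))  # about 1.84
```

Because the first item carries 90 percent of the weight, the generalized value sits close to that item's design effect of 2.0 rather than near the simple average of 1.2.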

In my paper, I present a hypothetical example of data that might have arisen from the random-groups method. For a weighting area in Vermont, weighted counts of Whites and Blacks are listed for the 25 random groups. In my hypothetical weighting area, there are no persons of other race. The standard errors assuming simple random sampling are the same for Whites and Blacks (as one would expect for a binomial variable). However, the random-groups standard error estimate is much higher for Whites than for Blacks. And, the design effect is nearly five times higher for the estimate of Whites than the estimate of Blacks. Since the generalized design effect computed for groups of data items was weighted in favor of data items that had higher population estimates, the generalized design effect computed for race for the state of Vermont was quite high.

Data on race were frequently included in 1990 U.S. census sample data products. Because race was asked of every census respondent (i.e., it was a census 100-percent data item), and because the weighting process used by the Census Bureau effectively forced the sample estimates by race to match the 100-percent Census counts by race, the standard errors for estimates of race probably should have been considered to be zero. However, generalized design effects were still published by race, although set to arbitrary constants for all reports (rather than as computed by this method).

More on my proposed modification next time.

Saturday, July 26, 2008

Appendices provided for the 1990 Census

Appendix C, Accuracy of the Data is available here:

Table C was left off. For Vermont, Table C is available in Table 1 of my paper:

Data users were instructed to determine a standard error for their Census estimate of interest, based upon a 1-in-6 sampling rate. They could accomplish this either by using a table provided in Appendix C, or by using a formula provided in Appendix C (similar to the formula above).

Then, data users were to multiply this result by a Design Factor provided in Table C. The Design Factor varied depending on the actual sampling rate (as mentioned in the Appendix, actual sampling rates varied from 1-in-8 to 1-in-2).
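In Python, the two steps a data user was asked to follow might look like the sketch below. The unadjusted standard error uses the one-in-six simple-random-sample result sqrt(5 x Y x (1 - Y/N)) worked out in the Finite Population Correction Factor post, where Y is the estimate and N is the population. The function names and the design factor of 1.2 are hypothetical; a real design factor would be read from Table C.

```python
import math

def unadjusted_se(estimate, population):
    """Standard error of an estimated total under one-in-six simple random sampling."""
    return math.sqrt(5.0 * estimate * (1.0 - estimate / population))

def adjusted_se(estimate, population, design_factor):
    """Unadjusted standard error multiplied by the published design factor."""
    return design_factor * unadjusted_se(estimate, population)

# Hypothetical: an estimate of 500 persons in an area of 1,000,
# with an assumed design factor of 1.2
print(unadjusted_se(500, 1000))     # about 35.4
print(adjusted_se(500, 1000, 1.2))  # about 42.4
```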

The Design Factor was an average of the ratios of the Random Groups standard-error estimator to the simple-random-sampling standard error (based on a 1-in-6 sampling ratio).

More on the Random Groups standard-error estimator next time.

Friday, July 25, 2008

The Finite Population Correction Factor

When making survey estimates for a finite population, the standard error is reduced by multiplying the standard error (from the previous post) by a Finite Population Correction Factor (FPC), which is:

FPC = sqrt((N - n)/N)

where N is the population size and n is the sample size. If your survey includes everyone in the population, a.k.a. a census, then N = n and your FPC = 0. Thus, your standard error is zero, because you know everything about everyone, and there is no uncertainty.

In practice, the FPC can be ignored when the sampling fraction is less than 10%. As a general rule, however, survey research organizations do not like to pass up an opportunity to make their estimates look better, and will not ignore the FPC.
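A quick Python sketch shows how the FPC behaves at different sampling fractions. The function name and the population of 600 are just for illustration.

```python
import math

def fpc(population_size, sample_size):
    """Finite population correction factor: sqrt((N - n) / N)."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample size must be between 1 and the population size")
    return math.sqrt((population_size - sample_size) / population_size)

print(fpc(600, 600))  # census: 0.0, so the standard error vanishes
print(fpc(600, 100))  # one-in-six sample: about 0.913
print(fpc(600, 30))   # 5% sample: about 0.975, close enough to 1 to ignore
```

At a 5% sampling fraction the correction shaves only about 2.5% off the standard error, which is why it is often dropped for small samples.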

For the 1990 Census, everyone responded to a certain number of questions, all of which were contained on the "Short Form." A sample of the population--the sampling rate varied across the country, but in most parts of the country the sampling rate was one-in-six--received a "Long Form" questionnaire. They answered all of the questions that were contained on the Short Form, plus more detailed questions on a variety of topics, such as how they commuted, what their occupation was, etc.

Including the FPC, the standard error for the estimate of the number of Whites becomes:

se(Nw) = FPC x N x sqrt(p x q/n)

= sqrt((N - n)/N) x N x sqrt(p x q/n)

= sqrt(1 - (n/N)) x N x sqrt(p x q/n)

Taking the sampling rate to be 1/6, or n/N = 1/6, this becomes:

se(Nw) = sqrt(1 - (1/6)) x N x sqrt(p x q/n)

= sqrt(5/6) x (N/sqrt(n)) x sqrt(p x q)

= sqrt(5/6) x sqrt(N/n) x sqrt(N) x sqrt(p x q)

= sqrt(5/6) x sqrt(6) x sqrt(Np) x sqrt(1 - p)

= sqrt(5 x Np x (1-p))

= sqrt(5 x Np x (1 - Np/N))

which is basically the result in the paper for the one-in-six simple-random-sample result. It may come out looking a bit more complicated than it needs to look.
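If you would rather check the algebra numerically, the following Python sketch computes the standard error both ways, from the full expression with the FPC and from the simplified one-in-six form, and confirms that they agree. The population of 6,000 and the proportion of 0.9 are made-up values.

```python
import math

def se_with_fpc(N, n, p):
    """Full form: sqrt((N - n) / N) x N x sqrt(p x q / n), with q = 1 - p."""
    return math.sqrt((N - n) / N) * N * math.sqrt(p * (1 - p) / n)

def se_one_in_six(N, p):
    """Simplified form for n/N = 1/6: sqrt(5 x Np x (1 - p))."""
    return math.sqrt(5 * N * p * (1 - p))

N, n, p = 6000, 1000, 0.9   # hypothetical area with n/N = 1/6
print(se_with_fpc(N, n, p))  # about 51.96
print(se_one_in_six(N, p))   # the same value
print(math.isclose(se_with_fpc(N, n, p), se_one_in_six(N, p)))
```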

Thursday, July 24, 2008

I will begin my blogging by first discussing the Random Groups Standard Error Estimation procedure, used by the US Census Bureau in the 1990 Census (and in previous Censuses), and which I described here:

In keeping with the example of my paper, suppose that there is a geographic area that has people of only two races: White and Black. You know the total number of people. You want to get an idea of how many are White and how many are Black, without asking everyone in the area. So, you ask a sample of the people in the area whether they are White or Black. Within your sample, you know the proportions of Whites and Blacks. To estimate the total number of Whites in the area, you simply multiply the proportion of Whites in your sample by the total number of people in the area. Similarly for Blacks.

Suppose that P represents the true proportion of Whites in the population, and that Q represents the true proportion of Blacks in the population. The population of this area consists only of Blacks and Whites. Hence, the two proportions add to one (P + Q = 1).

If your sample was a simple random sample of the population, then your best estimate of P is p, i.e. the number of Whites in your sample, divided by your total sample size:

p = nw / n

where nw is the number of Whites in your sample, and n is the total number of people (both Black and White) in your sample. Similarly,

q = nb / n,

where nb is the number of Blacks in your sample. Note here that p + q = 1.

Your estimate of the total number of Whites (Nw) in the population is N x p, and your estimate of the total number of Blacks (Nb) in the population is N x q.

The basic formula for the standard error of p is sqrt(p x q/n), which is the same as the formula for the standard error of q.

Sorry I don't have the ability to create and edit equations on this blog. Take sqrt to mean the square root.

If N is a fixed constant, then the standard error of the estimate of the total number of Whites in the population is:

se(Nw) = N x se(p) = N x sqrt(p x q/n).

The standard error of the estimate of the total number of Blacks in the population turns out to be identical:

se(Nb) = N x se(q) = N x sqrt(p x q/n).
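Here is the whole calculation as a short Python sketch, using the simple-random-sample formulas above with no finite population correction. The function name and the sample counts are hypothetical.

```python
import math

def estimate_totals(N, n_white, n_black):
    """Estimated totals and their common standard error under simple random sampling."""
    n = n_white + n_black
    p = n_white / n                   # sample proportion of Whites
    q = n_black / n                   # sample proportion of Blacks
    se_total = N * math.sqrt(p * q / n)
    return {"Nw": N * p, "Nb": N * q, "se": se_total}

# Hypothetical area of 6,000 people; a sample of 1,000 finds 900 Whites and 100 Blacks
result = estimate_totals(6000, 900, 100)
print(result["Nw"], result["Nb"])  # estimated totals: 5400 and 600
print(result["se"])                # about 56.9, the same for both estimates
```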

However, one should bear in mind that, in sampling theory, there is something called the finite population correction factor, which comes into play especially when your sample is relatively large compared to your population. If you completed a census of everyone in your population, then you know exactly the number of Whites and Blacks in your population. Your numbers are not based on sampling, and there is no sampling error.

More on the Finite Population Correction Factor next time.

Tuesday, July 22, 2008

Welcome to my Blog

You can find details about me and my background by clicking here:

I am presently offering my services as a consultant, with expertise in statistics, economics and epidemiology, after most of a lifetime spent in gathering this expertise.

My goal in this blog is to provoke thoughts, and to present ideas and information, that will be of interest to statisticians and to others who use statistics and statistical methods, whether applying analytic techniques or simply dealing with statistical stuff. Hence, I'm hoping that economists, epidemiologists, physicists, political scientists, housewives who watch Oprah, and others will find my blog to be intriguing.

I will talk not only about statistical methods, but also about the political and social structures to which people who perform statistical functions must, in one way or another, adapt.