Willard Losinger's Statistical Blog

Tuesday, May 6, 2014

Overanalysis and Analytic Quackery

Theoharakis and Skordia (2003) noted that “the recognition and development of an academic institution depends heavily on its faculty’s publication record in prestigious journals. As a result, an increased emphasis is placed on publishing in refereed journals and promotion criteria rest heavily on the faculty’s publication record.” The most prestigious scientific journals are more likely to accept for publication an article that includes a sophisticated-looking statistical model that serves as a hook. Thus, many scientific researchers are often under pressure to produce sophisticated-looking statistical models. Indeed, an article that used relatively simple statistical techniques would be difficult to pass through the review process of many scientific journals. Thus, statistical consultants who do not wish to be labeled as “incompetents” are obliged to come up with models that provide an air of sophistication.

Vardeman and Morris (2003) provided some advice regarding statistics and ethics for young statisticians: “Resolve that if you submit work for publication, it will be complete and represent your best effort. Submitting papers of little intrinsic value, half-done work, or work sliced into small pieces sent to multiple venues is an abuse of an important communication system and is not honorable scholarship…never borrow published/copyrighted words, even of your own authorship, without acknowledgement. To do so is plagiarism and is completely unacceptable,” etc. etc. etc. Vardeman and Morris (2003) go on to say: “Society also recognizes that when statistical arguments are abused, whether through malice or incompetence, genuine harm is done…Society disdains hypocrisy…(and)…has contempt for statisticians and statistical work that lack integrity…Principled people consistently do principled work, regardless of whether it serves their short-term personal interests. Integrity is not something that is turned on and off at one’s convenience. It cannot be generally lacking and yet be counted on to appear in the nick of time when the greater good calls.” Yada yada yada.

Society may have contempt for statisticians (and for statistical work that lacks integrity). Principled people may, in principle, consistently do principled work. The abuse of statistical arguments may occasionally cause genuine harm. However, society has considerable tolerance for hypocrisy. Integrity is frequently a matter of convenience--particularly when it comes to getting published in prestigious journals. And, plagiarism is generally well accepted, as long as it is done in a subtle manner. You don’t want to draw too much attention to it, so that no-one is likely to notice or care.

To provide a tangible example: a paper on the association between bovine-leukosis virus (BLV) and herd-level productivity on US dairy farms (Ott, Johnson and Wells, 2003) represents at least the fourth in a series of very similar statistical analyses that came from the same set of data (i.e., the Dairy ’96 Survey, which was conducted by the National Animal Health Monitoring System, NAHMS, of the United States Department of Agriculture, USDA). The first analysis, which examined the economics of Johne’s disease on dairy farms, was presented initially in a report that was published by the USDA (1997), and which is also available on line (http://www.aphis.usda.gov/vs/ceah/ncahs/nahms/dairy/Dairy96/DR96john.pdf). The same methods and results were presented again by Ott, Wells and Wagner (1999). Articles on economic impacts of Bovine Somatotropin (BST) (Ott and Rendleman, 2000) and bulk-tank somatic-cell counts (BTSCC) (Ott and Novak, 2001) followed. The statistical models of Ott, Johnson and Wells (2003) are given in Table 1, and the other statistical models appear in Table 2.

In most of the papers, the principal variable for analysis was what Ott, Johnson and Wells (2003) termed the “Annual Value of Production” (AVP), which Ott, Wells and Wagner (1999) called “Annual Adjusted Value of Production,” and which Ott and Novak (2001) called the “Value of Dairy Herd Productivity.” AVP was derived on an annual per-cow basis as the sum of the value of milk production (milk priced at 28.6 cents/kg) and the value of newborn calves (valued at $50 each), minus the net replacement cost. The “net replacement cost” was the cost of replacements (priced at $1100 each), minus the value of cows sold to other producers (priced at $1100 each) and to slaughter ($400 for cows in good condition, $250 for poor-condition cows). Ott and Rendleman (2000) analyzed “Non-Milk Productivity” (AVP minus the value of milk production). In addition, Ott and Rendleman (2000), Ott and Novak (2001), and Ott, Johnson and Wells (2003) used milk production per cow as a variable for analysis.
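The AVP arithmetic described above can be written out as a short sketch. The unit values are the ones quoted in the text; the function name, the argument names, and the example herd are hypothetical and serve only to illustrate the calculation.

```python
# Unit values as quoted in the text (US$).
MILK_PRICE_PER_KG = 0.286      # 28.6 cents/kg
CALF_VALUE = 50.0              # per newborn calf
REPLACEMENT_COST = 1100.0      # per replacement purchased
DAIRY_SALE_VALUE = 1100.0      # per cow sold to another producer
SLAUGHTER_VALUE_GOOD = 400.0   # per cull cow in good condition
SLAUGHTER_VALUE_POOR = 250.0   # per cull cow in poor condition


def annual_value_of_production(milk_kg, calves, replacements,
                               sold_dairy, culled_good, culled_poor):
    """Return AVP in US$ per cow; every argument is a per-cow annual quantity."""
    net_replacement_cost = (
        replacements * REPLACEMENT_COST
        - sold_dairy * DAIRY_SALE_VALUE
        - culled_good * SLAUGHTER_VALUE_GOOD
        - culled_poor * SLAUGHTER_VALUE_POOR
    )
    return milk_kg * MILK_PRICE_PER_KG + calves * CALF_VALUE - net_replacement_cost


# Hypothetical herd (per cow, per year); not taken from the Dairy '96 data.
example_avp = annual_value_of_production(
    milk_kg=8000, calves=0.9, replacements=0.35,
    sold_dairy=0.05, culled_good=0.20, culled_poor=0.10,
)
print(round(example_avp, 2))
```

Every herd is valued at the same fixed prices, which is exactly the feature of AVP that the next paragraph questions.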

Creating a dependent variable by combining dollar-values attributed to various input- and output-quantities (as Ott, Johnson and Wells, 2003, did for AVP) is a rather unusual technique for analyzing multi-output production. In theory, producers make production decisions based upon the prices (and other constraints) that they face. Prices that producers receive and pay may vary considerably from one producer to the next. One producer may make production decisions that are very different from another, but that are appropriate given that producer’s conditions. Assigning the same dollar-value to outputs for all producers, and then summing the results to generate a dependent variable for analysis, may lead to erroneous conclusions that some producers are achieving higher profits than others based upon certain independent variables when, in fact, all may be maximizing profit given their particular constraints.

In proceeding from Ott, Wells and Wagner’s (1999) models to the models of Ott and Rendleman (2000) (Table 2), the “Johne’s Disease” variables were dropped, and the functional form for percent BST use was transformed from the square root to a quadratic expression. Ott, Wells and Wagner (1999) chose a square-root representation for percent BST use because “initial analysis demonstrated a non-linear relationship between milk production and percent BST use,” and “in part because of the large number of herds that did not use any BST.” Ott and Rendleman (2000) used a quadratic term “to measure a potential declining marginal physical product of milk production as rBST increases.” Ott and Novak (2001) used a simple linear term for percent BST use. Because “Ott and Rendleman (2000) found that as the percentage of cows being administered BST rose, the associated marginal increase in milk yield became smaller,” Ott, Johnson and Wells (2003) reverted to a square-root representation for percent BST use.
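To make the difference between these functional forms concrete, here is a minimal sketch. Only the square-root coefficient for AVP (29.45, from Table 1) comes from the source; the function names are hypothetical, and the quadratic and linear forms are left with coefficients as parameters rather than filled with guessed values.

```python
import math

B_SQRT_AVP = 29.45  # Table 1: square-root term for % of cows administered rBST (US$ per cow)


def sqrt_term(pct, b):
    """Square-root form: b * sqrt(pct)."""
    return b * math.sqrt(pct)


def sqrt_marginal(pct, b):
    """Derivative of the square-root form with respect to pct (pct > 0)."""
    return b / (2.0 * math.sqrt(pct))


def quadratic_term(pct, b1, b2):
    """Quadratic form: b1 * pct + b2 * pct**2 (marginal effect falls if b2 < 0)."""
    return b1 * pct + b2 * pct ** 2


def linear_term(pct, b):
    """Linear form: constant marginal effect b."""
    return b * pct


for pct in (25, 100):
    print(pct, round(sqrt_term(pct, B_SQRT_AVP), 2),
          round(sqrt_marginal(pct, B_SQRT_AVP), 4))
```

Under the square-root form the fitted marginal effect at 100% use is half of that at 25% use, whatever the data say; the declining marginal product is built into the functional form.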

A new variable introduced by Ott, Johnson and Wells (2003) was the percent of cows in third or greater lactation (via “piece-wise regression”). In addition, Ott, Johnson and Wells (2003) added two new “management index” variables that resulted from a “correspondence analysis” that combined 24 variables into 2. In previous analyses, the use of Dairy Herd Improvement Association (DHIA) records “served as a proxy measure for management capability” (Ott, Wells and Wagner, 1999). Ott and Novak (2001) stated that they had attempted to combine 18 variables of management practice into four management indices, using factor analyses, to account for the influence of management ability on AVP. Ott and Novak (2001) decided to use DHIA records as a measure of management ability because 83% of the increase in the R-squared value could be obtained from the use of DHIA records, and because including additional management variables reduced the number of respondents with complete information by 6%. The R-squared values for the various models were not substantially different across analyses for the same dependent variable (Table 2).
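The piece-wise term can be illustrated with the two AVP coefficients reported in Table 1 (6.24 for "% of herd" and -10.91 for "% in excess of 37%"). This sketch assumes that the second term enters the model as max(0, pct - 37); the source does not spell out the exact construction, and the function name is hypothetical.

```python
B_PCT_HERD = 6.24      # Table 1: % of herd in third lactation (US$ per cow)
B_EXCESS = -10.91      # Table 1: % in excess of 37% (US$ per cow)
KNOT = 37.0            # breakpoint named in the Table 1 row label


def third_lactation_contribution(pct):
    """Piece-wise linear contribution to AVP (assumed hinge at KNOT)."""
    return B_PCT_HERD * pct + B_EXCESS * max(0.0, pct - KNOT)


for pct in (20, 37, 50):
    print(pct, round(third_lactation_contribution(pct), 2))
```

Under that assumption the slope is 6.24 below the breakpoint and 6.24 - 10.91 = -4.67 above it, so the fitted contribution peaks at 37%.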

Ott, Johnson and Wells (2003) described the creation of the sample weights for use in the analysis. The sample weight indicates the number of farms in the population that each farm in the sample represents. Because large farms (that account for a large portion of the animal population) are sampled at a much higher rate than the more numerous small farms (that account for a small proportion of the animals), large farms typically receive much smaller sample weights than small farms in NAHMS national studies (Losinger, 2002). Thus, responses from small farms tend to have a greater impact on farm-level estimates than responses from large farms. For animal-level estimates from NAHMS surveys, it is customary to modify the sample weights to reflect the number of animals (rather than the number of farms represented) by multiplying the sample weight by the number of animals (Losinger, 2002). Thus, large farms tend to receive much higher animal-level weights (which are emblematic of the number of animals that each participating operation represents) than small farms. Ott, Johnson and Wells (2003) used farm-level rather than animal-level weights in their AVP and milk-production models. The model for milk production is in terms of kg per cow per operation (rather than kg per cow). Using farm weights for animal-level estimates can yield highly inaccurate results.
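A small sketch shows how much the choice of weight can matter. The three farms below are entirely hypothetical (they are not NAHMS data), and the variable names are illustrative; the animal-level weight is formed as described above, by multiplying the farm-level sample weight by the number of cows.

```python
# (farm-level sample weight, number of cows, milk in kg per cow) -- hypothetical
sample = [
    (200, 40, 7000),     # small farm, represents many farms
    (200, 60, 7500),     # small farm, represents many farms
    (10, 1500, 10000),   # large farm, sampled at a high rate
]


def weighted_mean(values, weights):
    """Ordinary weighted mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


milk = [m for _, _, m in sample]
farm_weights = [w for w, _, _ in sample]
animal_weights = [w * cows for w, cows, _ in sample]

farm_level = weighted_mean(milk, farm_weights)      # "average farm's" kg per cow
animal_level = weighted_mean(milk, animal_weights)  # "average cow's" kg

print(round(farm_level, 1), round(animal_level, 1))
```

With these made-up numbers the farm-weighted figure is about 7317 kg and the animal-weighted figure about 8457 kg: the first describes the typical operation, the second the typical cow, and they are not interchangeable.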

Ott, Johnson and Wells’ (2003) computation of the reduction in equilibrium milk production was based on a $59 decline in AVP for BLV-positive herds, in addition to the demand and price elasticities for milk (Ott, Johnson and Wells, 2003). Ott, Johnson and Wells (2003) should have used the decline in milk production for BLV-positive herds, rather than the decline in AVP, because they were analyzing changes in equilibrium milk production. Analyses for the demand and supply of calves and culled cows should have been performed separately.

Ott, Johnson and Wells’ (2003) determination of a $285 million economic-surplus loss for producers, a $240 million economic-surplus loss for consumers, and a consequent sum-total loss to the economy of $525 million (due to reduced milk production in BLV-positive herds), differs substantially from what economic theory would ordinarily suggest. The presence of BLV in dairy cows may reduce milk production. Reduced milk production causes the equilibrium market price for milk to rise while the quantity falls. While a loss in economic surplus accrues to consumers, a portion of this loss is transferred to producers as an economic gain (Nicholson, 1995). Therefore, the loss to the economy is not the sum-total of economic-surplus losses experienced by producers and consumers. Ott, Johnson and Wells (2003) failed to provide precise details on how exactly they measured economic losses from reduced milk production due to BLV. Ott, Johnson and Wells (2003) made reference to the procedure described by Ott, Seitzinger and Hueston (1995), but state that they did not include “losses associated with potential lost international trade” (which Ott, Seitzinger and Hueston, 1995, had emphasized). The fact that Ott, Johnson and Wells’ (2003) estimate of total loss to the economy (as a result of reduced milk production attributed to BLV in dairy cows) equaled the sum of the changes in producer and consumer surplus, suggests that Ott, Johnson and Wells (2003) may have either double-counted the economic surplus that transferred between consumers and producers as a result of reduced milk production attributed to BLV in dairy cows, or ignored the transferred surplus when computing the change in either producer or consumer surplus.
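The bookkeeping at issue can be sketched with a toy market. Everything here is an assumption made for illustration: linear demand and supply, a parallel upward shift in supply standing in for the disease, and made-up parameter values. It does not reproduce the authors' procedure, which they did not describe in detail.

```python
def equilibrium(a, b, c, d):
    """Demand P = a - b*Q, supply P = c + d*Q; return (quantity, price)."""
    q = (a - c) / (b + d)
    return q, a - b * q


def surplus_changes(a, b, c, d, shift):
    """Surplus changes when the supply intercept rises from c to c + shift."""
    q0, p0 = equilibrium(a, b, c, d)
    q1, p1 = equilibrium(a, b, c + shift, d)
    d_cs = 0.5 * b * (q1 ** 2 - q0 ** 2)   # change in consumer surplus
    d_ps = 0.5 * d * (q1 ** 2 - q0 ** 2)   # change in producer surplus, net of transfer
    transfer = (p1 - p0) * q1              # paid by consumers, received by producers
    return {
        "consumer": d_cs,
        "producer_net": d_ps,
        "transfer": transfer,
        "producer_ignoring_transfer": d_ps - transfer,
        "total_net": d_cs + d_ps,
        "total_ignoring_transfer": d_cs + d_ps - transfer,
    }


result = surplus_changes(a=10.0, b=1.0, c=2.0, d=1.0, shift=2.0)
for key, value in result.items():
    print(key, round(value, 2))
```

In this toy case consumers lose 3.5, of which 3.0 is a transfer to producers. If the producer figure already nets out that transfer, the two losses add up to 7.0; if the producer figure ignores it, the "total" comes to 10.0 and is overstated by exactly the transfer. Which of these situations applies to the published $525 million cannot be determined without the details that the authors omitted.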

Ott, Johnson and Wells (2003) determined “marginal effects associated with a percentage-point change in herd-level seropositivity of BLV” as the coefficient associated with the “BLV-prevalence” variable that resulted from replacing their model’s dependent variable (AVP) with the various individual components of AVP (in terms of both quantity and attributed-dollar value) (Table 3). Ott, Wells and Wagner (1999) followed the same procedure to establish the “marginal impact of Johne’s disease on dairy production parameters” (Table 4). This procedure is inappropriate, because factors that influence one component of production would be expected to differ substantially from factors that influence another. Some components, particularly the number of calves born, would not be expected to have a normal distribution (therefore, a linear-regression model would not apply). A Poisson distribution would have been more likely for this variable, and the authors should have considered a Poisson regression. The R-squared values were quite low for some of the components (0.08 for the number of calves born, 0.09 for cow mortality, and 0.11 for cows sold to other producers), and demonstrated that the predictive power of the model of Ott, Johnson and Wells (2003) was rather poor when applied to many of the individual components of AVP.
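For readers unfamiliar with Poisson regression, here is a minimal sketch of the idea for a single explanatory variable, fitted by Newton's method in plain Python. The data are hypothetical, survey weights are ignored, and the function name is illustrative; a real analysis of the NAHMS data would need software that handles the sample design.

```python
import math


def fit_poisson(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1*x by Newton's method; return (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system analytically.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical counts (e.g. calves born) against a hypothetical exposure score.
x = [0, 1, 2, 3, 4, 5]
y = [5, 4, 4, 3, 2, 2]
b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(round(b0, 3), round(b1, 3))
```

The fitted means are always positive and the coefficient is interpreted on the log scale (a one-unit change in x multiplies the expected count by exp(b1)), which suits a count variable better than a straight line with normally distributed errors.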

The models of Ott, Wells and Wagner (1999) had Johne’s disease in terms of positive or negative herds, and in terms of no culled cows with clinical signs, >0 but < 10% of culled cows with clinical signs, and >10% of culled cows with clinical signs (Table 2). The models of Ott and Novak (2001) were based on a low, medium and high differentiation for BTSCC. Some milk processors pay producers less when BTSCC is elevated, or pay premiums for milk with low BTSCC levels (Ott and Novak, 2001). This implies different demand curves for milk based on the level of BTSCC. Differences in the construction of the variable of interest, in addition to the fact that this was the first analysis that incorporated elasticities, render questionable the comparisons offered by Ott, Johnson and Wells (2003).

Results from separate model equations analyzing the economic costs of Johne’s disease, BTSCC and BLV do not imply that the economic benefit of eliminating all three conditions would equal the sum of the economic costs associated with each condition. Each regression model invokes the ceteris paribus assumption. If Johne’s disease is eliminated before (or in tandem with) BTSCC and BLV, then ceteris are no longer paribus. Estimating the cumulative impact of eliminating all of these conditions would require a model that incorporated all of these variables, and that included a covariance analysis. Ott, Johnson and Wells (2003) did perform a multicollinearity test for their explanatory variables (which included BLV and BTSCC, but not Johne’s disease), and considered multicollinearity “not to be a problem” because “the maximum association of any single explanatory variable with the others was <50 […]

Finally, the limitations inherent in performing repeated analyses from the same set of data must be vigorously emphasized. When one carries out multiple analyses to develop models that fit the data well, the ability of the models to make predictions from new data may be considerably less than the R-squared values would suggest (Neter and Wasserman, 1974). The models of Ott, Johnson and Wells (2003) (and of the preceding economic analyses from the Dairy ’96 Study) do indicate some relationships between disease and production. However, over-analysis and excessive data-tweaking can cause “statistical significance” to lose its meaning, however impressive the final results may appear.

Some researchers may question the ethics of using similar statistical models multiple times. For example, Vardeman and Morris (2003) state: “Resolve that if you submit work for publication, it will be complete and represent your best effort. Submitting papers of little intrinsic value, half-done work, or work sliced into small pieces sent to multiple venues is an abuse of an important communication system and is not honorable scholarship.” Many of the methods and results that formed the basis of the four economics articles that came from the NAHMS Dairy ’96 Study were very similar, and probably could have been combined into one paper. Vardeman and Morris (2003) also say: “never borrow published/copyrighted words, even of your own authorship, without acknowledgement. To do so is plagiarism and is completely unacceptable.” Parts of the descriptions of the analytic procedures in various places across the four economics papers from the NAHMS Dairy ’96 Study are almost identical. For example, in describing the multicollinearity tests, Ott, Wells and Wagner (1999) wrote: “The maximum association of any one explanatory variable with the others was <50 […]

Finally, most scientific researchers would agree with the statement of the National Institute of Standards and Technology (1994) that “a measurement result is complete only when accompanied by a quantitative statement of its uncertainty. The uncertainty is required in order to decide if the result is adequate for its intended purpose and to ascertain if it is consistent with other similar results.” Ott, Johnson and Wells (2003) concluded that BLV in dairy cows caused a $525 million loss to the economy because of reduced milk production, and provided no statement of their estimate’s uncertainty. The computation was based partially on an elasticity of demand (for milk) provided by Wohlgenant (1989), and on an elasticity of supply (for milk) provided by Adelaja (1991), neither of whom examined the uncertainty of their elasticities. Computer programs are widely available for computing uncertainties. For example, @RISK 4.5 (Palisade Corporation, 2002) allows users to specify the uncertainty involved in all key variables, with numerous probability density functions. The GUM Workbench (Metrodata GmbH, 1999) follows guidelines established by the European Co-operation for Accreditation (1999) for computing, combining, and expressing uncertainty in measurement.
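The same idea can be sketched without commercial software, using a simple Monte Carlo simulation. All of the numbers below are hypothetical: the elasticities and their standard deviations are placeholders (not the values of Wohlgenant or Adelaja), and the formula is the usual small-shift approximation for a proportional leftward supply shift, not the authors' procedure.

```python
import random

SUPPLY_SHIFT = 0.01  # hypothetical 1% reduction in supply at any given price


def simulate_price_change(n_draws=10000, seed=1):
    """Return (2.5th percentile, median, 97.5th percentile) of the
    proportional rise in equilibrium price, shift / (e_supply - e_demand)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_draws):
        e_demand = rng.gauss(-0.3, 0.05)   # hypothetical demand elasticity
        e_supply = rng.gauss(0.5, 0.10)    # hypothetical supply elasticity
        results.append(SUPPLY_SHIFT / (e_supply - e_demand))
    results.sort()
    return (results[int(0.025 * n_draws)],
            results[n_draws // 2],
            results[int(0.975 * n_draws)])


low, mid, high = simulate_price_change()
print(f"price rise: {mid:.4f} (95% interval {low:.4f} to {high:.4f})")
```

Even this crude exercise yields an interval rather than a single number, which is the minimum that a reader needs in order to judge whether an estimate such as $525 million is adequate for its intended purpose.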

References

Adelaja, A.O., 1991. Price changes, supply elasticities, industry organization, and dairy output distribution. Am. J. Agric. Econ. 73, 89-102.
Debertin, D.L., 1986. Agricultural Production Economics. Macmillan Publishing Company, New York.
European Co-operation for Accreditation, 1999. Expression of the Uncertainty of Measurement in Calibration. EA-4/02, European Co-operation for Accreditation, Utrecht, The Netherlands. 79 pp.
King, L.J., 1990. The National Animal Health Monitoring System: fulfilling a commitment. Prev. Vet. Med. 8, 89-95.
Losinger, W.C., 2002. A look at raking for weight adjustment. Stats: The Magazine for Students of Statistics, 33(1): 8-12.
Metrodata GmbH, 1999. GUM Workbench: The Tool for Expression of Uncertainty in Measurement. Manual for version 1.2 English Edition. Teknologisk Institut, Taastrup, Denmark.
National Institute of Standards and Technology, 1994. Guidelines for evaluating and expressing the uncertainty of NIST measurement results. NIST Technology Note 1297. National Institute of Standards and Technology, Gaithersburg, Maryland, USA.
Neter, J., Wasserman, W., 1974. Applied Linear Statistical Models. Richard D. Irwin, Inc., Homewood, Illinois.
Nicholson, W., 1995. Microeconomic Theory Basic Principles and Extensions, 6th edn. Dryden Press, Fort Worth.
Ott, S.L., Johnson, R., Wells, S.J., 2003. Association between bovine-leukosis virus seroprevalence and herd-level productivity on US dairy farms. Prev. Vet. Med. 61, 249-262.
Ott, S.L., Novak, P.R., 2001. Association of herd productivity and bulk-tank somatic cell counts in US dairy herds in 1996. J. Am. Vet. Med. Assoc. 218, 1325-1330.
Ott, S.L., Rendleman, C.M., 2000. Economic impacts associated with bovine somatotropin (BST) use based on a survey of US dairy herds. AgBioForum 3, 173-180.
Ott, S.L., Seitzinger, A.H., Hueston, W.D., 1995. Measuring the national economic benefits of reducing livestock mortality. Prev. Vet. Med. 24, 203-211.
Ott, S.L., Wells, S.J., Wagner, B.A., 1999. Herd-level economic losses associated with Johne’s disease on US dairy operations. Prev. Vet. Med. 40, 179-192.
Palisade Corporation, 2002. Guide to Using @RISK Risk Analysis and Simulation Add-In Software for Microsoft Excel, Version 4.5. Palisade Corporation, Newfield, New York.
Pollock, S., 2002. Recursive Estimation in Econometrics. Queen Mary University of London, Working Paper No. 462.
Theoharakis, V., Skordia, M., 2003. How do statisticians perceive statistical journals? The American Statistician 57, 115-123.
US Department of Agriculture, Animal and Plant Health Inspection Service, 1996. Part I: Reference of 1996 Dairy Management Practices. USDA:APHIS:VS, Centers for Epidemiology and Animal Health, Fort Collins, Colorado.
US Department of Agriculture, Animal and Plant Health Inspection Service, 1997. Johne’s disease on US dairy operations. USDA:APHIS:VS, Centers for Epidemiology and Animal Health, Fort Collins, Colorado.
Vardeman, S.B., Morris, M.D., 2003. Statistics and Ethics: Some Advice for Young Statisticians. The American Statistician 57, 21-26.
Wineland, N.E., Dargatz, D.A., 1998. The National Animal Health Monitoring System a source of on-farm information. Veterinary Clinics of North America 14, 127-139.
Wohlgenant, M.K., 1989. Demand for farm output in a complete system of demand functions. Am. J. Agric. Econ. 71, 241-252.

Table 1. Model showing associations between explanatory variables and annual value of production and milk production. Standard errors are in parentheses.

Variable                                              Annual value of production   Milk production
                                                      (US$ per cow)                (kg/cow)
--------------------------------------------------------------------------------------------------
BLV prevalence (% seropositive)                       -1.28 (0.49)                 -4.7 (1.7)
Herd size (natural log)                               65.33 (21.57)                220.9 (75.2)
Region
  Midwest                                             Reference                    Reference
  West                                                9.37 (44.95)                 49.3 (156.4)
  Southeast                                           -157.06 (67.69)              -547.8 (220.4)
  Northeast                                           -12.68 (34.23)               -54.0 (117.0)
Bulk-tank somatic cell count (thousands of cells/ml)
  Low (<200)                                          Reference                    Reference
  Medium (200-399)                                    -75.45 (32.26)               -229.9 (109.7)
  High (400+)                                         -261.94 (43.26)              -759.0 (146.5)
Intensive pasture grazing (pastures supply
  >90% of summer forage)                              -107.33 (42.85)              -409.1 (145.7)
% of cows administered rBST
  Square root                                         29.45 (5.06)                 110.7 (16.5)
% Holstein breed                                      7.30 (0.54)                  25.5 (1.9)
Days dry, >70 days                                    -78.63 (39.28)               -280.7 (133.0)
Cows in third lactation
  % of herd                                           6.24 (2.30)                  11.9 (7.9)
  % in excess of 37%                                  -10.91 (3.14)                -36.5 (10.6)
Management practices
  Dimension 1                                         -207.29 (35.87)              -755.2 (123.8)
  Dimension 2                                         -213.24 (52.90)              -867.7 (179.2)
>90% of cows registered                               76.70 (40.57)                193.5 (141.8)
% change in dairy cow inventory                       -9.87 (0.73)                 -4.9 (2.6)
Intercept                                             1139.60 (124.72)             5014.6 (436.5)
R-squared                                             0.534                        0.535
--------------------------------------------------------------------------------------------------

Source: Ott, Johnson and Wells, 2003

Wednesday, April 16, 2014

Defense Manpower Data Center Surveys: A Pile of Excrement

The Defense Manpower Data Center (DMDC) is a little-known government agency whose principal claim to fame is having once included Linda Tripp on its payroll. The DMDC bills itself as "the authoritative source of information on over 42 million people connected to the United States Department of Defense" (DoD).

Ostensibly to support the information needs of the Undersecretary of Defense for Personnel and Readiness (and of other entities affiliated with the DoD), the DMDC maintains a highly aggressive schedule of personnel surveys, such that individuals may be contacted for multiple surveys each year. Information that derives from sample surveys could potentially be useful in formulating personnel policies and decisions. However, the statistical methods used in DMDC surveys are so shitty that really none of the numbers or figures provided in DMDC reports should be trusted at all. The DMDC's poor performance severely affects the DoD's ability to carry out its missions, and reflects gross mismanagement and an egregious waste of funds (millions of dollars per year). That is millions of fucking taxpayer dollars, per year, plus the time of hundreds of thousands of people contacted to participate in the surveys, to produce ridiculous piles of shit that are supposed to satisfy some insipid DoD bigshots' information needs.

The root cause of this hideous waste is toxic and abusive leadership within the DMDC. Toxic leadership is acknowledged to be a severe problem throughout the DoD, but is outrageously so at the DMDC. The DMDC suffers from a lack of professionalism at all levels, but most acutely so within the managerial class. An employee who dares display a modicum of integrity risks being subjected to a “Performance Improvement Plan”, which entails many hours of opprobrious and excruciating supervisory “counseling”, and almost invariably culminates in the employee’s removal from the federal service. The DMDC will simply not be able to correct its statistical mistakes, and begin to generate useful products, until its virulent management troubles have been fully rectified. Unfortunately, the DMDC seems to be run by incorrigible, poopyheaded scoundrels who just don't give a damn.

Deficiencies in DMDC Statistical Methods

From September, 2006, until July, 2008, I held the position of Lead Survey Statistician with the DMDC. My principal duties included providing advice on the quality of databases, and on appropriate methods of statistical analysis for sample surveys. To that end, I identified a number of critical deficiencies with the DMDC's surveys whilst tasked to write statistical methodology reports to describe the sample design, sample selection, weighting, and estimation procedures for the DMDC's 2006 Survey of Active Duty Spouses (2006 ADSS) and for the DMDC's 2006 Survey of Reserve Component Spouses (2006 RCSS).
Briefly:

stratification of the sample without proper consideration of the survey objectives rendered impossible the attainment of reportable information for many desired population groups;

when problems were observed in the sampling, no efforts were made to re-stratify to correct the problems--instead, the decision was made to accept the fact that some of the wanted information would be unattainable;

extreme over-stratification caused many of the sampling strata to have very small numbers of respondents, both expected and actual;

logistic-regression models (used first to adjust sample weights for unknown eligibility and subsequently for survey completion among eligible respondents) contained an extremely large number of explanatory variables (plus two-way crossings), which led to absurd weight adjustments--most weights were adjusted very little (or not at all), while a few weights received enormous adjustments (sometimes more than 100-fold);

no efforts were undertaken either to examine or to mitigate the effects of excessively variable weight adjustments, which can cause severely warped estimates;

post-stratification cells (used in the final post-stratification adjustment to the weights) were inconsistent with the survey objectives;

the post-stratification adjustment demonstrated that many survey estimates following the second logistic-regression-model adjustment were off by quite a lot (as much as 42 percent);

sampling strata were collapsed together to form new “variance strata” for variance estimation, which effected a downward bias in the variance estimates, in order to make the survey results appear more precise and accurate than they actually were. The variance estimates that resulted from the creation of the “variance strata” were inappropriate because the variance estimates did not reflect the actual sampling design. The margins of error presented with the 2006 ADSS survey results grossly misrepresented the actual uncertainty associated with the estimates.
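To give a sense of what highly variable weights do to precision, here is a minimal sketch using Kish's well-known approximation for the design effect due to unequal weighting, n times the sum of squared weights divided by the square of the sum of weights. The weights below are hypothetical and the function names are illustrative; this is not a DMDC procedure, and it ignores stratification and every other feature of the design.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights:
    n * sum(w^2) / (sum(w))^2. Equals 1.0 when all weights are equal."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)


def effective_sample_size(weights):
    """Nominal sample size divided by the Kish design effect."""
    return len(weights) / kish_design_effect(weights)


# Hypothetical: 99 respondents left alone, one weight inflated 100-fold.
weights = [1.0] * 99 + [100.0]

deff = kish_design_effect(weights)
n_eff = effective_sample_size(weights)
print(round(deff, 2), round(n_eff, 2))
```

In this made-up case a single 100-fold adjustment turns 100 respondents into the rough equivalent of about 4, which is why extreme adjustments ought to be examined and, where appropriate, trimmed.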

These findings are summarized in an article that appeared in Armed Forces and Society.

DMDC Management Response

Consequently, on April 5, 2007, my supervisor presented me with a “Letter of Warning”, wherein she stated that I was performing at an "Unacceptable" level in the Critical Job Element "Team Participation and Organizational Orientation." The “Letter of Warning” included the following accusations: "You have attempted, for public consumption, to characterize DMDC statistical methodology as wrong, even dishonest...Specifically, the draft methodology reports received February 8 and March 4 for the 2006 Survey of Reserve Component Spouses and the 2006 Survey of Active Duty Spouses, respectively, were essentially commentaries, critical reviews, evaluations, text book-proposals for redesign--anything but methodology reports that simply described procedures and results. Stratification, allocation, design effects, weighting adjustments, variance estimation were all explained in a manner leading to the conclusion that DMDC's methodologies were wrong. Statements such as these from the February 4 draft of the 2006 RCSS report are absolutely inappropriate: 'The fact that all of the design effects for all of the reporting domains were greater than one probably should have raised an alarm that something was amiss with the design.' 'Creating "variance strata" that are completely different from the sampling strata is not an entirely honest proposition.'"

Another point that my supervisor made in her "Letter of Warning" was that the three statisticians (whose work I was presumably leading) were "highly knowledgeable about the mechanics of the sampling and weighting process...but they are not equipped to make independent statistical judgments." Well, why the bloody Hell not? They all had at least as much education as I did (one of them even possesses a God-damned PhD, for Christ's sake), and many more years of agency experience. Why weren't they equipped to make "independent statistical judgments?" Does she consider them to have been a bunch of incompetent, fucking retards? "You have proposed ideas and identified errors, but often left them without hands-on guidance for execution. You have also passed on to them work that they should not be expected to do independently." If such is the case, then why the fuck should the DMDC want to retain these knuckleheads in their positions?

The “Letter of Warning” established a 90 day period during which I was to be provided with an opportunity to improve my performance “to at least the minimally successful level in this critical element” for me to be retained in my position. Which, of course, was utter bullshit. Government human-resources specialists generally refer to a supervisor's purported expectations surrounding such a 90-day period as a "Performance Improvement Plan" (usually abbreviated to PIP, given their intense fondness for Three Letter Acronyms, or TLA's).

On June 25, 2007, shortly after taunting me about my disability, my supervisor informed me that she had decided to extend "the formal opportunity period to demonstrate at least the minimally acceptable level of performance" from July 25 to August 28. I asked my supervisor about the likely outcome, and she replied with a sneer that it was “going to be a war”, and that the extension to the opportunity period was “just postponing the inevitable.” My supervisor evidently extended the "formal opportunity period" for the sake of affording her the opportunity to compose a few damaging and insolent memoranda, that she could use to support her case for removing me from the federal service, and which she had left neglected during the initial 90-day period. During her extension, my supervisor presented me with a series of strenuously damning, denunciatory and nasty documents, each entitled "Memorandum of Counseling." The August 2, 2007 “Memorandum of Counseling” stated that "DMDC's statistical methodology has been informed by private and academic research organizations and driven by real-world trade-offs. You find fault with many aspects of the methodology; indeed, you raise valid issues, but not one of them is news to DMDC."

Finally (surprise, surprise), on September 21, 2007, my supervisor accorded me a “Proposal of Removal for Unacceptable Performance”, based upon “unacceptable performance in Critical Element 5 of your Performance Plan: Team Participation and Organizational Orientation.”[5] The attack began as follows: “In terms of Critical Element 5, your performance of the above duties and tasks has failed most notably with respect to Organizational Organization (sic erat scriptum); the sub-element Seeks to bring credit to DMDC has been a serious problem. The performance standards describe the unsuccessful level of this element as: Uninterested in DMDC management goals and team decisions. More precisely, the problem has been your failure to accept the body of DMDC’s methodological decisions and to incorporate the resulting practices and conventions into the performance of your duties. This was the problem in February with the 2006 RCSS and 2006 ADSS statistical methodology reports, in which you attacked DMDC's methodology as wrong, even dishonest…Costs of these failures in Organizational Orientation were high…the issues you raise are valid but nothing new; your contribution to re-examining them would be welcome.”

Note that the February time frame of the statistical methodology reports precedes the start of the 90-day period (April 5, 2007) during which I was ostensibly to be afforded the opportunity to “improve my performance to at least the minimally acceptable level.” Moreover, my supervisor’s choice of words, that I had “attacked DMDC’s methodology as wrong, even dishonest”, demonstrates that she believed me to have been reporting gross mismanagement. As sensitive as folks at the DMDC may be to perceived “attacks” on DMDC methodologies, the Whistleblower Protection Act prohibits (at least officially, if not in practice) "any federal employee who has the authority to take, direct others to take, recommend, or approve any personnel action from taking, or threatening to take, a personnel action with respect to any employee because of any disclosure of information by an employee which the employee reasonably believes evidences gross mismanagement."[6] Which isn't to say that anyone at the fucking Office of Special Counsel actually gives a shit.

The “Proposal of Removal for Unacceptable Performance” continued with a long series of libelous comments, vicious tripe, and inane accusations, and concluded with: “DMDC survey products support the personnel information needs of the Under Secretary of Defense for Personnel Readiness (sic) and other DoD customers. The information they provide is critical to formulating a wide range of personnel policies and decisions. Your poor performance has negatively impacted the organization’s ability to carry out this mission. Before proposing this action, we considered the feasibility of assigning you to another position at or below your present grade in lieu of removal from the Federal Service. However, there are no suitable vacancies anywhere in DMDC. Because of the nature of DMDC’s work, the likelihood of your success in any other position would be no greater than in your current position.” What a sweetie.

The basic templates for establishing the paperwork required to remove a career employee from the federal service, based upon alleged “unacceptable performance”, were provided by a human-resources specialist, and my supervisor told me that each document that she had prepared was reviewed and approved by a DoD lawyer. The “Proposal of Removal for Unacceptable Performance” was ultimately found to be ridiculous and unacceptable, and was consequently rescinded, as I successfully rebutted each of her points. It took me a long time (about two weeks), as her proposal went on and on for about 20 pages, and her supporting materials filled a good-sized cardboard box. Her commentary was deliberately hurtful and insulting. But, once I got started, it was as easy as goosing federal managers at a donut-eating contest--just very time-consuming, as I was compelled to compare each accusation against her supporting material. The bulk of her claims consisted of outright lies, dastardly distortions and egregious exaggerations.

For example, she wrote: "In terms of Critical Element 5 your performance has also failed with respect to Team Participation....the problem with the sub-element Cooperates with Others has been that in working with others, you have attempted to pass off your responsibilities to them. This was the problem on April 26 when you requested that the FCAT-M Statistician produce a new allocation reflecting response rates based on email coverage; the Statistician did not know how to compute those rates; in fact it was your responsibility to provide them to the Statistician." In fact, my supervisor's records clearly showed that I did compute those rates, that I did not ask the Statistician to compute those rates, and that I did provide those rates to the Statistician. Moreover, those rates were very simple to compute, and the fucking Statistician should have been able to compute those rates by himself, without help. He had a fucking master's degree, in Statistics. What, indeed, would be the point of keeping such fucking imbeciles on the payroll?

The fucking agency’s lawyer made certain to include verbiage in the Settlement Agreement that I would waive any rights that I might have had, including any entitlement to accommodation, under both the Rehabilitation Act and the Americans with Disabilities Act. The lawyer may have been unaware (and probably didn't care) that the Americans with Disabilities Act doesn't even apply to federal employees. Further, the agency’s lawyer insisted on including: “each party to pay its own attorney’s fees.” My supervisor never paid a dime for the many hours of legal advice and assistance that she received, over a period of several months, from the fucking agency’s lawyer. After deliberately placing me in a position where I needed to hire a lawyer, the very least that the fucking agency could have done would have been to offer to pay my legal fees.

Eventual DMDC Response to Statistical Issues

In what might charitably be described as a pathetic attempt to cover the DMDC's shitty little ass, three very-well-compensated members of the DMDC's payroll conspired together to write a response to my comments that had been published in Armed Forces and Society. Yes, I know, one of the three is actually paid through a God-damned government contractor that skims off a significant profit. But, as far as anyone ought to be concerned, they are all just suckling at the federal teat. Federal employees are barred from lobbying and engaging in other political activities. Private contractors aren't, and consequently carry considerable clout.

They began with the patently false assertion that my observations had been based upon a “preliminary data set with incorrect weighting variables”, rather than a final “publicly-available” data set, and thus lacked “the necessary empirical support to be useful in improving the DMDC survey program.” Utterly reprehensible bullshit. First, the DMDC never makes any data sets "publicly available." The DMDC hardly even makes any survey results "publicly available", even on its fucking website. Second, I did use the right fucking data set. Moreover, their assertion that "it is clear that eligibility status adjustments should be as close to 1.00 as possible" is wrong, and demonstrates an utter lack of understanding of the basic subject matter. Most of the weight adjustments were close to 1.00, which would mean that response rates had to have been very close to 100% in most of the sampling strata. Response rates were nowhere near 100% in any of the fucking sampling strata. The DMDC's absurd weight-adjustment methods simply fucked up whatever credibility the survey might otherwise have had.
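The arithmetic behind that last objection can be sketched in a few lines of Python. This is only a minimal illustration, assuming a simple weighting-class adjustment in which each respondent's weight is multiplied by the inverse of the response rate in the class. The function name and the example rates are hypothetical, and this is not a reproduction of the DMDC's actual procedure.

```python
def nonresponse_adjustment(response_rate):
    """Weighting-class adjustment factor: the inverse of the response rate.

    Assumes the simple textbook adjustment, not any agency's actual method.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response rate must be in (0, 1]")
    return 1.0 / response_rate


# Hypothetical response rates, for illustration only.
for rate in (1.00, 0.95, 0.50, 0.30):
    factor = nonresponse_adjustment(rate)
    print(f"response rate {rate:.0%} -> adjustment factor {factor:.2f}")
```

Under this assumption, an adjustment factor near 1.00 can only come from a response rate near 100%; a 50% response rate already doubles the weight.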

The three very-well-compensated members of DMDC's payroll proceeded to claim that “each iteration of the weighting process goes through vigorous quality control checks to assure accuracy of the weights.” Utter bullshit.

The three very-well-compensated members of DMDC's payroll went on to assert that the DMDC “expends a great deal of effort to ensure that its methods are in line with current research standards” and “is a firm believer and practitioner of continuous process improvement, including enhancements to its statistical methods.” As evidence, they stated that, in 2008, the DMDC began using the Chi-squared Automated Interaction Detector (CHAID) to “identify the best predictors for inclusion in the non-response logistic regression models.” CHAID is a type of decision-tree method (based upon adjusted significance testing) that has been around for more than three decades[28], and is an exploratory technique that is an alternative to multiple linear regression and logistic regression, particularly when the data set is not well-suited to regression analysis.[29] In adjusting survey weights for non-response, CHAID (and other branching algorithms that have been developed more recently) is typically used to form weighting classes directly, thus avoiding the need for logistic-regression models.[30] Using CHAID to identify predictors for inclusion in non-response logistic regression models is highly irregular, and the three very-well-compensated members of DMDC's payroll really ought to reflect upon their approach.

The three very-well-compensated members of DMDC's payroll concluded that the “DMDC takes its mission to collect, analyze and report data to support the military community very seriously. While there may be steps that DMDC can take to advance this mission, the suggestions made in Losinger (2010) are not among them.” The problems that I had pointed out in the DMDC surveys comprised very basic mistakes with regard to issues that are well discussed in the statistics literature, and constitute relatively simple concepts that could be corrected by reviewing standard texts. Neither supreme nor extraordinary intelligence would be required. The clowns at the DMDC are just either too fucking proud or too fucking lazy to do it. The Secretary of Defense should either compel them to make the corrections, or simply shut down the program. The Undersecretary of Defense for Personnel and Readiness could just as well sacrifice a sheep and hire a haruspex to interpret the entrails for the agency's information needs. This would at least save the government millions of dollars that we don't have, and afford the Undersecretary's staff something to eat.

Sunday, July 27, 2008

The Census Bureau's Random Groups Standard Error Estimator

Rather than a simple random sample, the Census Bureau relied upon a systematic sampling method to select sample units (where a sample unit was defined as either a household or a person living in group quarters) for the 1990 Census sample. This made things easier for census takers--instead of trying to develop a random sample of households in their area, they would administer the long form questionnaire to every sixth household (or every eighth household, or every other household, depending on what part of the country they lived in). This procedure should (hopefully) result in a sample that was more reflective of the population as a whole, than a more haphazard scheme.

The standard-error formulas from a few posts back were for a simple random sample. One thing that survey statisticians like to do is to come up with different formulas when the sampling method is other than simple random sampling. Generally, your target is a smaller standard error, which makes your confidence intervals narrower and your survey estimates appear more precise.

For estimation for the 1990 Census, the US was divided into just over 60,000 distinct weighting areas, or areas for which sample weights were derived. A "sample weight" is the number of units in the larger population which a sample unit represents (for weighting purposes). For example, if the sampling ratio for households is one-in-six, then the initial guess at how many population households a sampled household represents is six (itself plus five others). As mentioned previously, everyone answered certain questions (on the Short Form), and a systematic sample was selected for the Long Form. Information from questions which everyone answered was used to adjust the sample weights. For example, since everyone answered the question on "race", the sample weights were adjusted to make certain that the sample estimates of persons by "race" equaled the population count of people by race. This hopefully made the estimates from the sample data more reflective of the population.
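As a minimal sketch of that weighting idea, the Python below starts every sampled person at a weight of six and then rescales the weights within each race group so that the weighted sample total matches the full count. The function name, the sample sizes and the census counts are all hypothetical, and the adjustment is simplified to a single characteristic (the actual procedure adjusted on more than one).

```python
def ratio_adjust(weights_by_group, census_counts):
    """Rescale weights so each group's weighted total equals its census count."""
    adjusted = {}
    for group, weights in weights_by_group.items():
        factor = census_counts[group] / sum(weights)
        adjusted[group] = [w * factor for w in weights]
    return adjusted


# Hypothetical weighting area sampled at one-in-six: initial weight is 6.
sample_weights = {"White": [6.0] * 150, "Black": [6.0] * 20}
# Hypothetical 100-percent counts from the Short Form.
census_counts = {"White": 930, "Black": 110}

adjusted = ratio_adjust(sample_weights, census_counts)
for group, weights in adjusted.items():
    print(group, "adjusted weight:", round(weights[0], 3),
          "weighted total:", round(sum(weights), 3))
```

After the adjustment, the weighted sample totals by race reproduce the census counts exactly, which is the property that matters later when asking what the standard error of a race estimate ought to be.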

For each of the 60,000 weighting areas, a standard error estimate, using the random-groups method, was computed for each of 1,804 sample data items.

Within each weighting area, sample units (a sample unit being either a housing unit or a person residing in a group quarter) were assigned systematically among 25 random groups.

For each of the 25 random groups, a separate estimate of the total for each of 1,804 sample data items was computed by multiplying the weighted count for the sample data item within the random group by 25. For each data item for which the total number of people with a particular characteristic was estimated from the sample data, the random-groups standard error estimate was then computed from the 25 different estimates of the total from the random groups.

This is hard to describe without formulas. You had better look this up in my paper:

http://losinger.110mb.com/documents/Random_Groups.pdf
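For readers who would rather see code than formulas, here is a rough Python sketch of the random-groups idea. It uses the textbook form of the estimator (the variance of the mean of the 25 group-level estimates of the total), which is an assumption on my part here; the exact formula in the paper may include additional factors. The function names and the weighted counts are hypothetical.

```python
import math

K = 25  # number of random groups


def assign_random_groups(n_units, k=K):
    """Systematic assignment: unit i goes to group i mod k."""
    return [i % k for i in range(n_units)]


def random_groups_se(group_counts):
    """Return (estimated total, random-groups standard error).

    group_counts holds the weighted count of the data item in each group.
    Each group's estimate of the total is its weighted count times k.
    """
    k = len(group_counts)
    group_totals = [k * c for c in group_counts]
    estimate = sum(group_totals) / k
    variance = sum((t - estimate) ** 2 for t in group_totals) / (k * (k - 1))
    return estimate, math.sqrt(variance)


# Hypothetical weighted counts of one data item in each of the 25 groups.
counts = [36.0 + (g % 5) for g in range(K)]
estimate, se = random_groups_se(counts)
print("estimated total:", estimate, "random-groups SE:", round(se, 2))
```

The more the 25 group-level estimates disagree with one another, the larger the standard error; if every group gave the same estimate, the standard error would be zero.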

For each data item within a weighting area, a design effect was calculated as the ratio of the random groups standard error estimator to the simple random sampling standard error for a one-in-six random sample (mentioned in my previous post).
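A minimal sketch of that ratio in Python follows. Note that the design effect here is a ratio of standard errors, as described above, rather than the ratio of variances that the term often denotes elsewhere. The simple-random-sampling formula used is the one-in-six result sqrt(5 x Np x (1 - Np/N)); the function names and the numbers are hypothetical.

```python
import math


def srs_se_one_in_six(estimated_total, population_size):
    """SRS standard error of an estimated total under one-in-six sampling."""
    return math.sqrt(
        5 * estimated_total * (1 - estimated_total / population_size)
    )


def design_effect(random_groups_se, estimated_total, population_size):
    """Ratio of the random-groups SE to the one-in-six SRS SE."""
    return random_groups_se / srs_se_one_in_six(estimated_total, population_size)


# Hypothetical data item: estimated total of 500 in a weighting area of 1,000,
# with a random-groups standard error of 53.0.
print(round(design_effect(53.0, 500, 1000), 3))
```

A value above one says the random-groups estimate is less flattering than the simple-random-sampling formula; a value below one says it is more flattering.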

For a state report of sample data, the design effects for each data item were averaged across the weighting areas in the state. Then, a generalized design effect for each data item type (for example, all data items that dealt with occupation) was computed. The generalized design effect was weighted in favor of data items that had higher population estimates.

In my paper, I present a hypothetical example of data that might have arisen from the random-groups method. For a weighting area in Vermont, weighted counts of Whites and Blacks are listed for the 25 random groups. In my hypothetical weighting area, there are no persons of other race. The standard errors assuming simple random sampling are the same for Whites and Blacks (as one would expect for a binomial variable). However, the random-groups standard error estimate is much higher for Whites than for Blacks. And, the design effect is nearly five times higher for the estimate of Whites than the estimate of Blacks. Since the generalized design effect computed for groups of data items was weighted in favor of data items that had higher population estimates, the generalized design effect computed for race for the state of Vermont was quite high.
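To see how that weighting pulls the generalized design effect toward the larger group, here is a small Python sketch. It assumes the weights are simply proportional to the population estimates, which is my simplification for illustration; the function name and the numbers are hypothetical and are not taken from the paper.

```python
def generalized_design_effect(items):
    """Weighted mean of design effects.

    items is a list of (design_effect, population_estimate) pairs; each
    design effect is weighted by its population estimate (an assumption).
    """
    total_estimate = sum(estimate for _, estimate in items)
    weighted_sum = sum(effect * estimate for effect, estimate in items)
    return weighted_sum / total_estimate


# Hypothetical race items: a large group with a high design effect and a
# small group with a low one.
race_items = [(2.5, 9000), (0.5, 1000)]
print(generalized_design_effect(race_items))
```

With these made-up numbers the unweighted average of the two design effects would be 1.5, but the weighted version lands much closer to the large group's value.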

Data on race were frequently included in 1990 U.S. census sample data products. Because race was asked of every census respondent (i.e., it was a census 100-percent data item), and because the weighting process used by the Census Bureau effectively forced the sample estimates by race to match the 100-percent Census counts by race, the standard errors for estimates of race probably should have been considered to be zero. However, generalized design effects were still published by race, although set to arbitrary constants for all reports (rather than as computed by this method).

More on my proposed modification next time.

Saturday, July 26, 2008

Appendices provided for the 1990 Census

Appendix C, Accuracy of the Data is available here:

http://factfinder.census.gov/metadoc/stf3appc.pdf

Table C was left off. For Vermont, Table C is available in Table 1 of my paper:

http://losinger.110mb.com/documents/Random_Groups.pdf

Data users were instructed to determine a standard error for their Census estimate of interest, based upon a 1-in-6 sampling rate. They could accomplish this either by using a table provided in Appendix C, or by using a formula provided in Appendix C (similar to the formula above).

Then, data users were to multiply this result by a Design Factor provided in Table C. The Design Factor varied depending on the actual sampling rate (as mentioned in the Appendix, actual sampling rates varied from 1-in-8 to 1-in-2).

The Design Factor was an average of the ratios of the Random Groups standard-error estimator to the simple-random-sampling standard error (based on a 1-in-6 sampling ratio).
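The two steps that data users were asked to carry out can be sketched in Python as follows. This is only an illustration: the unadjusted standard error uses the one-in-six formula sqrt(5 x Y x (1 - Y/N)) for an estimated total Y in an area of N persons, and the design factor of 1.2 is a made-up value, not one taken from Table C.

```python
import math


def adjusted_standard_error(estimated_total, population_size, design_factor):
    """Unadjusted one-in-six standard error multiplied by a design factor."""
    unadjusted = math.sqrt(
        5 * estimated_total * (1 - estimated_total / population_size)
    )
    return design_factor * unadjusted


# Hypothetical: an estimate of 500 persons in an area of 1,000 persons,
# with a design factor of 1.2 looked up for the relevant data item type.
print(round(adjusted_standard_error(500, 1000, 1.2), 2))
```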

More on the Random Groups standard-error estimator next time.

Friday, July 25, 2008

The Finite Population Correction Factor

When making survey estimates for a finite population, the standard error is reduced by multiplying the standard error (from the previous post) by a Finite Population Correction Factor (FPC), which is:

FPC = sqrt((N - n)/N)

where N is the population size and n is the sample size. If your survey includes everyone in the population, a.k.a. a census, then N = n and your FPC = 0. Thus, your standard error is zero, because you know everything about everyone, and there is no uncertainty.
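A quick Python sketch of the FPC, with hypothetical population and sample sizes, shows how the factor behaves between the two extremes of a tiny sample and a full census. The function name is mine.

```python
import math


def fpc(population_size, sample_size):
    """Finite population correction factor: sqrt((N - n) / N)."""
    if not 0 < sample_size <= population_size:
        raise ValueError("need 0 < n <= N")
    return math.sqrt((population_size - sample_size) / population_size)


# Hypothetical population of 6,000 with increasingly large samples.
for n in (60, 600, 1000, 3000, 6000):
    print(f"n = {n:>4}: FPC = {fpc(6000, n):.4f}")
```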

In practice, the FPC can be ignored when the sampling fraction is less than 10%. As a general rule, however, survey research organizations do not like to pass up an opportunity to make their estimates look better, and will not ignore the FPC.

For the 1990 Census, everyone responded to a certain number of questions, all of which were contained on the "Short Form." A sample of the population--the sampling rate varied across the country, but in most parts of the country the sampling rate was one-in-six--received a "Long Form" questionnaire. They answered all of the questions that were contained on the Short Form, plus more detailed questions on a variety of topics, such as how they commuted, what their occupation was, etc.

Including the FPC, the standard error for the estimate of the number of Whites becomes:

se(Nw) = FPC x N x sqrt(p x q/n)

= sqrt((N - n)/N) x N x sqrt(p x q/n)

= sqrt(1 - (n/N)) x N x sqrt(p x q/n)

Taking the sampling rate to be 1/6, or n/N = 1/6, this becomes:

se(Nw) = sqrt(1 - (1/6)) x N x sqrt(p x q/n)

= sqrt(5/6) x (N/sqrt(n)) x sqrt(p x q)

= sqrt(5/6) x sqrt(N/n) x sqrt(N) x sqrt(p x q)

= sqrt(5/6) x sqrt(6) x sqrt(Np) x sqrt(1 - p)

= sqrt(5 x Np x (1-p))

= sqrt(5 x Np x (1 - Np/N))

which is basically the result in the paper for the one-in-six simple-random-sample result. It may come out looking a bit more complicated than it needs to look.
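If you would like to convince yourself that the algebra works, the following Python sketch evaluates the first and last lines of the derivation for one set of numbers and checks that they agree. The values of N, n and p are hypothetical, chosen so that n/N = 1/6.

```python
import math


def se_total_with_fpc(N, n, p):
    """First line of the derivation: FPC x N x sqrt(p x q / n)."""
    fpc = math.sqrt((N - n) / N)
    return fpc * N * math.sqrt(p * (1 - p) / n)


def se_total_one_in_six(N, p):
    """Last line of the derivation: sqrt(5 x Np x (1 - Np/N))."""
    estimated_total = N * p
    return math.sqrt(5 * estimated_total * (1 - estimated_total / N))


N, n, p = 6000, 1000, 0.9  # hypothetical; n / N = 1/6
long_form = se_total_with_fpc(N, n, p)
short_form = se_total_one_in_six(N, p)
print(round(long_form, 4), round(short_form, 4))
```

The two functions only agree when the sampling rate really is one-in-six; for any other rate, the constant 5 in the short form would have to change.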

Thursday, July 24, 2008

I will begin my blogging by first discussing the Random Groups Standard Error Estimation procedure, used by the US Census Bureau in the 1990 Census (and in previous Censuses), and which I described here: http://losinger.110mb.com/documents/Random_Groups.pdf

In keeping with the example of my paper, suppose that there is a geographic area that has people of only two races: White and Black. You know the total number of people. You want to get an idea of how many are White and how many are Black, without asking everyone in the area. So, you ask a sample of the people in the area whether they are White or Black. Within your sample, you know the proportion of Whites and Blacks. To estimate the total number of Whites in the area, you simply multiply the proportion of Whites in your sample by the total number of people in the area. Similarly for Blacks.

Suppose that P represents the true proportion of Whites in the population, and that Q represents the true proportion of Blacks in the population. The population of this area consists only of Blacks and Whites. Hence, the two proportions add to one (P + Q = 1).

If your sample was a simple random sample of the population, then your best estimate of P is p, i.e. the number of Whites in your sample, divided by your total sample size:

p = nw / n

where nw is the number of Whites in your sample, and n is the total number of people (both Black and White) in your sample. Similarly,

q = nb / n,

where nb is the number of Blacks in your sample. Note here that p + q = 1.

Your estimate of the total number of Whites (Nw) in the population is N x p, and your estimate of the total number of Blacks (Nb) in the population is N x q.

The basic formula for the standard error of p is sqrt(p x q/n), which is the same as the formula for the standard error of q.

Sorry I don't have the ability to create and edit equations on this blog. Take sqrt to mean the square root.

If N is a fixed constant, then the standard error of the estimate of the total number of Whites in the population is:

se(Nw) = N x se(p) = N x sqrt(p x q/n).

The standard error of the estimate of the total number of Blacks in the population turns out to be identical:

se(Nb) = N x se(q) = N x sqrt(p x q/n).
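Here is the same calculation as a short Python sketch, with made-up numbers (a population of 6,000, a sample of 1,000, of whom 900 are White). It ignores the finite population correction, just as the formulas above do, and the function name is mine.

```python
import math


def estimate_total_and_se(N, n, n_with_trait):
    """Return (estimated total, standard error) for a simple random sample."""
    p = n_with_trait / n
    q = 1 - p
    return N * p, N * math.sqrt(p * q / n)


# Hypothetical area: N = 6,000 people, sample of n = 1,000.
whites, se_whites = estimate_total_and_se(6000, 1000, 900)
blacks, se_blacks = estimate_total_and_se(6000, 1000, 100)
print("Whites:", round(whites), "se:", round(se_whites, 2))
print("Blacks:", round(blacks), "se:", round(se_blacks, 2))
```

The two standard errors come out the same because swapping p and q leaves the product p x q unchanged.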

However, one should bear in mind that, in sampling theory, there is something called the finite population correction factor, which comes into play especially when your sample is relatively large compared to your population. If you completed a census of everyone in your population, then you know exactly the number of Whites and Blacks in your population. Your numbers are not based on sampling, and there is no sampling error.

More on the Finite Population Correction Factor next time.

Tuesday, July 22, 2008

Welcome to my Blog

You can find details about me and my background by clicking here: http://losinger.110mb.com/

I am presently offering my services as a consultant, with expertise in statistics, economics and epidemiology, after most of a lifetime spent gathering this expertise.

My goal in this blog is to provoke thoughts, and to present ideas and information, that will be of interest to statisticians and to others who use statistics and statistical methods, whether applying analytic techniques or simply dealing with statistical stuff. Hence, I'm hoping that economists, epidemiologists, physicists, political scientists, housewives who watch Oprah, and others will find my blog to be intriguing.

I will talk not only about statistical methods, but also about the political and social structures to which people who perform statistical functions must, in one way or another, adapt.