Ruling a diagnosis in or out with "SpPIn" and "SnNOut": a note of caution
BMJ (British Medical Journal)
1 Division of Epidemiology and Biostatistics, Department of Social and Preventive Medicine, University of Bern, Switzerland, 2 Department of General Internal Medicine, Inselspital, University of Bern, Switzerland, 3 Basel Institute for Clinical Epidemiology, University Hospitals, Basel, Switzerland, 4 MRC Health Services Research Collaboration, Department of Social Medicine, University of Bristol, Bristol
Correspondence to: M Egger, Department of Social and Preventive Medicine, University of Bern, Finkenhubelweg 11, CH-3012 Berne, Switzerland egger@ispm.unibe.ch
Introduction
In this article, we examine examples of test evaluation studies that websites and a textbook of evidence based medicine4 have cited as showing that the tests had SpPIn or SnNOut properties. The studies were chosen to illustrate methodological issues. We assessed the quality of studies as described elsewhere,8 extracted the two-by-two table from the original publication, and calculated likelihood ratios and post-test probabilities with exact binomial 95% confidence intervals based on the pretest probabilities observed in the studies. Finally, we examined whether the post-test probability of the condition in question in the population studied was compatible with the notion of safely ruling the condition in or out, and considered the transferability of study results to other settings and populations. Tables 1 and 2 summarise the study characteristics and results from our critical appraisal.
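The appraisal calculations described above (two-by-two table, likelihood ratios, post-test probabilities from the observed pretest probability) can be sketched in a few lines. The counts below are hypothetical, chosen only to illustrate the arithmetic, and are not taken from any of the studies in tables 1 and 2:

```python
def two_by_two_summary(tp, fp, fn, tn):
    """Sensitivity, specificity, likelihood ratios, and post-test
    probabilities after a positive and a negative result, using the
    pretest probability (prevalence) observed in the table itself."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)          # LR+ = sensitivity / (1 - specificity)
    lr_neg = (1 - sens) / spec          # LR- = (1 - sensitivity) / specificity
    pretest = (tp + fn) / (tp + fp + fn + tn)
    odds = pretest / (1 - pretest)
    post_pos = odds * lr_pos / (1 + odds * lr_pos)
    post_neg = odds * lr_neg / (1 + odds * lr_neg)
    return sens, spec, lr_pos, lr_neg, pretest, post_pos, post_neg

# Hypothetical study: 90 true positives, 20 false positives,
# 10 false negatives, 180 true negatives
sens, spec, lr_pos, lr_neg, pretest, post_pos, post_neg = \
    two_by_two_summary(tp=90, fp=20, fn=10, tn=180)
print(f"sensitivity {sens:.0%}, specificity {spec:.0%}")
print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
print(f"pretest {pretest:.0%}, post-test(+) {post_pos:.0%}, post-test(-) {post_neg:.0%}")
```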
Table 1 Characteristics of test evaluation studies interpreted as demonstrating SnNOut properties (high sensitivity and negative result rules out)
Table 2 Characteristics of test evaluation studies interpreted as demonstrating SpPIn properties (high specificity and positive result rules in)
Box 2: CAGE questionnaire for detecting alcohol misuse
Have you ever felt you should Cut down on your drinking?
Have people Annoyed you by criticising your drinking?
Have you ever felt bad or Guilty about your drinking?
Have you ever had a drink first thing in the morning to steady your nerves or to get rid of a hangover (Eye opener)?
Random error and bias
A diagnostic study may be too small to define test performance with sufficient precision. For example, a website2 and the textbook4 interpreted a study of ankle swelling in patients with suspected ascites10 as demonstrating SnNOut properties: the absence of a history of ankle swelling is assumed to rule out ascites.2 4 However, the study included only 15 patients with ascites, and the confidence intervals were correspondingly wide: the 95% confidence interval for sensitivity extended down to 68%. At that sensitivity, absence of ankle swelling is still compatible with a 15.8% probability of ascites, which is clearly unacceptably high (table 1).
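How wide such intervals become with only 15 diseased patients can be checked with a pure-Python sketch of the exact (Clopper-Pearson) binomial interval. The split of 14 detected out of 15 is our assumption for illustration, chosen because it is consistent with a lower bound near the 68% quoted above:

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact binomial (Clopper-Pearson) confidence interval for x/n,
    found by bisection on the binomial tail probabilities."""
    def solve(pred):
        lo, hi = 0.0, 1.0
        for _ in range(60):            # bisection to ~1e-18 precision
            mid = (lo + hi) / 2
            if pred(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # lower bound: p at which P(X >= x | p) equals alpha/2
    lower = 0.0 if x == 0 else solve(lambda p: 1 - binom_cdf(x - 1, n, p) <= alpha / 2)
    # upper bound: p at which P(X <= x | p) equals alpha/2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper

# Illustrative counts: 14 of 15 patients with ascites had ankle swelling
lower, upper = clopper_pearson(14, 15)
print(f"sensitivity {14/15:.0%}, exact 95% CI {lower:.1%} to {upper:.1%}")
```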
Studies with methodological flaws tend to overestimate the accuracy of diagnostic tests.21 Bias can be introduced when tests are evaluated in patients known to have the disease and in people known to be free of it, so-called diagnostic case-control studies. In this situation patients with borderline or mild expressions of the disease, and patients with conditions mimicking the disease, are excluded, which can exaggerate both sensitivity and specificity.21 This is called spectrum bias because the spectrum of study patients is not representative of patients seen in practice. For example, the textbook considered auscultatory percussion in the diagnosis of pleural effusion to be a SpPIn test.4 This assessment was based on a study that compared patients selected for the presence or absence of radiological signs of effusion.16 The impressive results (100% specificity and 96% sensitivity (table 2)) may therefore not be reliable. The textbook and a website22 also claim that the presence of retinal vein pulsation on ophthalmoscopy excludes increased intracranial pressure (a SnNOut test). This claim is based on a study that compared patients known to have increased intracranial pressure with people not suspected of having it.11
Partial verification bias may be introduced when the reference test or tests are not applied consistently to confirm negative results of the index test, so that some patients are either excluded or simply assumed to be true negatives. Exclusion may lead to overestimation of sensitivity and underestimation of specificity, whereas counting unverified patients as true negatives may lead to overestimation of both sensitivity and specificity.21 The textbook considered the CAGE questionnaire for diagnosing alcohol misuse (box 2) to be a SpPIn test.4 This is based on a study that subjected only a fraction of CAGE-negative persons to further testing (liver enzymes, medical record review, and physician interviews (table 2)),17 thus possibly introducing bias.
Similarly, incorporation bias may be present if the test under evaluation is also part of the reference test.21 This will lead to overestimation of test accuracy because experimental and reference tests are no longer independent. For example, a website listed abdominojugular reflux for the diagnosis of congestive heart failure as a SpPIn test,22 on the basis of a study that used clinicoradiographic criteria, including abdominojugular reflux, as the reference test (table 2).18
Sensitivity and specificity
The likelihood ratio associated with a negative test result does not depend on sensitivity alone, as the SnNOut rule suggests, but also on specificity. For example, a website considered that the clinical criteria for the diagnosis of Alzheimer's disease had SnNOut properties2 based on a sensitivity of 93% (table 1).12 However, despite this high sensitivity, the likelihood ratio of a negative test was a modest 0.3, because of the test's low specificity of 23% ((100% - 93%)/23% ≈ 0.3, see box 1). Indeed, in the population studied, the probability of Alzheimer's disease, given a negative test, was 25% (table 1). The power to rule out a diagnosis thus depends on both sensitivity and specificity.
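The arithmetic for this example, using the negative likelihood ratio formula from box 1. The 50% pretest probability in the second step is illustrative, not the prevalence observed in the study:

```python
# LR- = (1 - sensitivity) / specificity  (box 1)
sens, spec = 0.93, 0.23
lr_neg = (1 - sens) / spec
print(f"LR- = {lr_neg:.2f}")          # modest, despite the 93% sensitivity

# A LR- of about 0.3 shifts the odds only slightly. Even from an
# illustrative 50% pretest probability, a negative result leaves a
# substantial post-test probability:
pretest = 0.50
post_odds = pretest / (1 - pretest) * lr_neg
print(f"post-test probability after a negative test = {post_odds / (1 + post_odds):.0%}")
```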
Similarly, the ability to rule in depends not only on specificity, as suggested by the SpPIn rule, but also on sensitivity. A study examining the presence of a third heart sound in the diagnosis of congestive heart failure (table 2)19—which a website interpreted as demonstrating SpPIn properties23—is an example of a highly specific test (99%) that suffers from a low sensitivity (24%). The figure shows how the power to rule a disease in or out is eroded when highly specific tests are not sufficiently sensitive, or highly sensitive tests are not sufficiently specific.
Ability of a test to rule in disease (measured by the likelihood ratio of a positive test) as a function of specificity and sensitivity (upper panel), and ability of a test to rule out disease (measured by the inverse of the likelihood ratio of a negative test result) as a function of sensitivity and specificity (lower panel)
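The erosion the figure describes follows directly from the likelihood ratio formulas in box 1. The sweeps below come straight from those formulas, with the fixed values chosen for illustration rather than taken from any single study:

```python
# Ruling-in power (LR+) at a fixed 99% specificity, as sensitivity falls:
spec = 0.99
for sens in (0.95, 0.50, 0.24, 0.05):
    print(f"sensitivity {sens:.0%}: LR+ = {sens / (1 - spec):.0f}")

# Ruling-out power (1/LR-) at a fixed 95% sensitivity, as specificity falls:
sens = 0.95
for spec in (0.95, 0.50, 0.23, 0.05):
    print(f"specificity {spec:.0%}: 1/LR- = {spec / (1 - sens):.1f}")
```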
Transferability and applicability
The performance of a diagnostic test often varies considerably from one setting to another. This may be due to differences in the definition of the disease, in the exact nature and calibration of the test, and in the characteristics of those with and without the disease in a given setting.21 For example, patients attending primary care practices will generally have disease at an earlier stage than patients in secondary and tertiary care, which may reduce a test's sensitivity. Patients free of the disease in tertiary care will tend to have other conditions, which could reduce the specificity of a diagnostic test. Interpreting data on a test's accuracy thus requires defining the exact nature of the test used, the disease, and the patient population studied. For example, the website that listed the CAGE questionnaire as a SpPIn test for alcohol dependence2 cited as evidence a study performed in black women admitted to a trauma centre in the United States,20 which may not be applicable to other populations and settings.
Even if we assume that sensitivity and specificity do not change between settings and patient populations, test results will have different interpretations depending on whether a test is performed in a low risk population, such as in primary care, or in high risk patients in a referral centre. For example, in the study evaluating the third heart sound in the diagnosis of heart failure,19 the pretest probability, or prevalence, in a general practice setting was 16%. In this situation a positive test with a likelihood ratio of 18 will not allow the diagnosis to be ruled in with confidence: the post-test probability increases only to 77% (table 2). If the pretest probability were 50%, however, such as in a cardiology outpatient clinic, the same positive test would produce a post-test probability of 95% (see box 1 for the formula).
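The odds conversion in box 1 makes this contrast explicit. A short sketch, using the 16% and 50% pretest probabilities and the likelihood ratio of 18 given above:

```python
def post_test(pretest, lr):
    """Convert a pretest probability to a post-test probability
    via odds: post-test odds = pretest odds x likelihood ratio."""
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

print(f"general practice, pretest 16%: post-test {post_test(0.16, 18):.0%}")
print(f"cardiology clinic, pretest 50%: post-test {post_test(0.50, 18):.0%}")
```

The same positive result thus carries very different meanings in the two settings, which is why a likelihood ratio cannot be interpreted without a pretest probability.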
The interpretation of studies will be strongly influenced by the nature of the condition and the invasiveness of further investigations. For example, a study assessing urinary albumin:creatinine ratios below 1.8 g/mol for ruling out microalbuminuria in men with type 2 diabetes in primary care13 and the study examining the absence of a history of ankle swelling for ruling out ascites in men admitted to general internal medicine wards10 both produced post-test probabilities of about 3%. In the first case, we accepted a website's conclusion that the urinary albumin:creatinine ratio had SnNOut properties2: the post-test probability of microalbuminuria was sufficiently low with a negative test result, considering that guidelines recommend regular testing of patients with type 2 diabetes.24 In the second case, however, and unlike the textbook,4 we thought that a 3% probability of ascites (a sign often associated with serious conditions) was still too high in men with suspected ascites but no history of ankle swelling, and that sonography should be used to rule the diagnosis in or out.25 As mentioned above, a further problem with this study is its small sample size, which resulted in wide confidence intervals.
Conclusions
References
1. Stiell IG, Greenberg GH, McKnight RD, Nair RC, McDowell I, Worthington JR. A study to develop clinical decision rules for the use of radiography in acute ankle injuries. Ann Emerg Med 1992;21: 384-90.
2. Centre for Evidence-Based Medicine, University Health Network, Mount Sinai Hospital Toronto. Sensitivity & specificity (SnNouts and SpPins). www.cebm.utoronto.ca/glossary/spsn.htm (accessed March 2004).
3. Fagan TJ. Nomogram for Bayes theorem. N Engl J Med 1975;293: 257.
4. Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine. How to practice and teach EBM. New York: Churchill Livingstone, 2000.
5. Barry HC, Ebell MH. Test characteristics and decision rules. Endocrinol Metab Clin North Am 1997;26: 45-65.
6. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology, a basic science for clinical medicine. Boston: Little Brown, 1992.
7. Schwartz A, Hupert J. Medical students' application of published evidence: randomised trial. BMJ 2003;326: 536-8.
8. Horvath AR, Pewsner D. Systematic reviews in laboratory medicine: principles, processes and practical considerations. Clin Chim Acta 2004;342: 23-39.
9. Auleley GR, Kerboull L, Durieux P, Cosquer M, Courpied JP, Ravaud P. Validation of the Ottawa ankle rules in France: a study in the surgical emergency department of a teaching hospital. Ann Emerg Med 1998;32: 14-8.
10. Simel DL, Halvorsen RA Jr, Feussner JR. Quantitating bedside diagnosis: clinical evaluation of ascites. J Gen Intern Med 1988;3: 423-8.
11. Levin BE. The clinical significance of spontaneous pulsations of the retinal vein. Arch Neurol 1978;35: 37-40.
12. Varma AR, Snowden JS, Lloyd JJ, Talbot PR, Mann DM, Neary D. Evaluation of the NINCDS-ADRDA criteria in the differentiation of Alzheimer's disease and frontotemporal dementia. J Neurol Neurosurg Psychiatry 1999;66: 184-8.
13. Bakker AJ. Detection of microalbuminuria. Receiver operating characteristic curve analysis favors albumin-to-creatinine ratio over albumin concentration. Diabetes Care 1999;22: 307-13.
14. Bachmann LM, Kolb E, Koller MT, Steurer J, ter Riet G. Accuracy of Ottawa ankle rules to exclude fractures of the ankle and mid-foot: systematic review. BMJ 2003;326: 417.
15. McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer's disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer's Disease. Neurology 1984;34: 939-44.
16. Guarino JR, Guarino JC. Auscultatory percussion: a simple method to detect pleural effusion. J Gen Intern Med 1994;9: 71-4.
17. Bush B, Shaw S, Delbanco TL, Aronson MD. Screening for alcohol abuse using the CAGE questionnaire. Am J Med 1987;82: 231-5.
18. Marantz PR, Kaplan MC, Alderman MH. Clinical diagnosis of congestive heart failure in patients with acute dyspnea. Chest 1990;97: 776-81.
19. Davie AP, Francis CM, Caruana L, Sutherland GR, McMurray JJ. Assessing diagnosis in heart failure: which features are any use? QJM 1997;90: 335-9.
20. Cherpitel CJ, Clark WB. Ethnic differences in performance of screening instruments for identifying harmful drinking and alcohol dependence in the emergency room. Alcohol Clin Exp Res 1995;19: 628-34.
21. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140: 189-202.
22. Centre for Evidence Based Medicine, University of Oxford. SpPins and SnNouts. www.cebm.net/sppins_snnouts.asp (accessed January 2003).
23. Centro per la formazione permanente e l'aggiornamento del personale del servizio sanitario (CEFPAS), Caltanissetta, Italy. http://www.cefpas.it/ebm/diagnosis/dia_appraising.htm (accessed March 2004).
24. National Electronic Library for Health. Full-text guideline collection. Clinical guidelines for type 2 diabetes: renal disease—prevention and early management. www.nelh.nhs.uk/guidelinesdb/html/fulltextsummary/RcgpRenal.html (accessed March 2004).
25. Williams JW Jr, Simel DL. The rational clinical examination. Does this patient have ascites? How to divine fluid in the abdomen. JAMA 1992;267: 2645-8.
26. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ 2003;326: 41-4.
(Daniel Pewsner, senior re)