Experts and Monkeys

Posted on June 30, 2011


The assessment of exposure and the estimation of exposure intensities or patterns is an intrinsically difficult exercise. Not only do levels of exposure differ over time and across space; in many cases the components themselves also change, for example when different sources of specific exposures are turned on or off, or during the formation of secondary aerosols in the general environment resulting from reactions of atmospheric gases and condensation processes. And this is for those situations where data from actual measurements are available to provide some indication of exposure. Unfortunately, that is not always the case, which can cause serious headaches, especially in retrospective studies. A possible solution is to use professional judgment, or ‘experts’, to estimate exposure levels for different situations or time periods, or to provide a ranking of subjects, tasks or jobs, based on limited information only (obviously, if there were enough information, the whole expert evaluation would not be required). But how good are these estimates? Being one of those experts (presumably depending on whom you’d ask, I imagine), I like to imagine that “we” are at least better than the proverbial “monkey throwing darts at a dartboard” (or, in more scientific terms, drawing a random number) – although, as data have shown, not by very much.

For illustration, here is a recently published study by scientists from the US National Cancer Institute and the Shanghai Municipal Center for Disease Control (here). They had six experts (three occupational hygienists and three occupational physicians) estimate occupational exposure to dust among foundry workers and textile workers, based on a relatively crude measure (a self-reported questionnaire on occupational factors and history) and on a somewhat more sophisticated questionnaire that included additional specific questions about tasks and work locations in current jobs. The authors compared how well the experts agreed with each other (expressed by the intraclass correlation coefficient (ICC); more info here) and how well the estimates correlated with “real” measurement data.
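For readers unfamiliar with the statistic, the ICC compares the variation between subjects with the variation among raters scoring the same subject. A minimal sketch of the one-way random-effects version, ICC(1), using made-up dust-exposure scores (the data and rater counts below are illustrative, not the paper's):

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC(1) for an n-subjects x k-raters array.
    High values mean raters largely agree relative to between-subject spread."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subject_means = ratings.mean(axis=1)
    # One-way ANOVA mean squares: between subjects and within subjects
    ms_between = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - subject_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical exposure scores from 3 raters on 5 jobs
scores = [[2, 3, 2],
          [4, 4, 5],
          [1, 1, 2],
          [3, 4, 3],
          [5, 5, 4]]
print(round(icc_oneway(scores), 2))  # close agreement gives an ICC near 1
```

The paper reports several ICC variants; this one-way form is just the simplest way to see what the number measures.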

So how good are “we”? Well, the good news is: not too bad actually. And we definitely beat that monkey and his darts…well, mostly; especially when we take the group averages of the assessments instead of the individual assessments. Attaching numbers to it, the ICC ranges from 0.47 to 0.72 for the foundry workers and from 0.34 to 0.65 for the textile workers for individual assessments, but these go up to 0.84-0.94 and 0.76-0.92 when grouped assessments are used. Interestingly, these data further show, counter-intuitively, that more information doesn’t actually help. On the contrary, the extra information from the additional, more sophisticated questionnaire seems to increase the “expert confusion”, resulting in reduced ICCs.

The correlations between the actual exposure measurements and the individual raters’ estimates are not very convincing, with correlations (Spearman’s, for those interested) of about 0.65 for the foundry workers and about 0.40 for the textile workers (again being somewhat better when less information is provided). However, when aggregated ratings are used the correlations improve to about 0.70 and 0.50; so not bad really…
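Why does pooling raters help? If each expert's estimate is roughly the true exposure plus independent error, averaging the experts shrinks the error. A toy simulation (assumed noise model and parameters of my own choosing, not the study's data) makes the effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_raters = 200, 6
true_exposure = rng.normal(size=n_subjects)
# Each rater sees the truth plus independent noise of the same magnitude
ratings = true_exposure[:, None] + rng.normal(size=(n_subjects, n_raters))

def spearman(x, y):
    """Spearman rank correlation (no ties expected for continuous data)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

single = np.mean([spearman(ratings[:, j], true_exposure)
                  for j in range(n_raters)])
pooled = spearman(ratings.mean(axis=1), true_exposure)
print(f"mean single-rater rho: {single:.2f}, pooled rho: {pooled:.2f}")
```

With these assumptions the pooled rating correlates noticeably better with the “truth” than any single rater, which is essentially the pattern the study reports.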

There are some other interesting findings, such as (not surprisingly) that it is easier to put people in the right exposure class when there are only two classes rather than four, but please read the paper (here). So what can we learn from this? Besides the comforting fact that we experts are, in most cases, somewhat better than monkeys with darts, the authors emphasize that using more experts will reduce misclassification and improve the assessment. So essentially, never trust “single experts”. To me, that seems to be a useful lesson for life in general…