Some of the theories around human judgment and decision-making suggest that humans find it hard to be consistent. Every so often I’ve had a chance to assess my own consistency, and confirmed that yes, I’m very human! Tonight I had a chance to re-test that theory on myself.
I just finished reviewing 11 conference abstracts. There are some good topics I’m looking forward to, and I hope the other reviewers and the Program Committee agree with me about accepting them. Reviewing and submitting my ratings on each abstract took me an average of 17 minutes. Since I did them in my spare time, in subsets over the past two weeks, it seemed possible I hadn’t judged the first few quite the same way as I did the rest, due to ‘decision fatigue’ or other factors. So before I called my opinions ‘final’, I decided to do a quick check on whether I’d been consistent.
The review process calls for 3 binary yes/no judgments, where a ‘no’ indicates a submission isn’t a good fit in some way; 6 Likert-scale ratings from 1-5, where 1=strongly disagree, 2=disagree, 3=neutral, 4=agree, 5=strongly agree; specific comments to the author and to the committee; and a final vote (reject, weak reject, neutral, weak accept, accept). So I put all 11 entries into a spreadsheet, with my yes/no judgments, my 1-5 ratings, and my votes.
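That review record is easy to model as a small data structure, which also makes checks like the ones below simple to script. A minimal sketch in Python — the field names and the sample record are my own invention, not the conference’s:

```python
# A sketch of one review record, with hypothetical field names.
from dataclasses import dataclass

VOTES = ("reject", "weak reject", "neutral", "weak accept", "accept")

@dataclass
class Review:
    fit_checks: list   # 3 binary yes/no judgments (True = yes)
    ratings: list      # 6 Likert scores, 1-5; the 6th is "overall"
    comments: str      # feedback to the author and the committee
    vote: str          # one of VOTES

    def __post_init__(self):
        # Basic sanity checks on the shape of the record.
        assert len(self.fit_checks) == 3
        assert len(self.ratings) == 6
        assert all(1 <= r <= 5 for r in self.ratings)
        assert self.vote in VOTES

# A made-up example record.
r = Review([True, True, True], [4, 4, 5, 4, 4, 4], "Solid abstract.", "weak accept")
```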
The first sanity check I did was to calculate the average and median of 5 of the Likert values and compare them against the 6th, the “overall” rating. I was inconsistent on 1 of the 11. After re-reading that abstract and comparing it to the others, I adjusted its overall score upward 1 notch to align it better with the 5 specific ratings.
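This first check can be sketched in a few lines of plain Python. The function name, tolerance, and the ratings below are made-up for illustration, not my actual scores:

```python
# Flag abstracts whose "overall" score diverges from both the mean
# and the median of the five specific Likert ratings.
from statistics import mean, median

def flag_inconsistent(reviews, tolerance=1):
    """Return names of abstracts whose overall score sits at least
    `tolerance` points from both summary statistics of the specifics."""
    flagged = []
    for name, r in reviews.items():
        avg = mean(r["specific"])
        med = median(r["specific"])
        if (abs(r["overall"] - avg) >= tolerance
                and abs(r["overall"] - med) >= tolerance):
            flagged.append(name)
    return flagged

# Hypothetical ratings: five specific 1-5 scores plus an overall score.
reviews = {
    "abstract_01": {"specific": [4, 4, 5, 4, 4], "overall": 4},
    "abstract_02": {"specific": [3, 2, 3, 3, 2], "overall": 4},  # out of line
    "abstract_03": {"specific": [5, 4, 5, 5, 4], "overall": 5},
}

print(flag_inconsistent(reviews))  # prints ['abstract_02']
```

Anything flagged gets a re-read rather than an automatic change — the point is to find candidates for a second look, not to overrule the original judgment.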
Then I did a quick pivot table, cross-tabulating my overall ratings against my votes. This let me see whether, for the same vote, I had a range of overall ratings, or whether they all shared the same rating. Of the 11, it seemed at first that I was inconsistent on 3. For instance, 3 entries initially had the same overall rating, but 1 of the 3 had a different vote – the explanation was that it had a ‘no’ on one of the binary judgments, which made it (IMHO) not acceptable, while the other 2 had only ‘yes’es. In another two cases, I had given different votes to two abstracts rated the same. I re-read them to see whether my vote was wrong, my rating was wrong, or other factors were at play. I ended up adjusting the overall rating on one and the vote on the other.
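The same cross-tabulation doesn’t need a spreadsheet; a tiny pivot-table-like grouping does it. The (rating, vote) pairs here are hypothetical examples:

```python
# Cross-tabulate overall rating against final vote to spot abstracts
# that share a rating but received different votes.
from collections import defaultdict

def crosstab(ballots):
    """Group the distinct votes seen for each overall rating."""
    table = defaultdict(set)
    for rating, vote in ballots.values():
        table[rating].add(vote)
    return table

# Hypothetical (overall rating, vote) pairs for each abstract.
ballots = {
    "abstract_01": (4, "weak accept"),
    "abstract_02": (4, "accept"),    # same rating, different vote
    "abstract_03": (2, "reject"),
    "abstract_04": (5, "accept"),
}

# Ratings that map to more than one distinct vote deserve a re-read.
for rating, votes in sorted(crosstab(ballots).items()):
    if len(votes) > 1:
        print(f"rating {rating} has mixed votes: {sorted(votes)}")
```

As in my own check, a mixed row isn’t automatically an error – a ‘no’ on a binary judgment can legitimately split two abstracts with the same rating – but it marks the rows worth revisiting.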
What does this prove? Well, not much – it’s a small, and not-unbiased, study in decision-making! But I did draw two conclusions.
1. Even with clear guidance and well-defined rating scales, and experience as a reviewer, I am still humanly inconsistent.
2. Doing just a wee bit of simple analysis gave me more uniform ratings and votes.
I think it was well worth the 15 minutes it took to build the spreadsheet, and the additional 20 minutes to review, adjust, update, and submit my revised ratings. I feel more confident that I made better judgments, and I figure I owe it to the submitters to try to be as objective as possible in assessing their hard work!
The final validation, of course, will be seeing how well my votes align with the program committee’s final choices. For that, I’ll have to wait for the program to be announced …