Evaluating my 2023 ACX forecasting contest results
79th percentile, doesn't get much better than that
In late 2022, Scott Alexander (Astral Codex Ten) asked participants in his 2023 prediction contest to give probabilistic forecasts for 50 yes/no questions about 2023. The ‘blind’ mode – which is the only mode I will talk about in this post – asked participants to spend no longer than 5 minutes researching each question, and to not look at any prediction markets before entering the forecast.
The winners were announced a few weeks ago (warning: ACX post, very laggy due to Substack’s horrifically awful comment-rendering code), and a few days ago I received the question-by-question breakdown of my results by email.
This post is divided into three sections: comments on the scoring function, comments on the sociology of forecasting in the LW/EA space, and comments on my forecasts and those of two people who did better than me.
The scoring function
In his post announcing the winners, Scott said that “the” Metaculus scoring function was used, linking to an FAQ page that describes at least three scoring functions it might have been. With my detailed results, the CSV file of all forecasts, and the statistics in the winners post, I was able to conclude that a participant’s overall score was calculated as the sum of their Peer scores, divided by the number of questions they answered. I’ll go through what this means so that I can get to why I think this was wrong.
Let p be the probability a participant assigned to the outcome that actually occurred; here we’re only imagining a single question in the contest. The log score L is defined as log(p); all log scores are negative, and closer to zero is better. Let L_mean be the mean of all N participants’ log scores for this question. The Peer score is defined as L - L_mean, multiplied by a factor of N/(N - 1): if your log score is higher than the mean, then you have a positive Peer score.
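A minimal sketch of these definitions (the function names are mine; my actual scoring code is linked at the end of this section):

```python
import numpy as np

def log_scores(p_outcome):
    """Log score for each participant: log of the probability they
    assigned to the outcome that actually occurred."""
    return np.log(p_outcome)

def peer_scores(L):
    """Peer score: N/(N-1) times the gap between a participant's
    log score and the mean log score on this question."""
    n = len(L)
    return (L - L.mean()) * n / (n - 1)

# Three participants gave the realised outcome 90%, 50% and 10%:
L = log_scores(np.array([0.9, 0.5, 0.1]))
print(peer_scores(L))  # approximately [ 1.39  0.51 -1.90]
```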
All of this is fine. The log score punishes misplaced overconfidence: as p goes to zero (i.e. you very confidently predict the wrong outcome), the score log(p) goes to minus infinity. The choice of L_mean as the baseline for the Peer score is irrelevant to the difference between two participants’ overall scores – and hence irrelevant to the rank order of the results – as long as they both answer all questions.
The trickier part is what to do when someone doesn’t answer all questions. Metaculus calculates the overall score by summing the Peer scores, effectively assigning a Peer score of zero to a non-answer. Now the choice of the Peer score’s baseline does make a difference: the median log score for a question is probably not so bad, but the mean can be blown out by some participants being overconfident and wrong.
In the past, Metaculus used to use the median as the baseline (calling the result the “Relative score”), which at first glance seems reasonable. But they’ve switched to using the mean instead, and I think I see why. Suppose a question attracts some overconfident wrong forecasts. If you use the median as a baseline, then this minority of bad forecasts doesn’t matter: your Peer/Relative score of zero puts you in amongst the not-so-bad forecasts. But if you use the mean as a baseline, then the zero point is much lower, dragged down by those logarithms of small forecast probabilities. You get punished if you avoid answering a tough question.
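A toy example with made-up forecasts shows how far a couple of confident misses drag the mean below the median:

```python
import numpy as np

# Probabilities assigned to an outcome that occurred; two confident misses.
p = np.array([0.9, 0.7, 0.6, 0.02, 0.01])
L = np.log(p)
print(np.median(L))  # about -0.51: the old Relative-score baseline
print(L.mean())      # about -1.90: the Peer-score baseline, far lower
```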
Unfortunately, the ACX winners post did not use this good Metaculus scoring function, in which the Peer scores are summed. Instead, the system used sums a participant’s Peer scores and divides by the number of questions that participant answered – throwing out the non-answers entirely. It is as if the Metaculus system were applied, but with each non-answer assigned the participant’s average Peer score on the questions they did answer. This is extremely generous to the non-answers of a participant who does better than average on the questions they do answer.
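A sketch of the two aggregation rules side by side (the array layout and names are mine, not the contest’s):

```python
import numpy as np

def overall(peer, answered):
    """peer: participants x questions array of Peer scores (0.0 where
    skipped); answered: boolean mask of answered questions."""
    metaculus = peer.sum(axis=1)                   # a skip counts as 0
    acx = peer.sum(axis=1) / answered.sum(axis=1)  # skips thrown out
    return metaculus, acx

# One participant, 50 questions: +0.2 Peer score on 25 answers, 25 skips.
answered = (np.arange(50) < 25).reshape(1, 50)
peer = np.where(answered, 0.2, 0.0)
m, a = overall(peer, answered)
# 5.0 vs 10.0 in sum-equivalent units: the ACX rule in effect credits
# every skipped question at the participant's own +0.2 average.
print(m[0], a[0] * 50)
```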
In an update post (ACX link), Scott talks of an ambiguity in the scoring criteria, giving an alternative set of winners. My guess is that the ambiguity is what I’ve described above, and that the alternative set of winners are the appropriate ones, derived from a correct application of the Metaculus scoring function. I don’t think it’s a coincidence that the original first-place-getter is obscure, whereas the alternative winner has an excellent track record of forecasting on Metaculus.
According to my calculation, using the correct Metaculus scoring, I finished 701st out of 3284 participants who submitted at least 13 forecasts out of 50, putting me in the 79th percentile. Scott himself answered all 50 questions, so his original post undersold how well he did: instead of finishing 400th [1], I have him finishing 282nd. The graph in the winners post shows the median superforecaster finishing in the 70th percentile; with the correct scoring, I have it as the 75th percentile.
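The percentile arithmetic is just a rank conversion, using the numbers above:

```python
rank, n = 701, 3284
print(round(100 * (1 - rank / n)))  # 79, i.e. the 79th percentile
```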
And, full disclosure, I scored in the 74th percentile with the original scoring, so the correct scoring makes me look better.
My Python code to do the scoring is available on GitHub.
Sociology
Probabilistic forecasting, together with an associated obsession with prediction markets, occupies a strangely prominent place in the culture surrounding LessWrong. Eliezer Yudkowsky espouses philosophical Bayesianism – representing degrees of belief by probabilities – and this concept has spread to the point where some people will pepper their writing with subjective probabilities. “I think I’ll get a burrito for lunch today (75%)”, that kind of thing.
Making your probabilities well-calibrated is part of becoming less wrong in your beliefs, and being able to distinguish between fine gradations of likely/more likely/very likely/etc. makes you more accurate still.
I don’t know if forecasting the future of AI was a big motivation for studying forecasting in general, but AI forecasts (when will we get AGI? Will it kill all humans?) have long been a thing in the LW world, and AI risk was the founding motivation for LessWrong itself.
The partial overlap between the LessWrong and Effective Altruism communities means that Bayesian probabilities can be found in EA culture as well. A remarkable example of this was a very detailed analysis of the Moore v Harper US Supreme Court case, written by a member of the Samotsvety forecasting group a few months before the case was decided. This does not look like effective altruism to me – people would learn the Court’s ruling soon enough – but the post got moderately upvoted because it was high-quality forecasting.
Prediction markets (betting markets) offer a mechanism by which society can converge on an objective probability, a sort of money-weighted conventional wisdom. If market prices don’t reflect the correct probabilities, then money can be made by someone who has a more accurate model. I assume it was Robin Hanson who brought these efficient-markets arguments to the LW-sphere. Hanson, like Scott Alexander, is a bit of a prediction markets maximalist.
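The arithmetic behind “money can be made” is simple expected value; a sketch with made-up numbers, where p is your model’s probability and q is the market price of a $1 Yes share:

```python
# Your model says 60%; the market prices Yes at 35 cents (made-up numbers).
p, q = 0.60, 0.35
# Staking $1 on Yes buys 1/q shares: profit (1 - q)/q if Yes, -1 if No.
ev = p * (1 - q) / q - (1 - p)
print(ev)  # about 0.71 > 0: the mispriced market pays the better model
```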
The US government basically regulates prediction markets out of existence in all areas except sports and finance. This is very strange for the Land of the Free, and it means that much of the prediction-market activity people talk about happens with play money (or crypto). There is no reason to expect play-money markets to be efficient, but enough people care about their Internet points that they basically work anyway. If you want to know the probability of some event happening, then you’ll usually be better served looking at a Metaculus “market” than asking me.
There’s presumably some benefit for a group of hobbyist forecasters in risking play money rather than real money. Losing real money can hurt, and people can have moral objections to gambling. Putting my own money on the line certainly forces me to concentrate on what I’m doing, but it can be stressful; I haven’t bet for several years. Play money democratises the prediction game.
But I wonder if the US regulation has led Americans to over-estimate the power and usefulness of real-money prediction markets. If people were to list key differences in the politics and society of the US, the UK, and Australia, the US’s lack of prediction markets for non-sport, non-finance topics would not rate a mention.
It’s nice to be able to look up a market probability of who will win the next federal election, but that’s all it is. It’s fine. You can, if you want, get your political horse-race news by checking the betting markets, but this option doesn’t noticeably change the tenor of public debate or journalism.
And there is a lot more to politics than the horse race. Anyone who cares about the “who’s gonna win?” question is probably interested in lots of other aspects of politics and policy – there’s plenty of debate to be had over what laws or regulations are introduced or repealed, who the changes would affect and how, and so forth. It is not realistic to expect esoteric conditional-probability prediction markets to answer these questions, even if they could exist in theory.
Unless you’re in charge of lots of money and need to hedge against some potential policy change, it doesn’t really matter whether the probability of something is 60% or 35%, except to your sense of optimism or pessimism.
Nevertheless, probabilistic forecasting tickles the brain of a certain type of numbery person, and I am such a person. I’m sure I’ll be checking the US presidential election forecast models or markets every day by October, and I enjoyed taking part in the ACX prediction contest. My competitive instincts were satisfied by an above-median performance, and I am annoyed at how Samotsvety are better than me at this.
My forecasts
Below I go through my forecasts question by question, trying to remember what I was thinking, and maybe commenting on the outcome. For comparison, I also have Scott’s forecasts (from the email), and the winner’s forecasts; I assume these are by Datscilly, but my assumption relies on me being correct about the two scoring systems used.
The percentages are the forecast probabilities, and the numbers in brackets are the Peer scores.
1. Will Vladimir Putin be President of Russia?
Outcome: Yes. David: 90% (0.17), Scott: 80% (0.05), Winner: 95% (0.23).
Tough to say much about this in hindsight; I wouldn't be as bold as the winner. He was 70 at the start of 2023, so unlikely to die of natural causes. The war had been going for almost a year and his grip on power seemed strong.

2. Will Ukraine control the city of Sevastopol?
Outcome: No. David: 7% (0.52), Scott: 20% (0.37), Winner: 7% (0.52).
For a while I was reading the daily updates from the Institute for the Study of War, so I had a decent feel for the (lack of) progress of the war. Decent forecast IMO.

3. Will Ukraine control the city of Luhansk?
Outcome: No. David: 20% (0.53), Scott: 30% (0.40), Winner: 17% (0.57).
As above.

4. Will Ukraine control the city of Zaporizhzhia?
Outcome: Yes. David: 90% (0.41), Scott: 93% (0.45), Winner: 95% (0.47).
As above.

5. Will there be a lasting cease-fire in the Russia-Ukraine war?
Outcome: No. David: 65% (-0.36), Scott: 64% (-0.33), Winner: 28% (0.36).
I'm guessing – I can't remember for myself – that Scott and I had a look through the lengths of recent major wars and converged on a conclusion that it was more likely to end. This is the first real case of the winner's judgement being substantially better than mine.

6. Will the Kerch Bridge be destroyed, such that no vehicle can pass over it?
Outcome: No. David: 50% (-0.18), Scott: 60% (-0.40), Winner: 25% (0.23).
Ukraine had hit the bridge in October 2022, reducing traffic but not cutting it all off. My 50/50 guess still feels reasonable, and it seems hard to draw any lessons from the winner being correctly more sceptical.

7. Will an issue involving a nuclear power plant in Ukraine require evacuation of a populated area?
Outcome: No. David: 4% (0.35), Scott: 5% (0.34), Winner: 18% (0.19).
Eh, it seemed unlikely.

8. Will a nuclear weapon be detonated (including tests and accidents)?
Outcome: No. David: 3% (0.25), Scott: 40% (-0.23), Winner: 4% (0.24).
Wow, these are eye-poppingly different forecasts. I think I was lucky here – a North Korean nuclear test (their last one was in 2017) would have been sufficient for a 'Yes' outcome, and that should have put me over 10% at least. Scott's forecast seems way too high to me; I guess he was thinking of Russia, but I had faith that Putin's nuclear talk was a bluff.

9. Will a nuclear weapon be used in war (i.e. not a test or accident) and kill at least 10 people?
Outcome: No. David: 2% (0.06), Scott: 4% (0.04), Winner: 2% (0.06).
No-one getting many points here.

10. Will China launch a full-scale invasion of Taiwan?
Outcome: No. David: 3% (0.12), Scott: 3% (0.12), Winner: 4% (0.11).
As above, but slightly more interesting. You always want to try to find a base rate if possible for forecasting questions, and there's some discussion of how Samotsvety approached this in the Vox profile on them.

11. Will any new country join NATO?
Outcome: Yes. David: 88% (0.54), Scott: 83% (0.48), Winner: 78% (0.42).
Finland and Sweden had already applied for membership, and as I recall, the only potential sticking point was opposition from Turkey.

12. Will Ali Khamenei cease to be Supreme Leader of Iran?
Outcome: No. David: 10% (0.42), Scott: 24% (0.25), Winner: 16% (0.35).
I can't remember if I looked up any actuarial tables for this. Maybe if I had I'd have gone a little higher, I don't know. I might have been lucky.

13. Will any other war have more casualties than Russia-Ukraine?
Outcome: No. David: 9% (0.18), Scott: 8% (0.19), Winner: 5% (0.22).
I guess Scott and I both looked up a list of wars and did some counting to get a base rate.

14. Will there be more than 25 million confirmed COVID cases in China?
Outcome: No. David: 60% (0.33), Scott: 55% (0.45), Winner: 77% (-0.22).
I think China stopped counting?

15. Will prediction markets say Joe Biden is the most likely Democratic nominee for President in 2024?
Outcome: Yes. David: 90% (0.51), Scott: 65% (0.19), Winner: 70% (0.26).
He's the incumbent president, come on.

16. Will prediction markets say Gavin Newsom is the most likely Democratic nominee for President in 2024?
Outcome: No. David: 4% (0.24), Scott: 15% (0.12), Winner: 10% (0.17).
Newsom was not the incumbent president.

17. Will prediction markets say Donald Trump is the most likely Republican nominee for President in 2024?
Outcome: Yes. David: 25% (-0.23), Scott: 40% (0.24), Winner: 63% (0.69).
This was near the time of peak DeSantis. In my limited experience of following them, American primaries are very difficult to predict, since the elections are relatively low-profile and there is little polarisation within each party to keep the polls stable. I've occasionally bet on primary elections, but I've lost money on them overall. I'm pretty sure my forecasts were similar to the betting market odds for the Republican nominee.

18. Will prediction markets say Ron DeSantis is the most likely Republican nominee for President in 2024?
Outcome: No. David: 50% (0.16), Scott: 55% (0.05), Winner: 27% (0.54).
The winner following in the footsteps of Matt Bruenig's classic Twitter thread, which has long since been deleted but went something like: "1/ Trump is gonna win".

19. Will the Supreme Court rule against affirmative action?
Outcome: Yes. David: 70% (0.24), Scott: 85% (0.43), Winner: 85% (0.43).
I can't remember anything of how I came to my 70% figure.

20. Will there be any change in the composition of the Supreme Court?
Outcome: No. David: 12% (0.35), Scott: 10% (0.37), Winner: 50% (-0.21).
I guess Scott and I looked to history (or actuarial tables?), whereas the winner maybe made a wrong guess on a strategic retirement?

21. Will Donald Trump make at least one tweet?
Outcome: Yes. David: 50% (0.18), Scott: 33% (-0.23), Winner: 70% (0.52).
Hahaha, I love this question: "whaddaya reckon?"-style forecasting in its purest form.

22. Will Joe Biden have a positive approval minus disapproval rating?
Outcome: No. David: 12% (0.25), Scott: 13% (0.24), Winner: 17% (0.19).
He was underwater already, and I assume I found some archive of poll graphs to see that approval polls usually don't swing back.

23. Will Donald Trump get indicted on criminal charges?
Outcome: Yes. David: 20% (-0.37), Scott: 62% (0.76), Winner: 40% (0.32).
Well done to Scott here. I thought the judicial system probably wouldn't touch him.

24. Will a major US political figure be killed or wounded in an assassination attempt?
Outcome: No. David: 7% (0.22), Scott: 40% (-0.22), Winner: 10% (0.19).
Strange from Scott; there haven't been many assassinations in the US recently.

25. Will Rishi Sunak be Prime Minister of the UK?
Outcome: Yes. David: 60% (0.02), Scott: 50% (-0.16), Winner: 80% (0.31).
Pretty tough question IMO. Sunak had only been PM for a couple of months, and was the third PM in the year. Usually Prime Ministers last at least a year, but the Tory polling was terrible, and it evidently wouldn't have shocked me if he'd been ousted. The winner's boldness was rewarded.

26. Will the UK hold a general election?
Outcome: No. David: 15% (0.31), Scott: 2% (0.45), Winner: 25% (0.18).
Given the chaos of the UK government at the time, I think Scott's 2% on an early election was way too low. Obviously with the Tories polling badly, it'd be in their interests not to go early if they could hold themselves together, but I don't feel bad about my 15% here.

27. Will Elon Musk remain owner of Twitter?
Outcome: Yes. David: 50% (-0.36), Scott: 75% (0.05), Winner: 88% (0.21).
Is there a base rate that covers Elon? Don't know how I feel about this one.

28. Will Twitter's net income be higher in 2023 than in 2022?
Outcome: No. David: 30% (0.40), Scott: 60% (-0.16), Winner: 30% (0.40).
I think(?) the big deal was a loss of advertising revenue and maybe interest payments? Anyway, I am not a financial analyst, but I did OK here.

29. Will Twitter's average monetizable daily users be higher in 2023 than in 2022?
Outcome: No. David: 70% (-0.26), Scott: 60% (0.03), Winner: 65% (-0.11).
Interesting – I've more or less forgotten what my thinking was at the time. Did I find any graphs of users, and the trend went into reverse? My forecast was in good company at least.

30. Will US CPI inflation for 2023 average above 4%?
Outcome: No. David: 85% (-0.64), Scott: 29% (0.91), Winner: 20% (1.03).
America was really good at the economy last year. This question might have depended on some definitional pedantry, since this page from the Minneapolis Fed gives 4.1% for 2023, but the FRED graph shows about 3%. Still, the rapid fall in inflation in the US surprised me, and morally speaking I deserved to lose points on this question.

31. Will the S&P 500 index go up over 2023?
Outcome: Yes. David: 73% (0.25), Scott: 66% (0.15), Winner: 75% (0.28).
This is just looking up year-on-year S&P numbers and counting how many times it went up (a sketch of this sort of base-rate counting follows the list).

32. Will the S&P 500 index reach a new all-time high?
Outcome: No. David: 25% (0.26), Scott: 50% (-0.15), Winner: 25% (0.26).
As above; I'm not sure why Scott went so high.

33. Will the Shanghai index of Chinese stocks go up over 2023?
Outcome: No. David: 65% (-0.14), Scott: 50% (0.22), Winner: 40% (0.40).
As above; maybe the winner correctly adjusted for some covid economic hangover.

34. Will Bitcoin go up over 2023?
Outcome: Yes. David: 60% (0.43), Scott: 66% (0.53), Winner: 75% (0.66).
As above. Crypto pricing is absurd – there are no fundamentals, it's all based on people guessing what other people think people will pay for it. Anyway, line went up.

35. Will Bitcoin end 2023 above $30,000?
Outcome: Yes. David: 35% (0.62), Scott: 37% (0.67), Winner: 30% (0.46).
Line went up. Bitcoin had peaked over $60k in 2021, so going back over $30k was "in sample".

36. Will Tether de-peg?
Outcome: No. David: 35% (0.47), Scott: 25% (0.62), Winner: 15% (0.74).
I don't think I tried to get a base rate for stablecoins losing the peg; but should the base rate be all high-profile stablecoins, or just Tether and its history? Late 2022 was a good time for crypto drama, and maybe I let the vibes take my forecast too high. Tether still seems suspicious.

37. Will the US unemployment rate (now 3.7%) be above 4% in November 2023?
Outcome: No. David: 13% (0.91), Scott: 65% (0.00), Winner: 15% (0.89).
In most years, unemployment doesn't go up that much.

38. Will any FAANG or Musk company accept crypto as a payment?
Outcome: No. David: 20% (0.41), Scott: 15% (0.47), Winner: 20% (0.41).
Not much of a base-rate-able question, this one. Mostly an Elon prediction.

39. Will OpenAI release GPT-4?
Outcome: Yes. David: 80% (0.27), Scott: 88% (0.36), Winner: 80% (0.27).
And what a wonderful tool it is.

40. Will SpaceX's Starship reach orbit?
Outcome: No. David: 90% (-1.05), Scott: 72% (-0.03), Winner: 60% (0.33).
I thought Elon was good at rockets. There's probably something for me to learn here, studying the history of new large rockets.

41. Will an image model win Scott Alexander's bet on compositionality, to Edwin Chen's satisfaction?
Outcome: No. David: 65% (0.06), Scott: 59% (0.21), Winner: 45% (0.51).
When you prompt one of the AI image generators, you can give it a phrase describing some moderately complex interaction between different objects and adjectives. But current image generators treat the prompt more as a mishmash of keywords than a sentence describing relations between concepts. The classic example is the prompt "horse riding an astronaut", which invariably leads to a picture of an astronaut riding a horse. Scott had bet that an image generator would successfully be able to respect the intended composition of these sorts of prompts. But progress in this direction has been slower than I'd guessed. The resolution of this question seems controversial (Metaculus was at 90% at the end of 2023), and DALL-E 3 was at least pretty close to satisfying the required technicalities for the bet.

42. Will COVID kill at least 50% as many people in 2023 as it did in 2022?
Outcome: No. David: 80% (-0.73), Scott: 50% (0.18), Winner: 67% (-0.23).
🤔 Even on excess-mortality estimates, it didn't reach 50%.

43. Will a new version of COVID be substantially able to escape Omicron vaccines?
Outcome: No. David: 50% (0.12), Scott: 25% (0.53), Winner: 20% (0.59).
Fair to say I am not a virologist.

44. Will Google, Meta, Amazon, or Apple release an AR headset?
Outcome: Yes. David: 25% (-0.41), Scott: 15% (-0.92), Winner: 35% (-0.07).
I can't remember if I tried to base-rate this question.

45. Will an ordinary person be able to take a self-driving taxi from Oakland to SF during rush hour?
Outcome: No. David: 20% (0.16), Scott: 50% (-0.31), Winner: 15% (0.22).
The progress of self-driving cars confuses me. My impression was that things were basically stagnant, engineers unable to handle all weather conditions and unusual cases, etc. But then there was some regulatory change that allowed them to operate over some wider part of California or something? Anyway, I guess I answered this question with my sceptical hat on and was rewarded for it.

46. Will a cultured meat product be available in at least one US store or restaurant for less than $30?
Outcome: No. David: 15% (0.49), Scott: 80% (-0.96), Winner: 5% (0.60).
I can't remember if I looked up how much the price needed to come down, but I'd been quite firmly persuaded that cultured meat was unlikely to be economically competitive any time soon.

47. Will a successful deepfake attempt causing real damage make the front page of a major news source?
Outcome: Yes. David: 10% (-1.07), Scott: 60% (0.72), Winner: 5% (-1.76).
I'm a little salty about this one, and even the Metaculus aggregate ended up at 75% at the end of 2023 instead of 100%, because they weren't sure if some obscure CBC article counted.

48. Will WHO declare a new Global Health Emergency?
Outcome: No. David: 50% (-0.20), Scott: 25% (0.20), Winner: 15% (0.33).
It looks like I should have gone a little lower on base rates; I'm not sure why I went for 50%.

49. Will AI win a programming competition?
Outcome: No. David: 20% (0.50), Scott: 50% (0.03), Winner: 15% (0.57).
ChatGPT-3.5 was mindblowing at the time, but this was still the era in which a tweet showing incorrect GPT-generated code could go viral, presumably spread by people who didn't notice all the errors. GPT-4 is astoundingly good by contrast, and I use it often, but beating the best humans on programming tests that aren't in its training data is too much. I won't be shocked if it gets there, though.

50. Will someone release "DALL-E, but for videos"?
Outcome: Yes. David: 40% (0.12), Scott: 50% (0.34), Winner: 20% (-0.58).
I'm kind of surprised that I was as optimistic about this as I was: video feels inherently many times more difficult than still images, and even though the quality is... not always great, I still find them incredible technical achievements. But late 2022 was a time of rapid progress, and I made a decent forecast.
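Several of the commentaries above lean on base rates, as in question 31's year-counting. That calculation amounts to something like this sketch, with made-up index levels purely for illustration:

```python
# Made-up year-end index levels, purely for illustration.
closes = [100, 113, 105, 121, 130, 127, 142, 155, 151, 169]
ups = sum(b > a for a, b in zip(closes, closes[1:]))
print(ups / (len(closes) - 1))  # fraction of up years: a base-rate forecast
```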
That was fun.
[1] I don’t exactly reproduce the original set of results: by the wrong scoring, I have Scott at 394th rather than 400th. In the question-by-question breakdown, I get a difference at the second decimal place in one of my 50 scores, and in one of Scott’s 50 scores. My guess is that there was some round-off error in the official calculation.
Comments

I think you're correct that Hanson inspired the LW (and thus EA) passion for prediction markets.
Thanks for explaining how the scoring systems changed. Am I correct that, in your https://pastebin.com/yd7eEenf, the variable name Peer_n_ans corresponds to the good scoring system, and Peer_50 to the previous one?
Also, I'm a bit surprised that your Pastebin has 3285 participants, against 3296 in the Excel. There were really only 11 participants who answered fewer than a quarter of the questions?
My hash is f22b1 in case you're curious. It would be funny if I were among those 11; I don't remember even roughly how many questions I answered. I don't think I got any email containing my answers, and the Excel file I downloaded only had two columns, hash and score.