I think you're correct that Hanson inspired the LW (and thus EA) passion for prediction markets.
Thanks for explaining how the scoring systems changed. Am I correct that, in your pastebin (https://pastebin.com/yd7eEenf), the variable name Peer_n_ans corresponds to the good scoring system and Peer_50 corresponds to the previous one?
Also, I'm a bit surprised that your pastebin has 3285 participants, versus 3296 in the Excel file. Were there really only 11 participants who answered fewer than a quarter of the questions?
My hash is f22b1 in case you're curious. It would be funny if I were among those 11; I don't remember even roughly how many questions I answered. I don't think I got any email containing my answers, and the Excel file I downloaded only had two columns, hash and score.
Yes, that is the correct meaning of the fields in the pastebin.
The 11 excluded* all scored near zero by the Peer_50 scoring method, so your score of 0.173 means that you answered at least 13 questions, and probably many more than that. I don't know exactly which row of the original answers CSV file you are, because I don't have the email addresses, and my scores don't quite reproduce the official ones (in the post I speculated that this was due to some round-off error, but I also wonder now about a case like Samotsvety potentially being included in the scoring but not in the CSV file).
*(Three people in the blind-mode CSV file answered zero questions; the next lowest was 7 answers; 97% answered 25 or more questions; 75% answered 45 or more questions. I guess you had to scroll down a long way to submit the form, which selected for people who cared enough to give lots of forecasts.)
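For anyone who wants to reproduce counts like these, here is a minimal sketch of tallying answers per participant. It assumes each row is one participant and that an empty cell means the question was skipped; the column names `Q1`..`Q3` and the toy data are hypothetical, not the real file's headers.

```python
import csv
import io

def answers_per_row(csv_text, question_columns):
    """Count non-empty answers in each participant row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        sum(1 for col in question_columns if (row.get(col) or "").strip())
        for row in reader
    ]

def share_with_at_least(counts, threshold):
    """Fraction of participants who answered at least `threshold` questions."""
    return sum(c >= threshold for c in counts) / len(counts)

# Toy stand-in for the real CSV (hypothetical columns and values).
toy_csv = "Q1,Q2,Q3\n50,,10\n,,\n99,1,30\n"
counts = answers_per_row(toy_csv, ["Q1", "Q2", "Q3"])
share = share_with_at_least(counts, 2)
```

On the real file, `question_columns` would need to be the actual forecast columns, with the demographic columns left out.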
The hashes-and-scores spreadsheet has 0.173 from rows 205 to 212; my scoring has 0.173 from rows 208 to 214. If the list of sorted contestants is nevertheless identical, then you are index 299 in the CSV file at https://slatestarcodex.com/Stuff/2023blindmode_predictions.csv ; since my indexing starts at 0 and there is one row for the header, that means that you would be on row 301. Surrounding indices are 1286, 2371, 2971, 1445, 1398, 299, 1810, 1601, 2408, 3151, 1760, 51. You could try looking through the demographic questions if you answered them to figure out which one you are.
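The index-to-row conversion above is just an off-by-two shift. A minimal sketch, assuming a single header row and a spreadsheet that numbers rows from 1 (the function name is only for illustration):

```python
def spreadsheet_row(zero_based_index: int, header_rows: int = 1) -> int:
    """Convert a 0-based data index into a 1-based spreadsheet row number.

    One is added for 1-based numbering, plus one per header row.
    """
    return zero_based_index + header_rows + 1

row = spreadsheet_row(299)  # index 299 -> row 301
```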
Thanks again.
Scott seems to have taken down that CSV file, because I get a 404 error.
Ah, oops. Here's a copy that someone else uploaded a few months ago: https://github.com/jbreffle/acx-prediction-contest/blob/main/data/raw/2023blindmode_predictions.csv
Thank you for the link! I sorted the CSV by a few variables that would identify me. It turns out I was among the 2000+ lazy/shy participants who didn't do the demographic questions, so I'll probably never know what my forecasts were.
If I'm really 301 (299) in that pastebin, then apparently I did much better on the other three scoring systems, especially the two Brier scores. Although I am quite surprised, I can tell that you understand this dataset better than nearly anyone else, so I have little doubt that these are my results.
For anyone else reading this thread, I just consulted GPT-4 about how Brier scores differ from peer scores (feeding it your text). Its response:
> Penalty Mechanism: The Brier score uniformly penalizes deviation from the actual outcome, whether the forecast is overly confident or not confident enough. In contrast, the Peer score, by leveraging the log score, particularly penalizes overconfidence in wrong forecasts by comparing an individual's score to the community average, which can be significantly impacted by very confident but incorrect forecasts.
> Adjustment for Community Performance: The Peer score adjusts for the mean performance of a community, making it a relative measure, whereas the Brier score is an absolute measure of forecasting accuracy.
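
To make the contrast concrete, here is a rough sketch of the two kinds of score for one binary question. It assumes the peer score is a forecaster's log score minus the mean log score of all other forecasters, which is my understanding of the Metaculus-style definition. It does not reproduce the contest's exact Peer_50 or Peer_n_ans calculations, and it assumes every forecast lies strictly between 0 and 1.

```python
import math

def brier_score(p, outcome):
    """Squared error of forecast p against outcome (0 or 1); lower is better."""
    return (p - outcome) ** 2

def log_score(p, outcome):
    """Natural log of the probability assigned to what happened; higher is better."""
    return math.log(p if outcome == 1 else 1.0 - p)

def peer_scores(forecasts, outcome):
    """Each forecaster's log score minus the mean log score of everyone else.

    Assumed definition; needs at least two forecasters.
    """
    logs = [log_score(p, outcome) for p in forecasts]
    total = sum(logs)
    n = len(logs)
    return [s - (total - s) / (n - 1) for s in logs]

# Three made-up forecasts on a question that resolved "no" (outcome 0).
forecasts = [0.9, 0.6, 0.99]
briers = [brier_score(p, 0) for p in forecasts]
peers = peer_scores(forecasts, 0)
```

The Brier penalty can never exceed 1, while the log score behind the peer score falls without bound as a wrong forecast approaches certainty. Under this definition the peer scores on a question sum to zero, which is what makes the measure relative.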