Two studies show AI benchmarks vastly overstate AI abilities
Date: March 16th, 2026 6:08 PM Author: LathamTouchedMe
No doubt AI is groundbreaking. But maybe a little grounding is in order.
Carnegie Mellon study. AI benchmarks are so narrowly defined that they represent only 7.6% of all occupational tasks. Benchmarks are disconnected from high-value labor tasks.
https://x.com/rohanpaul_ai/status/2033450821850222811?s=46
Alibaba study. Tested code over the course of 8 months. The vast majority broke down over time despite initially passing quality checks.
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49749191)
Date: March 16th, 2026 6:11 PM
Author: ....;..;...;;;.....;;......;;
AI is going to be regarded as a joke pretty soon.
It basically has the same value as Excel
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49749196)
Date: March 16th, 2026 11:20 PM
Author: ,.,,.,.,,,,,,.....................
A joke that spit out the results of a legal research test I gave it in 30 seconds that was much better than anything I'd get from a junior associate after days of research.
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49749973)
Date: March 16th, 2026 11:30 PM
Author: ,.,,.,.,,,,,,.....................
Latest paid version of ChatGPT, forget what it's called.
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49749997)
Date: March 17th, 2026 12:29 AM Author: the walter white of this generation (walt jr.)
This just isn't true, though. You'd fire an associate that gave you the equivalent of a hallucination on 2 occasions (assuming one prior discovery and warning). If the circumstances were unlucky for the associate, you might fire without warning. Whatever's going on with this -- it was asserted to me in 2024 or so that this was a trivially easy thing to fix, and that is obviously just not the case -- yes it gets blown out of proportion sometimes ("haha AI is useless/worthless"), but it is a huge deal practically.
Hallucinations aside, you sometimes just get point-missing or wrong analyses. This is something you also sometimes see from flesh-and-blood associates (particularly summer associates, which I no-joke stopped hiring because of AI), but it's not good.
The reality is that AI is currently a very real competitor of SAs on legal issues. The integration isn't there for facts yet, and it doesn't compete with midlevels even on pure-law yet. Now who knows what the future holds....
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49750100)
Date: March 17th, 2026 10:25 AM
Author: ,.,,.,.,,,,,,.....................
A lawyer certainly shouldn't use AI as the final draft in an area of the law he doesn't know about, and of course you should double-check its work. It can be hit and miss. But the hits are a lot more common than the misses after the ChatGPT paid version release last month. In my research, it actually said at times "I know this isn't exactly the kind of case you were looking for, but ..." and didn't hallucinate at all.
The work product it gave me was simply way too good to not at least allow it to try to take a crack at any legal project within its areas of competence. What do you have to lose? It's freaking 20 bucks a month and spits out great work in 30 seconds.
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49750574)
Date: March 16th, 2026 11:21 PM Author: computer_smasher420
all the models are trained to game the benchmark tests
they're completely meaningless
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49749980)
Date: March 17th, 2026 12:23 AM Author: .,,.,..,.,....,
One of the major reasons why labs are prioritizing coding/SWE (in addition to being a relatively easy revenue source) is that they intend to use the models for AI research. Deep learning is almost entirely an empirical field with a thin amount of theoretical justification for architecture and training regimes, so the ability to rapidly test new systems is essential. If SWE agents can provide plausible architecture ideas and implement them (or test out a variety of ideas specified by human programmers), the model iteration loop becomes much faster. Not to mention the total training compute deployed in a few years will be orders of magnitude greater than it is currently, which will decrease large-scale training run time substantially. The point isn’t that the models are getting substantially better on all tasks currently. It’s that they are improving extremely rapidly on the tasks needed for model self-improvement.
(http://www.autoadmit.com/thread.php?thread_id=5846529&forum_id=2&mark_id=3986969#49750077)