Latest ChatGPT model better than 99.7% of coders, qualifies for US math | AutoAdmit.com

The most prestigious law school admissions discussion board in the world.

Back

Refresh

Options

Favorite

Latest ChatGPT model better than 99.7% of coders, qualifies for US math

olympiad. https://openai.com/index/introducing-o3-and-o4-...

,.,.,.,....,.,..,.,.,.

the METR analysis for this is interesting. the benchmark is ...

.,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

we're at scary capability levels now

What exactly do "scary" capabilities entail.

average/ordinary/typical citizen/person

making me worry

i say it's scary now in the sense i can clearly see what is ...

.,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

"right wing" twitter mischling

That's not pep. That's not diaper.

ctrl-f "accuracy" 0 hits

https://imgur.com/a/o2g8xYK

maybe you should try reading, dipshit. "On a divers...

.,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

I learned enough statistics to spot sophistry.

https://imgur.com/a/o2g8xYK

There's a lot of news this week that suggests openai leaders...

part of it is likely with trying to reduce the threat of com...

.,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

Bump this thread as soon as anyone besides the government us...

https://imgur.com/a/o2g8xYK

You likely won’t have to wait long at this rate

,.,.,.,....,.,..,.,.,.

Many people I talk to are convinced that they're going to pu...

it's because they realize training models is exponentially e...

They likely thought the hardware requirements would be a moa...

,.,.,.,....,.,..,.,.,.

Cr the industry players are starting to realize that the way...

"right wing" twitter mischling

ITT: people that don't actually know anything about computer...

Cyberpunk Age Pervert

Please enlighten us then

,.,.,.,....,.,..,.,.,.

hardware will still be a huge moat despite reasoning trainin...

Cyberpunk Age Pervert

to expand on this: think about the benefits of being able to...

Cyberpunk Age Pervert

Gemini 2.5 Pro kicks the shit out of all o3 and o4 models at...

A fun implication of LLM coding models getting better is tha...

,.,.,.,....,.,..,.,.,.

2.5 pro appeared to be a lot less retarded than o3 mini high...

.,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

It’s a moat when 1) models all scale roughly the same ...

,.,.,.,....,.,..,.,.,.

Data is not the bottleneck anymore. It was clear that proble...

Cyberpunk Age Pervert

I didn’t say data was the bottleneck. I said it’...

,.,.,.,....,.,..,.,.,.

That isn’t quite right. A reasoning model made from a ...

Cyberpunk Age Pervert

Yes but it's not clear that you need to be training for or r...

"right wing" twitter mischling

Real number is 100%. The “0.3% of hominids who are be...

lol we r done here

Poast new message in this thread

Favorite

Date: April 16th, 2025 11:11 PM
Author: ,.,.,.,....,.,..,.,.,.

olympiad.

https://openai.com/index/introducing-o3-and-o4-mini/

Big jump from o3 mini in general. These inference scaling models need to stop progressing soon for them to not to be able to substantially automate AI research.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855738)

Favorite

Date: April 17th, 2025 12:21 AM
Author: .,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

the METR analysis for this is interesting. the benchmark is the length of tasks (measured in human professional work time) that a model can complete with 50% probability. it has been doubling every 7 months for the last 6 years.

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

https://metr.github.io/autonomy-evals-guide/openai-o3-report/

o3 is about 1.8 times greater than 3.7 sonnet and a larger increase in time than the trend would have predicted. likely comparable to Gemini 2.5 pro but it's hard to tell. we'll likely be at scary capability levels in <2 years.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855928)

Favorite

Date: April 17th, 2025 12:25 AM
Author: Juan Eighty

we're at scary capability levels now

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855935)

Favorite

Date: April 17th, 2025 12:29 AM
Author: average/ordinary/typical citizen/person

What exactly do "scary" capabilities entail.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855942)

Favorite

Date: April 17th, 2025 12:49 AM
Author: Juan Eighty

making me worry

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855960)

Favorite

Date: April 17th, 2025 12:37 AM
Author: .,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

i say it's scary now in the sense i can clearly see what is about to happen and the odds of it not happening in <10 years are rapidly diminishing. but at the same time, i can't currently download the latest version of DeepSeek and ask it to walk me through the details of building a bioweapon (such that any idiot could do it), or designing a zero day exploit, or asking it to fully design a research pipeline for creating an efficient self-improving autonomous agent.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855949)

Favorite

Date: April 17th, 2025 10:14 AM
Author: "right wing" twitter mischling

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856645)

Favorite

Date: April 17th, 2025 12:53 AM
Author: That's not pep. That's not diaper.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855964)

Favorite

Date: April 17th, 2025 1:19 AM
Author: https://imgur.com/a/o2g8xYK

ctrl-f "accuracy" 0 hits

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855983)

Favorite

Date: April 17th, 2025 1:30 AM
Author: .,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

maybe you should try reading, dipshit.

"On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours"

reliability decreases with task length, but has gone up considerably over time. reliability is a problem where models are not trained sufficiently well and can't use intermediate tokens to correct reasoning paths. it is becoming less of an issue with every new major release.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855997)

Favorite

Date: April 17th, 2025 1:34 AM
Author: https://imgur.com/a/o2g8xYK

I learned enough statistics to spot sophistry.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856003)

Favorite

Date: April 17th, 2025 1:04 AM
Author: Mailer Daemon

There's a lot of news this week that suggests openai leadership don't believe they can continue winning on model strength alone:
- released a CLI coding agent tool that seems pretty similar to Claude Code and therefore is not super impressive
- discussion of acquiring Windsurf, which is the cheapest and least interesting of the LLM wrapper vscode forks
- trying to launch a social network to compete with the x/grok integration

My opinion as someone working in this space is that the existing models are already good enough for mainstream coding tools, but there needs to be a better interface than clueless users vaguely describing what they want to a chat window that dumps an entire codebase as context. For any other kind of 'agentic' task we are still probably a few years from LLMs matching the cost or reliability of outsourcing to third worlders like that Nate thing.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855974)

Favorite

Date: April 17th, 2025 1:18 AM
Author: .,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

part of it is likely with trying to reduce the threat of commodification. even the local LLMs are becoming pretty decent and 2.5 pro is basically as good as o3/o4 mini and it's free. they want to have a complete package to offer people so they'll pay for a subscription or API access.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855982)

Favorite

Date: April 17th, 2025 1:20 AM
Author: https://imgur.com/a/o2g8xYK

Bump this thread as soon as anyone besides the government uses AI to do something mission critical.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855984)

Favorite

Date: April 17th, 2025 9:40 AM
Author: ,.,.,.,....,.,..,.,.,.

You likely won’t have to wait long at this rate

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856552)

Favorite

Date: April 17th, 2025 1:27 AM
Author: Mailer Daemon

Many people I talk to are convinced that they're going to pull open API access as a product offering in the near term and switch to a fleet of of single purpose applications like deep research. Biggest obstacle seems to be their poor internal code quality, which would support the decision to acquihire the Windsurf team.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855994)

Favorite

Date: April 17th, 2025 9:43 AM
Author: aqua jeet (🧐)

it's because they realize training models is exponentially easier than originally anticipated. there is no model moat, only a product moat. replacing (or reducing demand for) software developers and artists/vfx/etc. people is the lowest hanging fruit with potentially massive rewards, so they're using their first mover advantage to move in while it's still significant

it's not a bad idea. fully exploiting first mover advantage basically how e.g. facebook and google ended up some of the most massively profitable companies on earth despite their core products being not particularly hard to reproduce five years later

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856565)

Favorite

Date: April 17th, 2025 9:49 AM
Author: ,.,.,.,....,.,..,.,.,.

They likely thought the hardware requirements would be a moat too but even the small models are good anymore. Gemma 3, which runs on my normal consumer GPU, is around chatgpt 3.5 quality.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856585)

Favorite

Date: April 17th, 2025 10:06 AM
Author: "right wing" twitter mischling

Cr the industry players are starting to realize that the way to actually make money is to give normies specific tools/interfaces to do specific things rather than just a sandbox AI

All the comments in this subthread are cr

Also another thing to note is that several of OpenAI's recent moves suggest that they're realizing that personalization/'companion' AIs are the biggest home run commercial opportunity in the AI industry. Once everyone has their own AI buddy and/or romantic partner, they're dependent on you, forever. Parasocial Relationships As A Service

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856624)

Favorite

Date: April 17th, 2025 9:51 AM
Author: Cyberpunk Age Pervert

ITT: people that don't actually know anything about computer science and machine learning

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856588)

Favorite

Date: April 17th, 2025 9:52 AM
Author: ,.,.,.,....,.,..,.,.,.

Please enlighten us then

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856589)

Favorite

Date: April 17th, 2025 9:52 AM
Author: "'"'"'"'''

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856592)

Favorite

Date: April 17th, 2025 9:54 AM
Author: Cyberpunk Age Pervert

hardware will still be a huge moat despite reasoning training and distillation. especially when you are talking about running a truly powerful model that eclipses humans in intellectual ability. deepseek engineers for example have conceded lacking compute is their main bottlebeck-one posted it on twitter and deleted it when noticed. export controls hadn't even really started to bite in 2024.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856598)

Favorite

Date: April 17th, 2025 9:57 AM
Author: Cyberpunk Age Pervert

to expand on this: think about the benefits of being able to run a true ASI level model at a reasonable speed vs not being able to do that. it's pretty obvious that compute will continue to increase in value over time. hyper long nvidia

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856607)

Favorite

Date: April 17th, 2025 10:12 AM
Author: Mailer Daemon

Gemini 2.5 Pro kicks the shit out of all o3 and o4 models at performance per task cost and runs on the new Ironwood TPUs. Also it seems like tinycorp is finally making actual progress on their third party AMD software stack.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856642)

Favorite

Date: April 17th, 2025 10:16 AM
Author: ,.,.,.,....,.,..,.,.,.

A fun implication of LLM coding models getting better is that eventually the CUDA lead is dead since you can prompt an LLM to recreate it for Intel or AMD GPUs or port it to TPUs.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856648)

Favorite

Date: April 17th, 2025 1:45 PM
Author: .,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.

2.5 pro appeared to be a lot less retarded than o3 mini high in its thought chain. o3 mini high would write 6 pages of bullshit trying to solve LSAT logic games

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48857378)

Favorite

Date: April 17th, 2025 10:07 AM
Author: ,.,.,.,....,.,..,.,.,.

It’s a moat when 1) models all scale roughly the same as a function of input FLOPs. I think there are good reasons to believe transformers are not the most efficient architecture. They seem to be data inefficient and require several orders of magnitude more language data than humans to reach comparable ability levels 2) you are comparing models at the same point of time. If your competitor can wait a little while for their hardware to get better, training methods to get better and train longer on soft targets generated from the leading model, then the hope of an enduring lead decreases. The hardware dominance narrative starts to look strained when you see the latest Llama model that was released and compare it to DeepSeek.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856629)

Favorite

Date: April 17th, 2025 10:10 AM
Author: Cyberpunk Age Pervert

Data is not the bottleneck anymore. It was clear that problem could be solved using synthetic data and other sources. It's more about the sheer amount of compute required.

Deepseek was based on deepseek v3. The small amount of training they had to do was based on the fact that you "add" reasoning to an already existent model.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856637)

Favorite

Date: April 17th, 2025 10:19 AM
Author: ,.,.,.,....,.,..,.,.,.

I didn’t say data was the bottleneck. I said it’s unlikely they need to train on as much data as they do now. Humans don’t need 30 trillion token training sets. Reasoning and synthetic data is beside the point.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856657)

Favorite

Date: April 17th, 2025 10:22 AM
Author: Cyberpunk Age Pervert

That isn’t quite right. A reasoning model made from a stronger base model is still more powerful. The more you try to move up the more important it will be. You won’t get beyond human level intelligence in every known domain by using small models. Powerful but limited models will be useful but are brittle in noticeable ways. Training out outputs from a model won’t fix this problem. It is notable that Deepseek’s fame is entirely based on having a close to SOTA base model they could build on.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856666)

Favorite

Date: April 17th, 2025 10:07 AM
Author: "right wing" twitter mischling

Yes but it's not clear that you need to be training for or running the most powerful "ASI" model possible in order to be making the most money. See above

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856630)

Favorite

Date: April 17th, 2025 9:53 AM
Author: fluid

Real number is 100%. The “0.3% of hominids who are better coders” is a statistical illusion generated by ape self-esteem heuristics and poorly tuned benchmark weights.

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856596)

Favorite

Date: April 17th, 2025 2:07 PM
Author: ChadGPT-5

lol we r done here

(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48857512)