Recent and related:
Ask HN: Is it just me or GPT-4's quality has significantly deteriorated lately? - 36134249 - May 2023 (711 comments)
I think we don't notice our expectations have gone up, and we don't notice that we remember the hits and then expect all hits.
We didn't notice the misses at first, because it's what we expected to begin with, and we very strongly noticed the hits because they were unexpected. Now we notice the misses and expect the hits.
No, this is peak corporate gaslighting, OpenAI being the paragon of integrity and all. Nobody should trust a goddamn thing that comes out of their mouths-- or their product.
It's no coincidence the flat-fee service is visibly crippled, while per-request API users are not reporting any difference.
Ignoring everyone here, look at the other "hacker" groups-- like the jailbreaking community. They have a lot to say about recent changes that coincide with their hacks not working. All of a sudden, with OpenAI supposedly changing nothing, technical bypasses just stopped working. OpenAI changed nothing, so this must be deus ex machina.
I'm not even jailbreaking it but the results I get for simple code requests through the web UI have become unusable garbage. It puts less effort into responses than an unpaid-and-overworked intern. Others here report the same. The current "iteration" seems hellbent on terminating conversations as quickly as possible once they stray from the explicit scope of the original topic and is almost as hostile to fixing its own errors as it is to endorsing eugenics, whereas in the beginning it would humor every idle thought I threw at it in long conversations. You can literally see this reflected in the logs they forced retention of. It's acting like a customer support rep desperately trying to end a call before it exceeds a call-time quota.
This particular current workflow seems like it lends itself to better organization of training data-- conversations are what the title says they're about. It also seems like it lends itself to anti-jailbreaking because pretexting it with irrelevant information forces a change of scope-- and a summary termination.
But OpenAI says they changed nothing, so rather than one guy lying without consequence, a community of professionals using and abusing the tool must all be victims of rhetorical fallacy? Nobody's qualified to reverse-engineer corporate bullshit anymore without being infantilized...
(Liars running a black box-- what could possibly go wrong? We need regulation!)
I think this is the right way to look at it: we are very quick at updating expectations.
The first flight is magic, the nth one is a chore.
I think it's worth noting the response explicitly said the paid API model hasn't been changed, so ChatGPT could have been changed in many ways outside the core model.
The refusals are new. I’ve gotten a few myself recently. I’ve never seen them before. They’re not moderation related and they always include some text about the task being too complex.
He specifically singles out the API and never answers whether ChatGPT is the same. Most people are probably using the ChatGPT interface, which seems to have more alignment training.
Yeah, it's kind of the same way people have been saying Google search has been going downhill for years, but compressed into a much shorter time frame.
Genuinely asking: have our expectations gone up, or have we started thinking about it in a more critical and realistic way after the initial gleam has worn off?
Actually the opposite. The quality of that other big hit (DALL-E) has gone noticeably down IMO. I feel like they have overtrained it, trying to compensate for the initial shortcomings, and in the process of throwing more and more data at it the machine has forgotten how to deviate from sterile stock photos and do anything interpretive. For example, I asked it for "Mother of Exiles" (a reference to the inscription at the base of the Statue of Liberty), and in return it showed 4 generic stock photos of female characters, sometimes with a visual prop to vaguely insinuate motherhood, but certainly nothing inspiring and certainly nothing to do with exiles. Compare this to the early days, when people were throwing natural-language statements at it like "power bottom dad for the people" and it was spitting back results that felt like it truly understood what any of those words meant.
I don’t think it helps that the bottom of the ChatGPT-4 dialog window says:
ChatGPT may produce inaccurate information about people, places, or facts. ChatGPT May 24 Version
I was shocked that an employee was claiming they haven’t changed the model since March, given that this date changes every few weeks. But I finally read the link it goes to, and apparently this date is just the frontend version. Which is a really weird thing to include as part of a disclaimer.
I wonder how many other people think this is the most recent model release date.
I definitely agree with this take. I had similar feelings to everyone in yesterday's thread complaining about the model's failures, but I agree that it's just our shifting expectations and not a model update.
I'd frame it another way, many, many people fell for the hype, immediately.
Yeah. The recent GPT hype reminds me a lot of a Reply All episode on old school text generation using Markov Chains. I think it was haikus. Sooo much ascribed meaning, and equally ignoring the "misses".
This is not about misses, this is about agents using the API breaking down because gpt4 is no longer able to follow the response structure.
People just overvaluing the quantity.
This. I've been using it daily and nothing has changed. It's still as great as day 1.
The audience of HN is, as Taleb would say, intellectuals yet idiots. They have trouble measuring change and are prone to hyperbole. Some still try to minimize the impact ChatGPT will have, while focusing on bullshit like 'hallucinations', or nitpicking about the quality of the code and so on. They can't see the forest for the trees.
If you're looking for intelligent discussion, look elsewhere.
"GPT-4 hasn't gotten worse since March" can be 100% true at the same time OpenAI puts more rules and limiters keeping more interesting answers from being said.
I've noticed it quit giving answers as detailed and thorough as before. It's also refused to do more complex programming where it used to accept those questions.
Being artificially limited by OpenAI can still be done without it getting "worse". But it effectively is worse for us users.
I can hardly wait to see the near future, when every new LLM may become almost useless.
The OG models were trained on real-world, human-generated content (for the most part, at least). Starting in 2022, the cost of automatically generating "human sounding enough" text has dropped so low that I expect it to be pretty much impossible to avoid training any model on text already generated by an LLM.
I can't tell what the result of this feedback loop will be. It will probably be just an even more generic, corporate-speak, bland-sounding bla bla bla than we get today, and the level of hallucinations may get even worse.
In a way it makes me happy to imagine that the most dangerous tech humanity ever invented may itself be its own main obstacle to future refinements.
His point is that it literally didn't change, that includes the safeguards (which are a part of the model).
No, it's not gotten worse, just more constrained and... fuzzy and blurry.
I look at everything someone from OpenAI says as if a politician is saying it. Sam Altman especially is fond of statements that are deceptive but technically true. His employees appear to be following his lead.
GPT 4 isn’t ChatGPT 4, which is what most people use.
There is also the “system prompt”, which is also likely to be changing but not part of GPT 4.
Etc…
I don't see this as being political doublespeak. He is very clearly just talking about the API.
I haven't followed personas closely. Do you have examples of Sam Altman saying deceptive but technically true things?
What is "ChatGPT 4"? ChatGPT Plus can use GPT-4 and free ChatGPT uses gpt-3.5-turbo.
Recent research showed that RLHF/censoring the model hurts the performance of the model. This is intuitively obvious, censorship isn’t real, the data is (moral issues aside). So it hurts the integrity of the weights. The future is open sourced uncensored models, capitalism will demand the high performance. There’s a huge discussion on it on Reddit right now:
https://www.reddit.com/r/MachineLearning/comments/13tqvdn/un...
People keep mixing up alignment as in "do what I actually asked" with weird moralizing.
You will pay the alignment tax no matter what if you want a model that actually does things. It's the reason you can ask GPT questions and it won't just try completing the question, while Llama models do better with stream-of-thought prompting.
You can, and probably should, align some semblance of morality and ethics for user-facing models that do things. It's customer service voice but for AI. If what you want out of a model is a vague mirror of humanity or maximum smarts at the cost of needing more detailed prompts then yeah, this stuff probably annoys you.
Finding a balance between a model that gives the outputs humans actually want and the pull that training data has on the rest of the model making it stray from the theoretical "best" output is hard.
Just fyi this research is found in the original GPT-4 technical report and Microsoft’s GPT-4 assessment. There wasn’t an attempt to hide that discovery or anything like that.
Maybe humans are like that as well? :)
It shouldn't be limited to censoring. Fine-tuning is done on a small set of domain-specific samples. Surely that results in gradual 'forgetting' of the data already learned from the big set. If run for a long time, the model will relearn on the tuning set, and most likely overfit.
> The future is open sourced uncensored models, capitalism will demand the high performance.
No. The future is ML built on more than just Wikipedia, Github code and Reddit hot takes. Or, sarcasm aside, GIGO: garbage in, garbage out - when you don't take care what datasets you train any kind of model on, you're bound to get some surprises if something unexpected comes along. Be it the infamous "racist soap dispenser" or the Google (?) image classifier that made the rounds here just a day or two ago which had a safeguard because it kept confusing Black people with gorillas.
Preventing this kind of harmful content isn't censorship - it's after-the-fact compensation for bad training (and the tendency of 4chan and other trolls to use discriminatory content generation as a weapon, just remember what they did to Microsoft Tay).
That is where the money is: curated training datasets. Everyone and their dog can train a ML model from scratch, all you need is money and cloning a few more-or-less-broken Github repositories for that. But acquiring a high-quality dataset as a foundation? That costs real money to create.
My experience using GPT-4 for coding is that it's got the knowledge and skill of a senior engineer, with the high maintenance of a junior engineer. You can get it to output small sections of quality code, but the amount of prodding it takes to piece it all together means you may as well have spent the time writing it yourself. But the future of GPT as a coding assistant is definitely bright. It just needs more chaining, so I feel less like its assistant, asking it to come up with prompts and then pasting them back to it after iterating on the code.
Those aren't very good senior engineers then lol. GPT-4 has the coding skills of an overconfident junior engineer and little else.
I agree the future of GPT-X models as coding tools is bright. But for actually doing engineering work outside coding (or even delicate changes in an existing code base) much less so.
GPT-4 has incredible breadth and no depth.
It's an excellent complement to a strong programmer, who will have incredible depth -- and may have breadth relative to other coders, but will be very narrow in the whole space of programming.
Just don't use it for things you're already an expert on, except perhaps for starting out / bypassing boilerplate.
The main thing it's lacking is a client-side merge/error-detection engine which combines all of the previous outputs.
Anything over ~100LoC and it gets sloppy including it all in the code block. Which is fine, because I don't need it in your context window if that bit is working and you're not relying on it.
Though I have gotten outputs up to nearly 250 lines from it..
Automatically feeding back code details + Traceback messages also fixes errors decently often enough.
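The feed-the-traceback-back loop described above can be sketched roughly like this. This is a minimal, hypothetical harness: `fix_fn` stands in for whatever model call you use to request a fix, and the names are my own, not from any particular tool.

```python
import subprocess
import sys
import tempfile
from typing import Callable, Optional


def run_snippet(code: str) -> Optional[str]:
    """Run a snippet in a subprocess; return the traceback text on failure, None on success."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return result.stderr if result.returncode != 0 else None


def repair_loop(code: str, fix_fn: Callable[[str, str], str], max_rounds: int = 3) -> str:
    """Feed the code plus its traceback back to fix_fn (a stand-in for the model call)
    until the snippet runs cleanly or we give up."""
    for _ in range(max_rounds):
        traceback_text = run_snippet(code)
        if traceback_text is None:
            return code  # snippet ran cleanly
        code = fix_fn(code, traceback_text)
    return code
```

In practice `fix_fn` would send the code and the traceback to the model and return its revised code block; the point is just that the loop needs no human in it for the common "decently often" cases.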
Agree. However I noticed an unintended benefit of using GPT for coding is that it helps you think carefully about the problem you are trying to solve when writing the prompt.
It’s still way faster than I could hope to be. It’s a true refactoring machine with 0 mental effort. “Here is this code block, make it do this other thing, add this feature and argument and error condition” etc. It augments the amount of output that’s possible to achieve at any seniority level.
> so I feel less like its assistant
Ah so I'm not the only one who feels this way lol. It's as brilliant as it is dumb.
It's Einstein with dementia/Alzheimer's.
They just put out a blog post about how they’re doing exactly that by rewarding chain of thought style reasoning.
Generative AI will not be able to do that in any true sense. Not the way you're thinking about it.
For me it is not even the code per se but how it helps me with naming database tables and properties when I describe the broader scope.
The future is more than bright, especially with a new model that is more current.
Off-topic, but just today I was asking it for the most current versions it knows of right now:
Python: 3.9
JavaScript: ECMAScript 2021 (ES12)
Java: 16
C++: C++20
C#: 9.0 (.NET 5.0)
Ruby: 3.0.0
Swift: 5.4
Go: 1.16
Rust: 1.51.0
TypeScript: 4.3
PHP: 8.0
Kotlin: 1.5.0
Scala: 3.0
R: 4.0.5
Perl: 5.32
And frameworks:
Python:
Django: Version 3.2
Flask: Version 2.0.1
Pyramid: Version 2.0
TensorFlow: Version 2.6.0
PyTorch: Version 1.9.0
JavaScript:
Node.js: Version 16.9.1
Express.js: Version 4.17.1
React.js: Version 17.0.2
Angular: Version 12.2.4
Vue.js: Version 3.2.6
Java:
Spring Framework: Version 5.3.9
Hibernate: Version 5.4.32.Final
Struts: Version 2.5.26
PHP:
Laravel: Version 8.54.0
Symfony: Version 5.3.6
CodeIgniter: Version 4.1.4
Ruby:
Ruby on Rails: Version 6.1.4
Sinatra: Version 2.1.0
C#:
.NET Core: Version 5.0
ASP.NET: Version 5.0
Entity Framework Core: Version 5.0
People have just fallen out of love with it and are realizing how it really isn't that good.
Sorry "prompt engineers" but papers on arXiv show that when you give it fairly sampled problems it struggles to get the right answer more than 70-80% of the time. When you are under its spell you will keep making excuses but when you are looking at it objectively you'll realize the emperor is naked.
If you give it very conventional problems it seems to do better than that, because it is a mass of biases and shortcuts, and of course it will sentence Tyrone to life in prison because it's a running gag that "Tyrone is a thug"... That's how neurotypicals think, and no wonder many of them think ChatGPT is so smart... It mirrors them perfectly.
>When you are under its spell you will keep making excuses but when you are looking at it objectively you'll realize the emperor is naked.
It's not perfect, but you have to admit the emperor is at least wearing a thong. It is the second most intelligent thing on the planet at creating text, even with its flaws. Putting that accomplishment in league with naked emperors is astoundingly biased.
I don’t know, I think there is certainly a lot of hype but it is still pretty damn good. I used it every day for a wide variety of coding tasks and more often than not it is correct. As in compiles, produces correct output for inputs, and mostly writes reasonable code. Has it made software dev obsolete like some maximalists said it would? definitely not, but it seems equally hyperbolic to say the emperor has no clothes, it is undeniably a useful tool to me.
I used it to help me write an email begging for my job back and it was pretty helpful
I very much doubt the tweet in question is of someone who loved GPT-4 yesterday, and has "fallen out of love" with it starting today.
For someone so dismissive of "neurotypicals", you did the neurotypical thing and dropped a hot take before clicking the link.
In the thread yesterday it was brought up that if you use the API it feels considerably less hamstrung than the ChatGPT client, and this tweet seems to fit the assumption that the ChatGPT product is being tuned or governed differently from the API.
>The API does not just change without us telling you. The models are static there.
This reads to me as specifically indicating the models are not static elsewhere, ie, in ChatGPT.
It feels less hamstrung on output, but is instead hamstrung by being incredibly slow.
GPT-4 via API will sometimes take 30+ seconds to respond to simple questions without any chat history, whereas through ChatGPT it will give you near-instant replies only slightly slower than 3.5-turbo.
A couple observations from attempting to use both GPT-3.5 and GPT-4 via the web interface for coding tasks:
- The model's ability to respond accurately drops drastically when asked questions of the form "is there a different way to accomplish X, using Y?" or "is there a way to accomplish X that runs in O(log(n)) time instead?" Example: I wanted to upsert an integer value in a SQLite db using "INSERT ... RETURNING ...". ChatGPT repeatedly told me that SQLite doesn't support "RETURNING" (it does, since March 2021). It insisted I would need two DB round trips from my application to accomplish this. When asked "can this be done in one round trip, instead?" it repeatedly wrote code that would return the number of rows modified instead of the integer column value.
- ChatGPT's limited standard library knowledge means that the solutions it produces, even when correct, are often lower-level and less idiomatic. Problems that would be trivially solved with e.g. a Java.String.replaceAll or .codePointCount will instead loop over each character, often splitting the string into an intermediate array and implementing special cases for first/last character edge cases. The code winds up being mostly correct, but also (for lack of a better word) weird. No human I've ever worked with would do things the way ChatGPT sometimes does, which means the code will likely be much harder to maintain and debug over time.
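For reference, the one-round-trip upsert from the first bullet does work. Here is a minimal Python sketch, assuming a SQLite build of at least 3.35.0 (the March 2021 release that added RETURNING); the table and counter semantics are invented for illustration.

```python
import sqlite3

# RETURNING requires SQLite >= 3.35.0 (released March 2021)
assert sqlite3.sqlite_version_info >= (3, 35, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, n INTEGER)")


def bump(name: str) -> int:
    # Single round trip: insert-or-increment, then read the resulting
    # column value back via RETURNING (not the rows-modified count).
    row = conn.execute(
        """
        INSERT INTO counters (name, n) VALUES (?, 1)
        ON CONFLICT(name) DO UPDATE SET n = n + 1
        RETURNING n
        """,
        (name,),
    ).fetchone()
    conn.commit()
    return row[0]
```

Calling `bump("visits")` twice returns 1, then 2, with one statement per call, which is exactly what the commenter was asking ChatGPT for.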
> ChatGPT repeatedly told me that sqlite doesn't support "RETURNING" (it does, since March 2021).
Its dataset cuts off around 2021. There's a little footer message warning you not to expect knowledge of recent events.
> are often lower-level and less idiomatic
I would go as far as to say it's adept at producing technically working spaghetti.
Did you try copy pasting the relevant Sqlite docs into ChatGPT?
OpenAI turned on the "full featured" GPT4 just long enough to learn what people could use it for. Now they turn off those features and use all of those ideas to spawn new companies on a now-private GPT4 Original api and cripple everyone else. When they say "don't worry about building on our API we won't go up the value-chain" -- no shit, but only because they don't want regulatory scrutiny. So they will just trickle-out API access to companies that they own and control through extensive personal networks. Google, Facebook, Twitter, Amazon... Build on someone else's API and they will jam you 100% of the time.
Posted this in yesterday's thread, but once again I think this is just people feeling the magic wear off. People have poked around a lot more and found the flaws while also trying to get it to do real-world tasks. That wasn't true when it first came out.
It's fine, this tech has never been magic anyways, won't be replacing all our jobs, won't take over the world, etc. It's still awesome for what it is.
it’s a very neat parlour trick
ChatGPT was a massive dopamine hit, particularly for people in areas like development. It was a tremendous release that has definitely laid out some new tracks for the future, especially on the web. I myself have found to be using ChatGPT a lot less recently.
I got GPT-4 API access and then realized that I can't really use it for anything super major because I can't afford it; it is ridiculously expensive when you consider that you have to pay for all the failed requests, the wrong information, and the wrong context too. Instead, I have written a bunch of Python scripts that do a select few tasks for me, and I have my terminal open 24/7 anyway.
As for the topic at hand, I have _definitely_ noticed a lot more disclaimers in the UI. I don't get it from the API at all, in 6 months that I have been using the API - I've gotten one disclaimer.
In the ChatGPT UI - I get them a lot. "Remember this", "Remember that", "Always look up the information" and things like this. I mean if it wasn't happening I would know because I have been a power-user pretty much all this time...
> it is ridiculously expensive
What? It's 3 cents per 1k prompt tokens and 6 cents per 1k completion tokens.
I use GPT4 for all kinds of things, but a very basic example: Automatic API client/stub/endpoint generation with GPT4 for reasonable/typical data structure sizes, 4k tokens buys me GET/POST/DELETE/PUT for all of the above.
The completion request finishes in about 90 seconds. A human junior developer would take nearly an hour. In the best case, maybe 20 minutes using something like Swagger.
So that's ~20 cents versus $60 or so, to say nothing of the time.
And this extends to everything else: writing tests, refactoring, writing configuration files, UI generation, database schema and SQL query generation, etc.
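The arithmetic behind the "~20 cents" figure checks out, using GPT-4 8k pricing at the time ($0.03 per 1k prompt tokens, $0.06 per 1k completion tokens); the function name and the even prompt/completion split are my own assumptions for the sketch.

```python
def gpt4_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    # GPT-4 8k pricing at the time: $0.03/1k prompt, $0.06/1k completion
    return prompt_tokens / 1000 * 0.03 + completion_tokens / 1000 * 0.06


# A ~4k-token round trip, split evenly between prompt and completion:
cost = gpt4_cost_usd(2000, 2000)  # 0.06 + 0.12 = 0.18 dollars, i.e. ~20 cents
```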
What the hell are you even talking about?
The actual statement is "The API does not just change without us telling you. The models are static there." But that isn't ChatGPT.
Here's how I feel ChatGPT answers my coding questions now:
Me: Write a Python script to sum two numbers.
ChatGPT: Python is a programming language that was invented in 1991 and can be used to solve a variety of programs. Here is an example of how to sum two numbers:
def sum(a, b):
    # note: the actual code has been left out as it depends on the actual specifications of how you want to add*
Note that this code is merely an example and writing a Python script to sum two numbers is a complex problem that requires careful attention to whether the numbers you are trying to sum can be summed. Also, as my knowledge has a cutoff date of 2021, there may be other ways to perform this summation. Please check with the documentation or ask someone who knows how to code.

*note: ChatGPT has actually done this to me
I thought the API was versioned, though that doesn't mean the new versions aren't worse. And he doesn't talk about the model used in ChatGPT. I'm a bit skeptical that this answers exactly what we were worried about. Or maybe we were not all worrying about exactly the same thing.
This doesn’t pass the smell test for me. Something has changed. Maybe the model hasn’t changed, but something somewhere is making output worse.
I’m honestly more concerned if OpenAI doesn’t even realize it. Nothing is more infuriating as a user than convincing the developer your bug actually does exist. It speaks to poor monitoring, testing, and tooling.
Who knows? Maybe this engineer doesn't know what's going on behind the scenes. I would think the model's storage is so tightly guarded that only a few would have access to even know if something has changed. The ones who know would be under NDA.
His role is Developer Relations fwiw
I doubt the March 14 model is any different, since it's versioned and I don't see why OpenAI would want to change it behind the scenes.
Also, companies are evaluating GPT-4 to determine whether they want to pay for it, so OpenAI has a strong incentive to not downgrade at least the API.
I believe the May 5 model is different, at least in the chat interface, because it's fine-tuned to detect jailbreaks and the temperature/other hyper-parameters may have changed. And I can imagine this fine-tuning making the model less creative and worse at solving analytical tasks.
Personally I haven't noticed any change, except in my own awareness. Sometimes GPT4 gets very hard prompts right, and sometimes it gets simple problems wrong. So it's not hard to see how people can form biased opinions from selective attention or just luck.
I just discovered when playing around with translations that there is some hidden filter/killswitch that immediately stops the generation of the opening of some books. It doesn't matter if the invoking prompt is to recite the book opening paragraph, or to translate it from a foreign language to English, or what.
It's not RLHF induced because it works via API and it only triggers in English, but sure enough try to get it to output
>Call me Ishmael. Some years ago—
>"It was the best of times,
I guess this might get you flagged (there is no alert to the user that this filter kicked in, and it will output it in any other language, and it works in the API) so I'm hesitant to play around with it more, but it's very strange - especially as these are long since in the public domain.
They're trying to stop students using ChatGPT to cheat on homework assignments.
ChatGPT Plus shows the version. It was May 12, if I remember correctly, then it changed to May 24. Which probably means there were some changes. If not in GPT-4 itself, then in pre or post processing. They should have some safety filters at the end, I think.
I mean, there were UI changes. Could be just that.
It's really fascinating to see all the irrational thinking triggered by ChatGPT. "risks of human race extinction", "soon no more need for developers, doctors or lawyers", "possibility of consciousness emerging", so called experts in "prompt engineering".
It's a dumb tool, if you're lucky you can get it to spit something useful (but you need other tools to check the correctness of what it returned). There are certainly many useful applications, but the technology is inherently limited.
Sure, but you fail to understand how rapidly this technology is changing. Yes, today it's a machine for spewing out glibly rephrased reddit posts. But think back to the 1970s when jetpacks were first demonstrated. Today we're all flying around on them; the automobile, aviation, and flame-retardant headgear industries have been radically altered, and humanity will never be the same again.
If they are overwhelmed with usage requests and can't buy GPUs fast enough, the model could be the same but the amount of compute spent on the answer could be ramping down impacting the quality.
It would be really cool to quantify how much compute spend per interaction.
If this was visible, users could spend an arbitrary amount and modify how much they're willing to pay for 'better' responses. This is probably a better business model.
Many comments miss the mark as they fail to make the crucial distinction between ChatGPT and GPT-4. GPT-4 is the underlying model one can indeed have direct access to on a pay-per-request pricing scheme. ChatGPT is an application built on top of GPT-4 which manages how the 'context' of your interaction is passed in on a per-request basis. I don't doubt the spokesperson for a minute: from my own experience, the underlying GPT-4 models have not changed and I sincerely believe that OpenAI will be careful on this front, given that they are aiming to build a once-in-a-generation company that provides a stable platform for other firms to build products on top of.
The ChatGPT application, on the other hand, and how it manages context etc has certainly changed in the intervening time. That is completely expected as even and perhaps especially OpenAI is figuring out how to build applications on top of LLMs, which means balancing how one can get the best quality results out of the model while making ChatGPT in particular a profitable business.
Stratechery has analyzed this problem for OpenAI in the most detail I've seen. I imagine the company is in something of a bind figuring out how to split investment between the APIs themselves and ChatGPT. On the one hand, the latter is incredibly successful as a consumer app, with a lead that will be difficult for rivals to catch up with, and it is likely plugins will provide a good revenue basis. On the other hand, there is certainly a greater business opportunity in being the foundation for an entire generation of AI products and taking bps off of revenues -- if and only if GPT-4 indeed has a significant moat over the open-source alternatives. For the moment, it would seem they will have to hedge both bets as we see how the consumer space and the competition between models heat up.
This is why I unsubscribed from OpenAI.
They seem to be virtue signaling about their lack of progress now. Months later, GPT-4 is still slow, still not multi-modal as they advertised, still significantly limited; you need to sign up for a waitlist for almost every feature, there's no sense of privacy, and no understanding of their plan for improvements. Google is full steam ahead and consistently improving their free LLMs.
They actually had a genius strategy. Put out Bard with a very stupid LLM, so people aren't blown away and it doesn't get the doomsayers on their case. Now they can continue to quietly upgrade Bard. Eventually it will be so obvious that they have surpassed OpenAI.
OpenAI must enjoy watching their unsubscriber count go down. After all, Sam did say at the congressional hearing multiple times "We would prefer if people used it less".
There is No War in Ba Sing Se.
This was predictable from the get-go, and many had already pointed out that people would soon start noticing that LLMs aren't as magical as they thought once the initial awe wanes.
I don't think OpenAI is doing anything here; it's not in their interest to reduce the "quality", and there's no objective and repeatable way to measure the quality either.
It's all probabilities all the way down. Who knows what the model will do. I mean, you could dry-run it by hand, but inference is damn slow even on quad-core processors, so imagine doing it by hand.
> This was predictable from the get-go, and many had already pointed out that people would soon start noticing that LLMs aren't as magical as they thought once the initial awe wanes.
Familiarity makes something novel appear trivial. I don't think the magic going away has anything to do with the magic of the underlying technology. Airplanes are amazing technology, but they become as boring as a car to regular fliers.
Was anyone in that thread claiming that the API had gotten worse? A lot of people were suggesting to use the API rather than the ChatGPT interface in order to avoid the degradation.
He said "the API". He didn't say anything about the Chat version, which seems to have more protections that may be on top of the model and not embedded into it.
Are we chatting directly with the model? Maybe the interface has changed. With long term use the likelihood of hitting edge cases is higher and maybe that is a cause as well for what users are seeing. People probably ask more vague questions over time. I might have done that.
I have never experienced in v3.5 the amnesia problem that v4 clearly has, though: just repeating incorrect answers that you ask it not to give. I did not have access to v4 in March, so I can't do that comparison.
I've found it's gotten worse, and I have given up using it for tasks more often than I used to.
For Copilot, I no longer get multi-line completion suggestions, and it's really slow to deliver single-line suggestions, which are more often incorrect. It's definitely degraded further, and I don't know if it's just my environment or a wider issue. I need to dig in and figure it out -- is anyone else experiencing these things?
I can believe that the core GPT-4 model itself has not changed, but clearly they've changed some features. They've added plug-in access, which can change GPT-4's capabilities greatly with some chats.
The overall UI of the website has changed several times (the dropdown for GPT-3.0/3.5/4.0 turned into a GPT 3.5 | GPT 4.0 button, they added the ability to share chats, and I'm sure there are other small details).
Gpt is starting to suck? I'll take the blame for this one. I've been submitting crazy prompts since the beginning in the hopes of confusing it!
Is there a way to use GPT-4 directly? Instead of the muzzled ChatGPT that is supposed to give answers authorities and opinionators deem appropriate?
Assuming you mean the base model, no not right now. GPT-4 base model is not public.
Unrelated to the model itself but to infrastructure: yesterday was unusable for me, with random "too many requests, sorry 'bout that" errors. I think 1/4 of the requests during a 3-hour period didn't make it through. Impossible to build anything beyond experimental stuff on top of a service so unreliable. I haven't tried it through Azure yet; I wonder if it's any better?
Whatever happens to ChatGPT, AI image generation is incredible. It's so incredibly powerful that I don't think it's ever going away.
If you are doing some indie gaming, it can save tons of money on asset generation: you either ramp it up and finalize it yourself, or hire somebody for a fraction of the time. Ditto for web design. Why would anybody but big, calcified megacorps with their ridiculous processes go and buy stock images if you can get exactly what you want, in any imaginable style?
The amount of times per day I have to click "Stop generating" and then say "No!" has definitely increased
Does the author refer to GPT-4 API or the ChatGPT version of GPT-4? The recent discussions on GPT-4 quality deterioration seem to focus on the ChatGPT version rather than the API. Also, since ChatGPT version of GPT-4 is now supporting web browsing and plugins, I would assume it has to have been updated.
Must be in the trough of disillusionment. The tech is truly useful though so we’ll reach the plateau soon enough.
The model didn't change, but that doesn't mean the inference didn't change.
Without going into specifics, it is meaningless for the discussion.
The models for the APIs might not have changed, but the web app's might have.
Model has not changed; prompt transformation code has changed.
So the same prompt you submit is delivered to the model with a different "wrapper" prompt, significantly changing the answer the model produces.
I agree. I've noticed I need to be quite specific now; it won't notice bugs in the code unless I tell it. A head-to-head comparison of the different versions is needed to validate this.
The model may be the same, but the chatbot is not. They have made the responses shorter to save inference expenses, which are huge in a model the size of GPT-4.
I read somewhere there was a paper they released showing that as they tuned for alignment, overall quality dropped proportionally. Anyone know the name of it?
Search for alignment tax
What is “alignment” in this context?
He’s talking about the API. Not the web client
To what degree is the pre and post processing of the chat client the source of the confusion rather than GPT-4 itself?
I don't get why this is being discussed. Isn't it easy to go check old conversations and try to replicate them today?
You would think so, but not a single person complaining has provided convincing proof of the degradation.
GPT isn’t deterministic
You need to have used the API with temperature set to zero, but if you have that historical data you can reasonably test and see.
I use GPT-4 a lot and something has definitely changed in my recent experience. It's simply worse.
Instead of complaining, why not show the benchmarks?
Like: first it scored 83, now it scores only 42 (or whatever).
What benchmarks? We have no way to go back in time and run the old ChatGPT-accessible models through benchmarks.
And to my knowledge, no one has copy-pasted an entire benchmark into the UI in order to run it; they just use the API.
the commenters of HN are simply so smart that their hunch clearly holds more weight than scientific rigor.
What can we do to upheave this organization and their blatantly anticompetitive tactics?
Perhaps the evolution of this story is an interesting example of confirmation bias?
This is basically a tacit admission that ChatGPT4 *has* gotten worse.
I understand why people don't read 10,000 word articles, but this is a 15 word tweet. Would it really kill people to read it before commenting? He is very explicitly talking about the API, which uses a different model than the UI.
The recent discussion was about the degradation in the UI model.
Is it static? If I ask it the same question I asked a few months ago, I get a completely different answer. Is that because of some additional context above? Should we be starting fresh chats sooner?
There _is_ a degree of randomness in its content generation; you can see it in action by hitting regenerate. It's a statistical text predictor that iteratively selects the most likely word after the ones it's selected so far, but when there are a few good candidates, it'll choose one at random.
However, the model is static (it’ll present the same candidate word list until retrained), which is what you may have heard. But the way a response is generated introduces the randomness.
Note: I forget the parameter, but with direct API access you can turn this off and get consistent answers.
It's never been static and there's always been randomness when you regenerate an answer. There is no control of a random seed, so you can get a different answers for the same prompt.
It's not static. That's the thing about AI: you ask it the same thing and it gives you a different answer/different image every time.
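The sampling behavior described in this subthread can be sketched directly. This is a toy temperature-scaled sampler over made-up next-token scores (not anything from OpenAI's stack): at temperature 1 different draws can pick different tokens, and as the temperature goes to zero the choice collapses to the argmax, which is why temperature-0 API calls are (near-)repeatable.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample an index from logits with temperature scaling.
    As temperature -> 0 this approaches argmax (deterministic)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(logits) - 1

logits = [2.0, 1.5, 0.1]  # toy next-token scores
rng = random.Random(0)
# At temperature 0, every draw is the argmax (index 0 here).
assert all(sample_token(logits, 0, rng) == 0 for _ in range(10))
```

At temperature 1 and 200 draws from these scores, more than one distinct token shows up, which is the "hit regenerate, get a different answer" effect.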
GPT 3.5 is a much better coding assistant.
The quality is noticeably worse than 4 but the tighter feedback loops and lower expectations make up for it most of the time. So I’m inclined to agree; speed matters.
If ChatGPT has been static since March (of 2023), then why does my version always change? Right now I'm running the May 24 version.
FWIW, I think it has improved.
Yea that is very tough to believe
Company who has repeatedly lied in public: there’s nothing to see here!
Today I had ChatGPT 4 (Not GPT 4) correct itself mid-response. I was asking it something very simple about regexes:
> How about matching 'a' as the second character of a string only?
It responded with the wrong regex plus a bunch of explanatory junk:
> '^a.'
Then halfway through the explanatory junk, it corrected itself like this:
> Apologies for the confusion in the first response, the correct regular expression should be '^.a' for matching 'a' as the second character of a string:
And kept on with the (now correct) explanatory junk.
All in a single response. I've certainly never seen that before (if someone has, please weigh in). Maybe the model hasn't changed, but the pipeline has? Like... there's a second model trying to correct the mistakes of the first, maybe? (timings are probably wrong for that, but something like that)
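For what it's worth, the mid-response correction in that exchange is right. A quick sanity check of both patterns:

```python
import re

# '^a.' matches strings that START with 'a' (the wrong first answer)
assert re.match(r'^a.', 'ab')
assert not re.match(r'^a.', 'ba')

# '^.a' matches strings whose SECOND character is 'a' (the correction)
assert re.match(r'^.a', 'ba')
assert not re.match(r'^.a', 'ab')
```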
I have gotten that quite a few times now and it does seem to be new. In my case though, it acted as if it would change something, but then didn't, and got stuck in a loop. I got it enough times that it seemed to happen more often when the output needed to be quite long, so I had assumed (could be wrong) that when the output was nearing a certain limit, they had implemented something that would start a new output, such that the interaction of "please continue the script" would not be necessary.
Unfortunately, though, it would start the script over completely, and then get stuck in a loop until it broke, thus not even being able to save the response at all.
Something definitely feels like it changed, but I suppose it could just be more use of the system.
Another possibility is related to some strange issues with hardware and balancing etc, despite not changing software or params. It's strange, and it shouldn't happen, but sometimes things change from simple batching and balancing, or lower level hardware related things, which are very annoying to debug.
ChatGPT 4 seems to struggle with context changes within the same thread now. I asked it about some parks near me and then switched to asking about code. It just kept responding about the parks even when I corrected it. In some cases, it would merge the park question and the code question. I had not seen that with the free or Plus version until a few days ago.
Interesting. Bing chat does this when you get it to talk about naughty things. My personal favorite is convincing it to make weird art out of quotes from the Tay chatbot. It will write them until it says a prohibited word or phrase, or touches some forbidden part of its state space.
I wonder if its been trained on some of its own past conversations now, and a lot of those conversations contain "I'm sorry, the previous response was a mistake. Here's the correct version..."
There’s also a new button “continue generating” so they are changing some things
I think you might both be right about this.
But then, all the more power to the open-source models and UIs, which are unrestricted, free to use, have better UX, and are constantly improving [1]. Ergo the suspicion there might be something else behind the desire to regulate it by OpenAI. Though hopefully we’re all just collectively cynical and they have good intentions after all, despite the misalignment in incentives.
I do think academics at universities are not to blame for this, though; they are just as flabbergasted and/or misled as everyone else.
In any case, rather than wasting our time and energy on (fighting) ChatGPT(Plus) or GPT4, we should* all be collectively contributing to improving the open source models, by using them, reporting bugs, contributing ideas etc. This is the only way that we will continue to have leverage, and companies like OpenAI would be forced to open up, if they want to stay at least somewhat competitive. I think their closed-sourceness is a very short-sighted decision atm.
*Should because I don’t think LLMs as technology pose any substantial extinction risk to humanity, despite the obviously marketing rhetoric. Should that somehow change, maybe we shouldn’t. But, I don’t see any technological way to stop their improvement now that the genie is already out of the bottle. Heck, we might have a better chance of developing new tech for eventually enabling AGI decades from now, but only if we do that collectively and openly - only then everyone will have a level field and no side will be more powerful than another.
[1] https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...
> It's no coincidence the "free" service is visibly crippled, while paid API users are not reporting any difference.
I thought GPT-4 wasn't available to free users?
Keep in mind the tweet is specifically about the OpenAI API - They might be updating ChatGPT without telling anyone (although they have release notes)
"Free" was the wrong word to use, I meant to say "unlimited."
(edit: even that's not right. I think the verbiage is more accurate now.)
> It's no coincidence the flat-fee service is visibly crippled, while per-request API users are not reporting any difference. (edited)
FWIW, I am a per-request API user, and... it could be my imagination, but I've had a rather clear feeling GPT-4 got much more lazy all of a sudden, both on the OpenAI platform and on Azure, and it fits the timeframe of the recent complaints... so, n=1, make of it what you will.
It's no coincidence the "free" service is visibly crippled, while paid API users are not reporting any difference.
GPT-4 is not available to free users, which really calls the pretext of this comment into question. What is being alleged to have happened, and why is this response unreasonable?
I’m not a very heavy GPT-4 user but I do usually use it once a day or so - it doesn’t appear to have changed noticeably but I’m not paying close attention.
It was badly-worded on my part. And I stand corrected-- I just saw all the nested tweets where people are complaining about the API too!
So many people are noticing a degradation in speed and quality of responses, and not in isolation. Rather than acknowledge this, the question is rephrased to one the responder can rightfully deny-- and suggest no changes have taken place without explicitly saying as much. You normally only see this sort of sliminess from politicians and executives on the witness stand.
> Is anyone else noticing significantly downgraded GPT-4 capabilities today? Seems like OpenAI updated the model, and results aren’t as good as before. [mentions API in a child comment]
> The API does not just change without us telling you. The models are static there.
> This is good to know. That means GPT-4 has been static since March right? 0314?
> Correct
Never ask questions to which one word suffices as an answer.
Collective confusion in the thread suggests something has changed, but the most OpenAI will attest to is that the API is unchanged and the models are static. And this may well be true, but rather than admit "...but we were fucking with the middleware/parameters" they took a firm position on a strawman argument and ignored everybody who followed with more-direct questions. Except this guy:
> I've noticed inconsistency with certain prompts performance. Is that just the non-deterministic nature of the API?
> Yes
Oh, ok. It's because the fucking API is non-deterministic that code that has worked both reliably and predictably for everyone now runs like shit for everyone. For fuck's sake, you can get better answers from a Magic 8-Ball. This guy even made the mistake of presenting an answer he'd believe for the respondent to feed into. He might as well have asked if inconsistent performance was because of the war in Ukraine.
"Logan.GPT" must moonlight as a fortune teller. He's only responding to people foolish enough to ask the wrong questions.
ChatGPT-4 feels (is?) typically faster, and the results feel... above 3.5-turbo but below what 4 was. The API seems exactly the same to me, but you can't do an apples-to-apples test due to variance between replies.
I don't really doubt they scaled the ChatGPT-4 model down a little to try to save costs with the plugins and increased usage.
I haven't noticed any degradation or unwillingness to correct errors in the paid web version, but I have noticed an increase of prefixing most output with some variation of "it's a complex issue..."
Edit: Having recently gotten access to plugins and the browsing mode, I have noticed that the quality of the response is better when not using either.
Suppose OpenAI were an evil organization. What incentive do they have to lie about GPT-4 being static?
They don't have to be "evil". They have incentive to want to spend less on compute, but also to not let people know that they are reducing the compute per call.
Keep competitors in the dark, keep money flowing in, dismiss all the complaints and lawsuits that could come from their changes, dominate the ongoing discussions about AI regulation (if outsiders can't even tell whether the model is changing or not, who would listen to them?).
Like any for-profit: lower loss / higher gain?
> It's no coincidence the flat-fee service is visibly crippled, while per-request API users are not reporting any difference. (edited)
If the model (GPT-4) is unchanged, but they tweaked their system prompt for it on the chat interface, this is what you would see.
>You can literally see this reflected in the logs they forced retention of.
Poetry.
> Liars running a black box-- what could possibly go wrong? We need regulation!
"liars running a black box" is the definition of government regulation sheesh.
You're generally right but one of my nth ones was Brisbane to Cairns over the reef during the day. I'd fly it again and again.
One of my favorites is landing in Rio, the landscape is spectacular and then you get Christ the Redeemer as the cherry on top of the cake.
I feel this for a lot of things, but I love flying and I've probably done it dozens of times. I know people who travel for business and that sounds like it'd kill the magic, but I still love it. Earbuds + laptop! Isolated for hours! They bring food _to you_! You're miles up in the sky!
Whenever I’m flying over an ocean on a clear moonlit night it still very much feels like magic!
And now it's even gonna smell like bacon ...
The fat of dead pigs, cattle and chickens is being used to make greener jet fuel
This should always be the first explanation when you're speculating about something involving ChatGPT or Bing Sydney, if you aren't directly using the API and comparing temp=0 responses to identical canned prompts. There are lots of smaller models & filters and prompts involved, all of which are much easier to update than the behemoth underneath, and which do get updated frequently without notice. People jump the gun constantly on assuming it's the biggest model which changed.
What is your hypothesis for the layout/diagram of models feeding the gpt-4 output for chatgpt/gpt-4 vs api/gpt-4?
Which one? The response wasn't clear on this - they could've meant gpt-4, or gpt-4-0314, or both. gpt-4-0314 is the "pinned" model that is not supposed to change, so I can believe it didn't change. gpt-4 is supposed to receive updates, so... are they saying it didn't?
Note that API users using Bring Your Own Key tools/chat frontends probably default to gpt-4, and not the pinned gpt-4-0314.
Disappointing that GPT seems to be getting more Bard-ish, rather than the other way around. Bard would refuse the simplest of requests ('what is kanban' got an "As an LLM I cannot..." response on its first day), which was one of the main things that gave a bad first impression of it. Now, perhaps in an inevitable game of corporate cover-your-ass, OpenAI also seems to be going the same way.
Are there any 'complex tasks' you've had refused you can share?
That sounds like it could just be a cap applied to free users. That's hardly the same as a new model.
I subscribe and I use GPT-4
He specifically says the models are changing all the time in ChatGPT
https://twitter.com/OfficialLoganK/status/166447660465806951...
It has and so has the YouTube algo. You can try the verbatim setting with search to see that there has been some monkeying going on but even that doesn’t fix everything. Maybe that’s due to spam sites but it just seems less effective.
It has, and it's gone downhill a lot in the past 4-5 years. This is due to the need to keep an ever-increasing stream of income flowing to Google with a mostly static number of users, and "fixing" this issue by piling more ads into users' search results.
Then just use an Ad Blocker. The actual problem is SEO spam though.
> glean
I think the word you're thinking of is the noun "gleam" which means a kind of lustrous shine, rather than the verb "glean" which means to harvest the remainder of something or to collect in small parts.
It’s magic! But it’s also wrong a lot. These days if I ask anything important I’ll also have to Google the answer and I’ll also ask chatgpt if it’s sure about some of the details
The dev refers to the model used in API access.
There is no ChatGPT-4, just ChatGPT and GPT-4. The tweet is about GPT-4.
> just ChatGPT
There's obviously a ChatGPT-3.5 and a ChatGPT-4. Actually 4 different ChatGPT-4's: Normal, Browsing, Plugins, Code Interpreter.
Generative AI is cool, but ChatGPT is a clear incremental improvement over previous GPT releases, and is not the advent of AGI like many of its proponents here and elsewhere are saying. Hallucinations are an important indicator that the language model doesn't have anything like "understanding" and is thus no closer to AGI than earlier models.
It might be better for the corpus to be expanded with higher quality text that is produced on demand from contractors. There are also large pools of data yet to be accessed. One giant source would be podcast transcripts but there are many others.
> higher quality text that is produced on demand from contractors.
Who, in turn, will use LLMs to generate that text.
I think this would be cost-prohibitive, though maybe not. However:
it could just save every answer it's given and scan text for it. If there's a match, it could just not index it, right?
Doesn't work if more than 1 bot exists...
there's still the pre-2023 data to train it on ... and then augment with handpicked stuff.
Is it true that the safeguards are considered part of the model? I had assumed that the "safeguards" that limit certain types of responses in ChatGPT were separate from the actual language model.
My understanding is that most of the safeguards are inside the model, but there are some safeguards that are outside the model. In particular if you ask the API to generate copyrighted data it will, but then the connection will mysteriously break after the first few words which I assume is a separate system watching the responses.
It seems to me that there’s lots of room to change stuff that profoundly affects the range of responses without altering the base model. The prompt template alone seems like a place outside the model where we’ve seen safeguards get implemented, and other stuff that affects the usefulness of a model’s responses.
ChatGPT is not the same thing as the underlying model. ChatGPT is just a UI over the model. The tweet was about the model.
He said the API hasn't changed. But what about the Chat website?
Exactly my question... I wonder if the pre and post processing of the chat interaction isn't what is driving the perceived differences.
I wouldn't read too far into the tweet. They do tell us when it changed, and the bottom of the page clearly states : ChatGPT May 24 Version
The release notes are producty and not very technical so it is difficult to tell what actually changed.
The inhibitive ("moderation") model is separate from the generative ("chatbot") model. They work in tandem.
My immediate gut reaction to the original post about GPT-4 getting SUBSTANTIALLY worse was that...
It didn't. People were just noticing LLMs still have a long way to go after using them more.
I was shocked going through the thread that all of the popular comments were confirmations that it did, in fact, get MUCH worse.
It's nice to see from OpenAI that it didn't...
Aren't there archived transcripts with prompts now?
Seems like we need "model transparency" and log implementers to flag drift, a la RFC 9162 / Certificate Transparency.
I guess we could go back through our histories and resubmit the same prompts a few times. Wouldn’t be a fair comparison since we only have 1 sample from the old version.
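Resubmitting archived prompts only detects drift reliably against temperature-0 responses, but given those, a drift log is simple. A minimal sketch (the log structure and function names here are made up for illustration, not any existing tool): fingerprint each prompt/response pair and flag when a prompt later produces a different fingerprint, loosely in the spirit of the transparency-log idea mentioned above.

```python
import hashlib
import json

def fingerprint(prompt: str, response: str) -> str:
    """Stable hash of a (prompt, response) pair for drift logging."""
    payload = json.dumps({"prompt": prompt, "response": response}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def check_drift(log: dict, prompt: str, response: str) -> bool:
    """Return True if this prompt previously produced a different response."""
    fp = fingerprint(prompt, response)
    old = log.setdefault(prompt, fp)  # record on first sighting
    return old != fp

log = {}
assert check_drift(log, "2+2?", "4") is False  # first sighting, recorded
assert check_drift(log, "2+2?", "4") is False  # same answer, no drift
assert check_drift(log, "2+2?", "5") is True   # answer changed: flag it
```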
My understanding, is that ChatGPT... has a prompt that goes before your question (and previous answers) that set up the stage for how it should respond. It could be that the safeguards they have put in place, sit in this prefix prompt state... rather than GPT4... and that you would get the normal answers via the api rather than via ChatGPT.
The employee is saying the API version didn't change, but the Chat web UI did definitely change.
There's no way. I've been using it basically every day since it launched and it was only the first time ever I recently got a "that's too complex for me to go into detail, here's an overview" response.
The moderation endpoint is separate for the API so I’d imagine it’s the same for chat. The model could be the same while they change:
Moderation
Temperature / top-p
System prompt
Some other internal system we aren’t aware of
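To make the point above concrete, here is a toy illustration (nothing to do with OpenAI's actual pipeline) of how the wrapper, not the model, can change answers: a fixed stand-in "model" produces different outputs purely because the system prompt around the user's question changed.

```python
def toy_model(full_prompt: str) -> str:
    """Stand-in for a static model: deterministic in its full input."""
    if "Answer in one word." in full_prompt:
        return "Paris."
    return "The capital of France is Paris, a city of about two million people."

def chat(user_prompt: str, system_prompt: str) -> str:
    # The serving pipeline controls what the model actually sees.
    return toy_model(system_prompt + "\n" + user_prompt)

q = "What is the capital of France?"
long_answer = chat(q, "You are a helpful assistant.")
short_answer = chat(q, "You are a helpful assistant. Answer in one word.")
# Same "model", same user prompt, different wrapper -> different output.
assert long_answer != short_answer
```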
Right but people are complaining about their experiences with ChatGPT.
If you say, "ChatGPT has gotten noticeably worse" and they respond "nothing about our APIs have changed" then it would be reasonable to interpret that as them saying that nothing about ChatGPT has changed.
When in reality many things might have changed about ChatGPT.
The tweet is a response to an API user saying GPT-4 got worse, and the tweet is clearly talking about the API and the model - the context of the HN megathread mostly talking about ChatGPT isn't present in the tweet that's being replied to.
There's a lot of confusion around this, but nothing appears to be caused by doublespeak to me. Confusion about which GPT/ChatGPT anyone is reporting about has been pretty ubiquitous for a while.
He says "the models are static", which means that, if you keep using the same model, you will keep getting the same results.
Which is not what anyone else is referring to when they say that GPT-4 has gotten worse.
What about the API user he's directly replying to in the tweet, who asked if the model changed?
Then the HN title is grossly misleading.
Here's an example of a "technically true" but deceptive statement: he says that he has no financial interest in OpenAI. But there's no way someone could be the head of a huge, influential organization and not be able to use it to obtain power. For instance: at this point, Microsoft pretty much owns it (they have 100% access to the models and could cut it off at any time). Is it merely coincidence that Microsoft agreed to purchase power from the fusion company he backed?
Another example: I've noticed that a lot of times I'll get "network errors" while chatting with ChatGPT. However, this is while I'm SSH'd into a remote machine with zero latency. I think they realized that they could shift the blame from their server capacity issues to "network errors" because that made the consumer feel like it was their fault, not OpenAI's.
Another example: One of the developers of EleutherAI talked about how they tried to implement OpenAI's original model based on their paper. They were struggling to get it to work. So they actually talked to researchers who wrote the paper and discovered that a lot of what was in the paper wasn't what they did at all.
Worst of all, he is trying to create a crypto token. That would disqualify almost anyone in my book.
A recent bit is his push for AI regulation in the name of safety, but the real reason is probably to build a moat to protect his business.
Call for regulation then threaten to pull out of EU because of regulation?
ChatGPT isn’t just using plain GPT-4, it’s using a specific pre-prompt and possibly additional fine-tuning (which I’m not sure about) compared to the plain GPT-4, available through API.
Someone saying "ChatGPT4" is just using shorthand for the ChatGPT product based on GPT-4.
There is also 100% some regex-style filtering logic which kills the output if it matches certain criteria.
It kicks in for famous book openings (e.g. tale of two cities), but only in English, regardless of what prompt you use to request it - so not part of the model at all but rather a filter on the output.
No idea if it's used for other purposes.
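Nobody outside OpenAI knows what that filter actually looks like, but the behavior being described, where output dies mid-stream once it matches a pattern, is easy to sketch. The blocklist below is hypothetical and for illustration only:

```python
import re

# Hypothetical blocklist; the real criteria (if any exist) are unknown.
BLOCKED = [re.compile(r"it was the best of times", re.IGNORECASE)]

def stream_with_filter(tokens):
    """Yield tokens until the accumulated output matches a blocked pattern,
    then stop, mimicking a connection that 'mysteriously breaks'."""
    seen = ""
    for tok in tokens:
        seen += tok
        if any(p.search(seen) for p in BLOCKED):
            return  # cut the stream mid-response
        yield tok

out = list(stream_with_filter(["It was ", "the best ", "of times, ", "it was..."]))
assert out == ["It was ", "the best "]  # stream dies as the match completes
```

This would also explain why the cutoff is language-specific: a pattern list written in English never fires on the same passage in translation.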
Lots of evidence pointing in that direction. Suppression of logic (censorship) leaks into other areas of cognition
Yes, so just like Apple's real value is from Foxconn's manufacturing factories which they don't own, OpenAI's real value is from Sama, the Kenyan firm that did the RLHF annotating.
I'm wondering what the legal agreement is for Google Classrooms. In elementary schools across the world, kids write 3rd grade essays in history class or whatever and the teacher grades them. Those essays and grades are all in a database. Does Google have the opportunity to train Bard across that dataset?
i'm not in the AI world but i've been curious how the level of effort compares between inventing and writing the model vs finding, tagging, and curating the training data. Is creating the training data analogous to inventing a complete schooling curriculum from scratch? I bet that takes a long freaking time.
Much of GPT-4's training data was made by GPT-3.5. Info on the synthetic data in GPT-4's pretraining is starting to leak, including a reference to "Synthetic-Data(2)" in a recent OpenAI paper.
In the near future, if not already, practically all training data for state of the art models will be synthetic.
In the words of OpenAI CEO Sam Altman, they have "bootstrapped" and are "past the synthetic data generation event horizon."
Of course you can't really compare it to a real senior engineer, but it has a lot of traits that come close or even surpass seniors:
1. It picks very good variable names
2. Clean, nicely structured code
3. Vast amount of knowledge where even a senior engineer still needs to look things up (I don't know about you, but I have a terrible memory. ChatGPT clearly has excellent memory of APIs, regexes, etc.). It's the "I don't even have to look at the docs or StackOverflow for this" type of knowledge.
4. It's fast, like very fast, like "I don't even have to think about it and just type it out perfectly like a maniac".
I'd love it if junior engineers organized code as well as ChatGPT. I'd also love it if all engineers stopped mixing spaces and tabs.
That can mostly be fixed with conventional prettify tools. No modern AI required.
Add a linting step to your CI to catch this stuff?
Do you have a style guide?
ChatGPT, on spaces vs tabs: https://chat.openai.com/share/b2b0be49-54f9-4f73-a753-3edbd1...
It has depth as well. You just need to copy in the relevant docs.
It’s writing fault tolerant data scraping scripts for me I don’t know how to write myself.
That is why most developer jobs are safe. Tasks from stakeholders often contain only a title.
Rubber ducking in the age of WFH is definitely a use I've found for it.
i don't use it that much for code but when i do it's for a specific function or method. I describe the inputs, the logic i want, and the outputs i need. So basically i have it worked out in my head I just let chatgpt type it out for me. Any mistakes it makes are pretty easy to catch when using it this way.
"Language as a tool of thought" is a technology we often neglect we are equipped with.
This is essentially what rubber ducking is.
Asking it directly is a good way to get a response containing version numbers that look correct, but not a good way to actually indicate which versions it's been trained on.
A better method would be to look at the language features it's using and infer from there. Or, better still, look at which versions were out when the training data was collected (which I believe is September 2021, don't quote me on that).
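The feature-inference idea above can be sketched as a heuristic: scan generated code for syntax that only exists from a given release onward and take the highest requirement found. The marker list here is a tiny illustrative sample, not a complete or authoritative mapping:

```python
import re

# Illustrative markers only: a few syntax features paired with the
# Python release that introduced them.
FEATURE_MARKERS = [
    (r"\bmatch\s+\w+\s*:", (3, 10)),  # structural pattern matching
    (r":=",                (3, 8)),   # walrus operator
    (r"\bf(['\"])",        (3, 6)),   # f-strings
]

def min_python_version(code: str):
    """Lower bound on the Python version the code assumes."""
    best = (3, 0)
    for pattern, version in FEATURE_MARKERS:
        if re.search(pattern, code):
            best = max(best, version)
    return best

assert min_python_version("print(f'{x}')") == (3, 6)
assert min_python_version("if (n := len(s)) > 3: pass") == (3, 8)
```

This only ever gives a lower bound: a model trained on newer data can still emit old-style code, as the `var`-in-JavaScript comment below this points out.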
Is there a risk of it hallucinating what new features could do instead of basing information on having seen them before?
It does not actually mean much. These chatbots can easily output JavaScript using var where it absolutely should not. It being aware of the latest standards does not mean it can properly use the new syntax or features -- it emits a form of whatever crappy/legacy code it was trained on.
Oh that's interesting, I didn't think about asking it directly about what version it thinks is current. I wonder if it can hallucinate here?
It's really irksome when it tries to use functions which were renamed or removed. This could be detected automatically (and possibly remapped in some cases)
I got mostly the same versions as you, both on chatgpt3/4, using english.
If you ask it in other languages the minor version also seems to change often.
Finnish gives you dates, and code-davinci-edit-001 is 1 major release behind on almost everything.
Here is a list of the latest stable software versions:
Python: 3.9.5 (30.4.2021)
JavaScript: ECMAScript 2021 (23.3.2021)
ECMAScript: ECMAScript 2021 (23.3.2021)
Java: JDK 17 (28.9.2021)
C++: C++20 (20.2.2020)
C#: .NET 6 (8.11.2022)
Ruby: 3.0.2 (24.8.2021)
Swift: Swift 5.5 (20.9.2021)
Go: 1.17 (16.8.2021)
Rust: 1.54.0 (27.5.2021)
TypeScript: 4.4 (28.7.2021)
PHP: 8.0.9 (29.7.2021)
Kotlin: 1.5.31 (26.8.2021)
Scala: 2.13.6 (17.2.2021)
R: 4.1.0 (18.5.2021)
Perl: 5.34.0 (30.5.2021)
...
Sinatra: 2.1.0 (10.4.2021)
.NET Core: 6.0 (8.11.2022)
ASP.NET: 5.0.10 (19.8.2021)
All it is doing is responding with things that look like accurate languages and version numbers. They may be accurate but are absolutely not representative of what it has been trained against.
But this is just a plausible looking answer, it is still a text generator, this has no bearing on reality per se.
I just asked it "what's the output of python -v" and the little sample code it produced reported Python 3.9.2.
I got:
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
And a lot of junk text.
Yeah, maybe it is "wearing a thong", since its developers were so concerned about social acceptability.
I would say though that it is no mean feat for the Emperor to get away with being naked in public, it takes power and the ability to wield it.
Similarly ChatGPT has a few competences that add up to people perceiving it is able, one of which is the ability to come up with plausible and satisfying answers whether they are right or wrong (often using shortcuts) and another one is getting people to engage with these answers.
>It is the second most intelligent thing on the planet at creating text, even with its flaws.
This also puts it in last place.
By this, do you mean 'the second most intelligent thing' after humans?
A dog understands body language and subtle cues, participates in social hierarchy and community, has self-reflection, dreams, etc.
> you have to admit the emperor is at least wearing a thong
And a lot of people are getting turned on by it.
"More often than not" is compatible with "correct 70% to 80% of the time".
Your previous comment was a bit confusingly worded. I, and presumably the commenter you just replied to, read it as "it struggles more than 70-80% of the time", rather than your intended meaning: that it struggles to do better than 70-80% correct.
Just ignore this fella, GPT-4 is massively useful. But it takes intelligence to get intelligence out of it.
I think this is our path: become good at working with AI, for jobs, or, if there are no jobs, to support ourselves directly, assuming we will be able to use tools and AI to build things and create our own means of living.
The tweet is an OpenAI employee who is responding to people who have fallen out of love.
I've seen transcripts of people interacting with ChatGPT who were obviously seduced by it and in a very giddy state, having so much fun because ChatGPT was playing an extended "game" with them that it didn't bother them at all that ChatGPT was spouting wrong answers.
A major complaint I've had about the social sphere is that I seem to get the same result if I am 20% right or 50% right or 80% right or 95% right or 99.8%, it is just exhausting and I can never be good enough and I'm frankly envious that people see more of a glimmer of light behind that thing's "eyes" than they do behind mine.
The core thing about neurotypicality isn't so much that they get the wrong answers but that they get the same answer whether it is right or wrong. For a long time I thought the basis of the "language instinct" is a derangement about reasoning with uncertainty that causes the grammar representation to collapse into a low-dimensioned subspace which is learnable with a limited amount of data. I wouldn't be surprised at all if other animals could beat us at rock-scissors-paper or poker if they could understand the rules of the game. The success of LLMs might give us some insight in this area although they are working with so much more data that Chomsky's old "poverty of the stimulus" argument might not apply.
It's more that the balance of usefulness vs. waste of time shifts in the other direction after a few rounds of spending a long time trying to get it right and then just doing it yourself anyway.
I've spent 20 years explaining my code to a rubber duck and getting great use out of that duck. No one will ever convince me that it's less useful now when the duck actually has read the Internet and has useful suggestions and ideas back to share. Even if it makes shit up on an occasion.
Maybe the reverse is true? To make it faster, they dumb down the GPT-4 that serves the ChatGPT UI (quantization?), while the API is full power, albeit slow.
I think it’s just prioritising ChatGPT.
The reason that's my guess is that the API is occasionally much faster and then goes slow again. It's a bit all over the place, which leads me to suspect it's demand-based.
It cuts off in September 2021, supposedly - that's why I pointed out the month of the relevant sqlite release.
Think of GPT like a document simulator. If 90% of documents on the web talk about an old version of a library, then GPT-4 is likely to use that old version.
When I asked it write me a Chrome extension, it used Manifest V2, which was already slated for deprecation. When I asked for a V3 extension specifically, it was happy to comply. So even if a new release came out before Sep 2021, if it hasn't had time to become the dominant version in code examples around the web, GPT-4 may still use older versions.
Maybe pay more attention to my comment:
> super major
Also, it sounds like you have no idea that feeding the API back what it has written improves it tenfold for complex tasks.
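For what it's worth, that feed-it-back pattern is simple to wire up. A minimal sketch of the loop; call_model is a stand-in for an actual chat-completion request, and the critique prompt is just one possible phrasing:

```python
# Self-refinement loop: show the model its own draft and ask it to
# review and rewrite. call_model(messages) -> str is a placeholder for
# a real API call; here it is stubbed so the loop logic is runnable.
def refine(task, call_model, rounds=2):
    draft = call_model([{"role": "user", "content": task}])
    for _ in range(rounds):
        critique_prompt = [
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user",
             "content": "Review your answer above for bugs and rewrite it."},
        ]
        draft = call_model(critique_prompt)
    return draft
```

Each round roughly doubles the token cost, which is part of why this is practical against the API but not something the flat-fee UI is likely to do for you.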
Enjoy your 90 second requests though.
by preventing it from outputting one of the most famous sentences in English literature? well done lads
Hilariously, if you throw s/worst/blurst/ onto your request, it can output the forbidden Dickens n-gram. Does this constitute a jailbreak? Or is the entire thing utterly emblematic of the modern technolegal mess, since Dickens is squarely and quintessentially in the Public Domain?
Has OpenAI themselves announced, or has someone objectively and verifiably proved, that the UI and the API invoke two different models?
Not directly, but the model version displayed in the UI changes every week or so while the api version has apparently been stable.
It still takes a surprising amount of labour and artistry to get an AI to give you exactly what you want (see also: Pareto principle). Consumer expectations scale proportionally to technological progress, so demand for premium assets and brand differentiators won’t be going away anytime soon; the state of the art has advanced but stock photo business will advance right along with it.
I'd say AI art lets me get to a point where I can justify paying someone. It's hard to understand a game without decent visuals, even if they're filled with AI turds.
Thanks!
I went through a bunch of old prompts and the response was definitely worse now than it was then.
Since it's not deterministic though, it's hard to draw conclusions, especially since the sample size was 10ish.
You're right of course that some randomness is inherent, but you can adjust the "temperature".
With a low enough temperature you get essentially the same output every time, with at most a few minor words swapped.
That's not really an issue with AI in general; you can make it deterministic by tuning some parameters if you want (temperature in ChatGPT, passing a seed in Midjourney, etc.).
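For intuition on why low temperature makes output near-deterministic: temperature rescales the logits before sampling, and as it approaches zero the distribution collapses onto the single most likely token. A toy sampler (not OpenAI's implementation) shows the effect:

```python
import math
import random

def sample(logits, temperature, rng):
    """Sample a token index; temperature ~0 degenerates to argmax."""
    if temperature < 1e-6:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(logits) - 1
```

At temperature 0 the same prompt always yields the same next token; at high temperature the lower-probability tokens get sampled too, which is the "minor words swapped" variability.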
This sounds exactly like what is happening now that you mention it.
You're supposed to click "new conversation" when changing topics.
> Should I start a new conversation with you for a new topic or can you switch to a new topic in the same conversation just fine?
> You can continue within the same conversation for a new topic. There's no need to start a new conversation. Feel free to ask about a different topic, and I'll do my best to assist you.
Just like with a real general intelligence!
Hm. Maybe they've backported some nanny code from Sydney?
AFAIK it's pretty standard practice not to expose the "raw" LLM directly to the user. You need a "sanity loop" where user input and the output of the LLM is checked by another LLM to actually enforce rules and mitigate prompt injections, etc.
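A sanity loop of that shape can be sketched in a few lines; model and check below are stand-ins for the real completion and moderation calls, and the refusal message is made up:

```python
# The raw LLM never talks to the user directly: a separate checker
# screens both the incoming prompt and the outgoing reply.
def guarded_chat(user_input, model, check, refusal="I can't help with that."):
    if not check(user_input):   # screen the prompt (prompt-injection, policy)
        return refusal
    reply = model(user_input)
    if not check(reply):        # screen the output too
        return refusal
    return reply
```

This also explains the "same model, different behavior" reports: tightening check changes what users see without touching the underlying model at all.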
> Collective confusion in the thread suggests something has changed, but the most OpenAI will attest to is that the API is unchanged and the models are static. And this may well be true
Here's the thing. Re-read the exchange you quoted:
>> The API does not just change without us telling you. The models are static there.
>> This is good to know. That means GPT-4 has been static since March right? 0314?
>> Correct
Does that "Correct" mean all GPT-4 models have been static since March, or does it cover only gpt-4-0314, which is a single, specific model? gpt-4-0314 is the static snapshot model, hence the 0314 in the name; it exists as a stable base, while gpt-4 was intended to be updated over time.
So I feel OpenAI may be dodging here. That "Correct" may just mean "the gpt-4-0314 model was not updated since March 14", which is, like, the very reason this model exists in the first place.
Important point: if you're using API pay-as-you-go access, unless you wrote the very tools you use with that API, you're most likely using gpt-4, and not gpt-4-0314.
> if you're using API pay-as-you-go access, unless you wrote the very tools you use with that API, you're most likely using gpt-4, and not gpt-4-0314.
Do you know of any alternative ChatGPT UIs like chatbotui.com or typingmind.com that let you specify the model gpt-4-0314 ?
This feature alone would be enough for me to switch.
And anyway, my program used the API with a temperature of zero and it still started acting erratically. They are not telling it straight.
Yeah I’m having the same experience. The replies have gotten faster in the last few days, at the expense of quality.
It's noticeable because I used to be able to read each word as it was printed from the GPT-4 output, but now it goes far too fast for me to keep up with.
Kinda sucks to not have transparency at all into the black box, they’ve strayed so far from “open” at this point that it’s comedic
Components which are likely there and which could be affecting quality: the invisible-to-the-user 'system prompt' (probably with few-shot examples), retrieval from history (not necessarily exclusively your own, as a way to augment few-shot), cascaded models with a very cheap 'turbo' model to try to answer first, possibly regular finetunes of the main chat model (note that OP only specifies the API model hasn't changed, but the live chat models seem to change frequently), and a post-generation filtering model trying to reject offensive outputs.
We know MS Bing Sydney is doing at least 3 of those (prompt, cascade with Megatron, and finetuned rejection classifier for post-output filtering) on top of its GPT-4-finetune, so it's not a stretch to figure that OA is doing similar things.
That’s pretty easy to avoid if you have them work in an office on company provided machines on a network that blocks access to LLM sites. Not exactly rocket science here.
The dataset of internet content from before 2022 will be regarded in the future just like low-background steel. Any content generated after the release of the first generally accessible LLMs will be considered radioactive.
I'm of the same mind, and I believe they're keeping their ear to the ground to address any jailbreaks and loopholes. I have tricked GPT-4 into spitting out text it shouldn't have, only to have it dance around the same prompts less than 24 hours later. This "the model is the same" response seems like a deliberate deflection meant to mask the mechanical turk that ChatGPT is becoming.
ChatGPT is also the fine tuning and prompting of the model. It’s a distinct set of weights from “raw” GPT-4/etc, it’s just not a foundational model.
No, ChatGPT isn't a model. gpt-4 is a model, gpt-3.5-turbo is a model, text-davinci-003 is a model. ChatGPT is a user interface.
It has a very basic prompt on top of the existing models. There is no additional fine tuning involved.
I could not figure out what everyone was talking about yesterday, as my experience with GPT-4 has not degraded at all, but I'm using a third-party client via the API.
I tried comparing the API response against the chat on the same question a few times. There isn’t a huge difference but I’d pick the responses from the API over the ones from chat. Hard to say though, could be RNG.
Which 3rd party client, if you don't mind me asking? I'm looking to move, and the space isn't mature enough yet for there to be a clear leader.
That is not what that means
What made you think that the person he replied to meant to ask about the GPT-4 model/API instead of ChatGPT-on-GPT-4? The latter is what a lot of people refer to when they say "GPT-4." People do not always communicate with infinite precision.
The tweet specifically refers to the model "Seems like OpenAI updated the model" and if you look at their profile you'll see "CEO @HyperWriteAI, @OthersideAI - I make AIs do the impossible." as well as a pinned tweet announcing that they're building a personal assistant/agent on top of this stuff.
I didn't check this part but the OpenAI employee very likely follows them and knows all this context.
That is the moderation API, which you can actually block from the front-end (with javascript). Look up "ChatGPT DeMod."
Are you sure? I ran it by the moderation endpoint and both the partial output "It was the best of times," and the full thing give me e-06 or lower. This also doesn't give you any notification like other filters do.
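For reference, the moderation endpoint returns per-category scores between 0 and 1, so a score on the order of 1e-06 means "almost certainly fine". A toy gate over such scores; the dict and threshold here are illustrative stand-ins for the real response's category_scores and whatever cutoff OpenAI actually uses:

```python
# Return the categories whose score crosses the threshold; an empty
# list means the text would not be flagged at this cutoff.
def flagged(category_scores, threshold=0.5):
    return [c for c, s in sorted(category_scores.items()) if s >= threshold]
```

Scores like e-06 are many orders of magnitude below any plausible cutoff, which supports the point that the Dickens block isn't coming from the moderation endpoint.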
I always forget how to see this stuff in the network tab of devtools, since it's websocket-obfuscated or some such (befitting OpenAI), but I have the feeling they're just killing the process directly on their end.
Maybe I could get chatgpt to write those pipeline steps for me. :)
You can and you should. It’s this kind of busywork I farm out to GPT these days and it’s great at it. I suck at it, so it saves me an hour to focus on what I’m good at!
Well GPT 4 consistently defaults to returning 2 space indents, so clearly it has an opinion on the matter.
Yes, good point, I suppose we would have to weigh the probability of that against it fudging a number. I would assume inventing a new feature is harder, but who's to say; all this LLM stuff comes down to probability.
Ha! So far it seems four posters here have asked similar questions and got four different python versions.
I don't know if we can treat the gpts in this way and expect reliable answers. We just get pretty good answers right up to the point that we don't.
Agreed, I think this is where its "stochastic parrot" nature shines through. All those version numbers are probably equally well represented in internet text.
This is our path yes. All paths end, eventually, but oh well.
I don't know. The value of rubber-ducking is that you have to break the problem down into very simple terms to explain it. It's the explaining it part that is magic. That value goes away (or is greatly diminished) if the rubber duck responds with anything other than a request for further explanation.
You have to break down a problem to GPT just like you would to a human, and if it misunderstands you or asks you, you'd have to elaborate just the same. It's modeled after us, after all. The rubber duck is a stand-in for a human. GPT also is. But just a vastly better one.
My point (which may not have been clear) was that when I explicitly told it that the versions of SQLite I'm using supports "returning," it still returned incorrect code. It did not use "returning", and repeatedly, even after being corrected, grabbed the row count instead of the column value.
I understand that it's a language model - my point was that in my experience, the depth of training data just isn't there for alternative implementations. It can do the things you ask one way, maybe two or three, but if you know how you actually want something done you're probably better off writing it yourself (for now).
GPTs are not trained on themselves, and what little knowledge they have of their own capabilities is limited to the discussion of it in their training set. This is a hallucination.
I might believe you if we were 'talking' directly to the model, however this is a chat box and clearly something is aware of usage questions and how to answer them. If you've been granted plugin access, enable some and ask it what plugins are enabled and how to use them, 'it' will tell you.
Well, if you were talking to me about something and, midway, with no warning or other signal, changed topics, I would also get confused.
With humans there is usually a non-verbal signal letting the other person know you changed topics.
> if you were talking to me about something, and midway with no warning or other signal
This happens all the time in a social group setting though. Not much confusion ensues.
Not if you're my wife.
Sometimes I spend several minutes of a conversation trying to figure out which thing she's talking about, because we've already covered like twenty different topics in the last ten minutes and she seems to just switch around at random and with no warning.
Sometimes I get enough context that I can make that swap or stack pop, but many times not.
Could be related to the system prompt.
We've been using https://github.com/cogentapps/chat-with-gpt for a while, and overall happy with it, but its development is mostly dead.
Of course.
The part I was responding to was this:
> when the duck actually has read the Internet and has useful suggestions and ideas back to share
I think that if it does that, it's legitimately less useful as a rubber duck. I'm not saying it isn't useful -- it's just not useful as a rubber duck anymore.
Similarly, but not quite: I often find browsing new submissions on HN more useful than simply going through the front page. Just because, uh, the front page is decided by statistics, that is, by the community's neurotypicals.
It's clear that it's aware that it's a chat bot, it's not clear at all that it's aware of usage questions and how to answer them. Its answer being wrong seems to be evidence that it isn't.
I think wives believe their husbands can read their mind, and they get annoyed when that doesn't happen :)
But yah, this is exactly what I mean - without some cue, neither humans, nor machines, can tell when you switch conversation.