OpenAI’s deal with Stack Overflow just landed it a goldmine of developer data — here's why it'll transform LLM development
OpenAI's large language models will now be trained using millions of questions drawn from the developer knowledge site
OpenAI and Stack Overflow have announced a deal that will see the AI company use data from the developer website to train its large language models (LLMs).
Stack Overflow is best known for its question-and-answer content, which allows developers to ask for help with the particular technical problems they are facing and (hopefully) get a response from someone who has already dealt with something similar.
Using Stack Overflow’s OverflowAPI, OpenAI will gain access to the technical questions and answers contributed to Stack Overflow by millions of developers over the last 15 years, which it will use to improve the answers given by its generative AI tools.
OverflowAPI is a subscription-based API service that gives AI companies access to Stack Overflow’s public dataset to train and fine-tune their LLMs.
The companies said the deal will help OpenAI improve its AI models and will also see attribution given to Stack Overflow community information within ChatGPT.
For its part, Stack Overflow said it will use OpenAI models as part of its development of OverflowAI, its own AI tool, and work with OpenAI to “leverage insights from internal testing” to maximize the performance of OpenAI models.
“The developer community is particularly important to both of us. Our deep partnership with Stack Overflow will help us enhance the user and developer experience on both our platforms,” said Brad Lightcap, COO at OpenAI.
The first set of new integrations between Stack Overflow and OpenAI will be available in the first half of 2024, the companies said.
Why is OpenAI interested in Stack Overflow?
Developer tools have been one of the areas where the relentless hype around generative AI has been at least partly justified, with these tools making coders more efficient through programming suggestions and even by writing code snippets. This is potentially a challenge to something like Stack Overflow.
However, the key to the effectiveness of these tools and others is the data they are trained on, so big AI companies are making sure they have continued access to the best sources of data.
Stack Overflow’s own research suggests that only 42% of developers trust the accuracy of AI tools. Back in 2022, Stack Overflow banned the use of generative AI tools for posting answers on the site.
“Overall, because the average rate of getting correct answers from ChatGPT and other generative AI technologies is too low, the posting of content created by ChatGPT and other generative AI technologies is substantially harmful to the site and to users who are asking questions and looking for correct answers,” it said at the time.
Stack Overflow’s own dataset contains more than 58 million human-generated questions and answers covering everything from coding and debugging to explaining, testing, reviewing, and brainstorming. It also includes feedback signals from users and moderators indicating how useful an answer is, all of which is super-useful for tweaking an LLM.
Stack Overflow said all products which are based on models that use its data are required to provide attribution back to the “highest relevance posts” that influenced the summary given by the model.
“With the lack of trust being felt in AI-generated content, it is critical to give credit to the author/subject matter expert and the larger community who created and curated the content being shared by an LLM,” it said.
This isn’t the first big AI deal for Stack Overflow
Stack Overflow signed a similar deal with Google Cloud back in February, which involved Stack Overflow adopting AI technology from Google Cloud, and Google Cloud integrating Stack Overflow datasets into its AI tools.
As part of the deal, Gemini for Google Cloud will provide developers with suggestions, code, and answers from Stack Overflow, also using the OverflowAPI.
Developers using Gemini for Google Cloud will be able to access Stack Overflow directly from the Google Cloud console, so they can ask questions and get answers from the Stack Overflow community in the same environment where they manage their Google Cloud applications and infrastructure.
Stack Overflow said it would use Google Cloud as its platform of choice for hosting its public-facing developer knowledge platform, and would use Google Cloud AI capabilities to improve community engagement experiences and content curation processes.
“The use of Google Cloud AI technology is expected to result in an accelerated content approval process and further optimized forum engagement experiences for Stack Overflow users,” the company said.
The first set of new integrations and capabilities between Stack Overflow and Gemini for Google Cloud is due to be available in the first half of this year.
Why the OpenAI deal matters
It’s another sign of how the generative AI market is evolving. While the first stage was building huge foundation models that could wow people with their ability to answer general knowledge questions and make striking, if often slightly disturbing, art, the challenge now is to make these tools more useful.
That involves training the underlying models on better data so that these tools can return useful information and limit the risks of error or generative AI hallucination.
Having a big LLM might be important, but having the best training data is now vital, too.
Steve Ranger is an award-winning reporter and editor who writes about technology and business. Previously he was the editorial director at ZDNET and the editor of silicon.com.