The future of generative AI lies in open source

The open source ecosystem has long been the backbone of the global technology industry, and in the age of generative AI the situation is no different. Some of the most impressive models available are open source, such as those from Mistral AI and Meta’s Llama family.

With the AI industry growing at an astounding pace, the open source community is well placed to contribute to - and guide - whatever this next generation of technology brings.

Speaking at a press briefing at KubeCon 2024, Jim Zemlin, executive director of the Linux Foundation, touted the wide range of areas in which open source may be able to assist in the development of AI.

“It might be easier to think about the goal of open source more broadly in generative AI by looking at it from a full stack,” Zemlin said.

Zemlin worked his way through the relationship between open source and generative AI from the CPU level up to the data level, pointing out the notable headway open source is making along the way.

On the baseline compute level, the Linux Foundation has created the Unified Acceleration Foundation (UXL) to look more closely at the role of open source in GPU and accelerator programming, while Zemlin also pointed to forward momentum for open source at the foundation model level.

Perhaps most notably, Zemlin said he believes that open source might be the answer to some of AI’s most pertinent problems, such as hallucinations, security risks, and distinguishing between real and AI-generated content.

“Sometimes the answer to problems in tech is more tech, and a lot of people are skeptical of that,” Zemlin said. “But in this case, I think it's true.”

“If you look at some of the things around large language models, around AI safety and security, I think this is an area where we're seeing some good starts,” Zemlin said.

Speaking to the specific areas in which the open source community could help develop tools to track these problems, Zemlin said projects are already underway to assist developers with ‘unlearning’ in a bid to fine-tune AI models.

“We're already seeing some of these tools in our Linux Foundation AI & Data project,” he added.

Zemlin also drew attention to open source’s commitment to the Coalition for Content Provenance and Authenticity (C2PA), a project that builds on the efforts of the Content Authenticity Initiative (CAI) to establish a framework for identifying AI-generated content.

The open source ecosystem can do more to support AI development

Zemlin cautioned, however, that the open source community could be more proactive with regard to AI development. The ecosystem should be more vocal about the role it can play in underpinning safe and responsible development, he suggested.

“[There’s] a real opportunity for open source to do more,” he said.

Speaking to ITPro at the conference, Oleksandr Matvitskyy, senior director analyst at Gartner, echoed Zemlin’s comments regarding the role of open source in the future of generative AI development.

Closer collaboration with the ecosystem and ensuring open source development is prioritized should be a key focus for enterprises, regulators, and governments alike moving forward, Matvitskyy said.

“I think anything can be done with open source,” he told ITPro.

“I think [it] has to be every government's, every regulator's priority to make sure that AI remains open source,” Matvitskyy added.

Prevalence of proprietary data could hamper progress

Roadblocks stand in the way of open source AI development, however, particularly with regard to training. The last 18 months have been fraught with instances of hallucinations and security issues.

Matvitskyy pointed out that these issues are particularly visible in the operation of AI models.

“They still hallucinate in their outputs,” Matvitskyy said, “they have no data to learn on - everything is private, everything is protected.”

Companies often hoard their data, limiting the amount of open data available for AI training. Fundamentally, that data is the only way models will develop beyond their current level of complexity.

Matvitskyy said that around 60% of the data companies are holding on to is probably “not really important” and could be released into the public domain for the training of AI models.

“They should be open and the companies should get money … for innovation, for what they actually do, not for what they created thirty years ago,” Matvitskyy said.

George Fitzmaurice
Staff Writer

George Fitzmaurice is a staff writer at ITPro, ChannelPro, and CloudPro, with a particular interest in AI regulation, data legislation, and market development. After graduating from the University of Oxford with a degree in English Language and Literature, he undertook an internship at the New Statesman before starting at ITPro. Outside of the office, George is both an aspiring musician and an avid reader.