
AI needs cultural policies, not just regulation




The future of Artificial Intelligence (AI) will not be secured by regulation alone. To ensure safe and trustworthy AI for all, we must balance regulation with policies that promote high-quality data as a public good. This approach is crucial for fostering transparency, creating a level playing field, and building public trust. Only by ensuring fair and wide access to data can we realise AI’s full potential and distribute its benefits equitably.

Data are the lifeblood of AI. In this regard, the laws of neural scaling are simple: the more, the better. The more volume and diversity of human-generated text is available for unsupervised learning, for example, the better the performance of Large Language Models (LLMs) will be. Alongside computing power and algorithmic innovations, data arguably are the most important driver of progress in the field.
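The intuition behind these scaling laws can be sketched as a power-law fit in dataset size. The toy function below uses hypothetical constants chosen purely for illustration (the functional form echoes published scaling-law studies, but these numbers are not fitted to any real model):

```python
def expected_loss(tokens: float, l_irreducible: float = 1.7,
                  b: float = 2.6e3, beta: float = 0.34) -> float:
    """Toy scaling curve: loss = L_inf + B / D**beta.

    All constants here are illustrative assumptions, not fitted values;
    the point is the shape: more data, lower loss, diminishing returns.
    """
    return l_irreducible + b * tokens ** -beta

# A model trained on trillions of tokens sits far lower on the curve
# than one trained on billions, but the gains flatten out.
small_data = expected_loss(1e9)
large_data = expected_loss(15e12)
assert large_data < small_data
```

The diminishing-returns shape is exactly why the field’s appetite for fresh text keeps growing: each further improvement demands disproportionately more data.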

A data race at the expense of ethics

But there is a problem. Humans do not produce enough digital content to feed these ever-growing beasts. Current training datasets are already huge: Meta’s LLama 3, for example, is trained on 15 trillion tokens, equivalent to over 10 times the British Library’s book collection. According to a recent study, the demand for pristine text is such that we might reach something akin to ‘peak data’ before 2030. Other papers caution against the dangers of public data contamination by LLMs themselves, causing feedback loops that amplify biases and deplete diversity.

Fears of an ‘AI winter’ reflect the relentless race for data in which researchers and industry players are engaged, sometimes at the expense of quality and ethics. A prime example is ‘Books3’, a trove of pirated books widely believed to have been used to train leading LLMs. Whether such practices fall under fair-use policy is a debate for lawyers. What is more disturbing is that these books are being hoarded without any clear guiding principle.

Although progress is being made, notably thanks to regulation, LLMs are still largely trained on an inscrutable morass of licensed content, ‘publicly available data’, and ‘social media interactions’. Studies show that these data reflect, and sometimes even exacerbate, the current distortions of our cyberspace: an overwhelmingly anglophone and presentist world.

The absence of primary sources

The notion that LLMs are trained on a universal compendium of human knowledge is a fanciful delusion. Current LLMs are far from the universal library envisioned by the likes of Leibniz and Borges. While stashes of stolen scriptures like ‘Books3’ may include some scholarly works, these are largely secondary sources written in English: commentaries that merely skim the surface of human culture. Conspicuously absent are the primary sources and their myriad tongues: the archival documents, oral traditions, forgotten tomes in public depositories, inscriptions etched in stone — the very raw materials of our cultural heritage.

These documents represent an untapped reservoir of linguistic data. Consider Italy. The State Archives of this nation alone hold no fewer than 1,500 linear kilometres of shelved documents — excluding the vast holdings of the Vatican. Estimating the total volume of tokens that could be derived from this heritage is difficult. However, if we include the hundreds of archives spread across the five continents, it is reasonable to believe that they could reach, if not surpass, the magnitude of data currently used to train LLMs.
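A rough back-of-envelope calculation suggests why this claim is plausible. Every constant below is an assumption for illustration, not a measured figure: shelving density and tokens per page vary enormously across archives.

```python
# Order-of-magnitude estimate of tokens in Italy's State Archives.
# All constants are assumptions, not measurements.
SHELF_KM = 1_500           # linear kilometres of shelved documents (source text)
PAGES_PER_METRE = 5_000    # assumed density of archival shelving
TOKENS_PER_PAGE = 400      # assumed average yield per manuscript page

shelf_metres = SHELF_KM * 1_000
total_pages = shelf_metres * PAGES_PER_METRE
total_tokens = total_pages * TOKENS_PER_PAGE

# 1,500 km -> 1.5e6 m -> 7.5e9 pages -> 3e12 (3 trillion) tokens
print(f"{total_tokens:.1e} tokens")
```

Under these assumed densities, a single country’s state archives alone would yield a corpus on the order of trillions of tokens, the same magnitude as the 15 trillion tokens reportedly used to train LLama 3, before counting the rest of the world’s holdings.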

If harnessed, these data would not only enrich AI’s understanding of humanity’s cultural wealth but also make it more accessible to the world. They could revolutionise our understanding of history, while safeguarding the world’s cultural heritage from negligence, war, and climate change. They also promise significant economic benefits. As well as helping neural networks scale up, their release into the public domain would mean that smaller companies, startups, and the open-source AI community could use those large pools of free and transparent data to develop their own applications, levelling the playing field against Big Tech while fostering innovation on a global scale.

Examples from Italy and Canada

Advances in the digital humanities, notably thanks to AI, have drastically reduced the cost of digitisation, enabling us to extract text from printed and manuscript documents with unprecedented accuracy and speed. Italy recognised this potential, earmarking €500 million of its ‘Next Generation EU’ package for the ‘Digital Library’ project. Unfortunately, this ambitious initiative, aimed at making Italy’s rich heritage accessible as open data, has since been deprioritised and restructured. Shortsightedness prevailed.

Canada’s Official Languages Act offers an instructive lesson in this regard. Long derided as wasteful, this policy requiring bilingual institutions eventually yielded one of the most valuable datasets for training translation software.

However, recent debates about adopting regional languages in the Spanish Cortes and European Union institutions have overlooked this key point. Even advocates have failed to present the digitisation of low-resource languages as a complementary goal, one with cultural, economic, and technological benefits of its own.

As we accelerate the digital transition, we must not overlook the immense potential of our world’s cultural heritage. Its digitisation is key to preserving history, democratising knowledge, and unleashing truly inclusive AI innovation.

Clément Godbarge is Lecturer in Digital Humanities, School of Modern Languages, University of St Andrews, U.K.


