apologize or else: July 2024

Monday, July 15, 2024

Unsustainable? Inconceivable!

It seems that people are just starting to cotton on to the fact that the new wave of giant AI is not very sustainable. Mostly they look at the cost (communications, storage, computation, electricity, water) of training.

But there are a couple of other costs being ignored:

1. it has taken 32 years (give or take) of the WWW to get to where we have all the material available today, including millions of websites, blogs, open access scientific and other academic materials, Wikipedia and so on, as well as huge numbers of photos, songs etc. - this represents a massive investment by hundreds of millions of people over more than a generation.

2. really useful data out there has been curated (a.k.a. wrangled) so that it doesn't have too many lacunae or errors, and may be statistically representative - it may also be accompanied by metadata (describing its meaning, but perhaps also labelling features in the data with meaningful tags - especially useful, for example, in medical images or satellite images of earth, but also just simple stuff like names of people in pictures, and GPS/location data of a photo or movie). This took not just time and effort, but also expertise - humans spent a while using their knowledge, and possibly skills, to add that extra information.

Of course, a special class of data is code - and open source repositories have a lot of that, associated with metadata ("documentation") and labelled (e.g. with commit logs describing bug fixes or features added, by whom, and when).

While the people who made it may offer all this data for "free", using it to train an AI is being undertaken lightly, as if it were the same as using an image or a blog or a piece of music for entertainment or education.

By absorbing this mass of material into a model, what is really being done is absorbing the prior information that gives more than a slight hint about the model that was in the minds of the users who created the original content. That is to say, their labour is being appropriated, not just the fruits of their labour.

So if you want another Common Crawl's worth of data, be aware that a lot of people would quite like to be paid for their effort next time around. And can you afford a payroll with 100M expert employees working for 32 years? Really?
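
For a rough sense of scale, here is a back-of-envelope sketch of that payroll. The headcount and duration are the figures quoted above, taken at face value (as if everyone worked full time for the whole period). The annual cost per person is purely a hypothetical placeholder, not a number from any source, so swap in whatever you think is fair.

```python
# Back-of-envelope payroll sketch for the figures quoted in the post.
CONTRIBUTORS = 100_000_000  # "100M expert employees" (from the post)
YEARS = 32                  # "32 years" (from the post)

# HYPOTHETICAL assumption: annual cost per contributor in US dollars.
# This value is illustrative only and does not come from the post.
ASSUMED_ANNUAL_COST_USD = 50_000


def person_years(contributors: int, years: int) -> int:
    """Total person-years if every contributor worked the full period."""
    return contributors * years


def payroll_usd(contributors: int, years: int, annual_cost_usd: int) -> int:
    """Total payroll: person-years times an assumed annual cost per person."""
    return person_years(contributors, years) * annual_cost_usd


if __name__ == "__main__":
    total_years = person_years(CONTRIBUTORS, YEARS)
    total_cost = payroll_usd(CONTRIBUTORS, YEARS, ASSUMED_ANNUAL_COST_USD)
    print(f"person-years: {total_years:,}")      # 3,200,000,000
    print(f"payroll (USD): {total_cost:,}")      # 160,000,000,000,000
```

Under that placeholder rate, 3.2 billion person-years comes to 160 trillion dollars. Halve or quarter the assumptions if you like - the answer stays in the trillions.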