Who will keep the garbage out of the large language models?

On Senatorial tweets, YouTube captions, and what's soon to happen when audio and video transcripts arrive in force

Mar 28, 2023

Even well-informed people with lots of information resources at their disposal can get things wrong about technology. Senator Chris Murphy of Connecticut, for instance, saw some of the outputs from large language models and declared, "ChatGPT taught itself to do advanced chemistry. It wasn't built into the model. Nobody programmed it to learn complicated chemistry. It decided to teach itself, then made its knowledge available to anyone who asked. Something is coming. We aren't ready."

■ Applause for thinking about the implications of artificial intelligence. But jeers for deeply misunderstanding the technology: It is really important that we approach AI carefully, especially since it will require, to one extent or another, thoughtful and well-informed regulation -- by people like United States Senators.

■ But artificial intelligence systems (like ChatGPT) are not sentient. Get that part wrong, and there's very little hope of getting the rest right. They are predictive models based upon the information supplied to them as inputs. Much of that information is obtained from the Internet, where lots of useful scientific and technical information can be found.

■ Yet we haven't reconciled ourselves with what could end up being a tremendous hazard to these models. We haven't yet seen the large-scale emergence of audio and video transcripts on the Internet. YouTube has made considerable strides toward automatic captioning, for instance, but an enormous volume of audio and video content is produced every day that isn't being transcribed and made readily available to search engines and language models...yet.

■ That will certainly change. And when it does, transcribed content will ultimately be represented out of proportion to its intrinsic value. It's easy to speak at 150 words per minute or faster, but even skilled keyboard users are generally able to type at only about half that speed -- and real, thoughtful composition is even slower. Spoken words simply pile up faster than written ones.
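■ To put rough numbers on that gap, here is a back-of-the-envelope sketch. It uses only the figures above (150 words per minute for speech, about half that for typing) and a three-hour daily talk show. The function name and the idea of someone typing nonstop for three hours are illustrative assumptions, not measurements.

```python
# Figures from the post: speech at about 150 words per minute,
# skilled typing at roughly half that speed.
SPEECH_WPM = 150
TYPING_WPM = SPEECH_WPM / 2

def words_produced(wpm: float, hours: float) -> int:
    """Words produced at a steady rate over the given number of hours."""
    return round(wpm * hours * 60)

# A three-hour daily talk show, transcribed in full.
show_words_per_day = words_produced(SPEECH_WPM, 3)

# Hypothetical comparison: a skilled typist who never pauses to think,
# working for the same three hours.
typed_words_per_day = words_produced(TYPING_WPM, 3)

print(show_words_per_day)   # 27000
print(typed_words_per_day)  # 13500
```

Even under the generous assumption that the typist never stops to compose a thought, one talk show doubles the written output, every single day.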

■ Once the transcription material from Snapchat videos and Facebook Reels and time-filling talk shows makes it into things like large language models, the consequences will be bad: the sources those models draw upon will be flooded with low-quality content.

■ But the models aren't sentient, so unless the humans who gatekeep their inputs are careful, those models will be contaminated by content that wouldn't pass a Wikipedia test for veracity. Who will see to it that a carefully sourced and edited graduate research thesis means more to the language models than a transcript of whatever nonsense a syndicated bloviator decided to spew on the radio for three hours a day?

■ That doesn't mean the solution requires government regulation. But it does point to just how essential it is that the people who will do the regulating seek to understand what is fundamentally going on. ChatGPT isn't going to "teach itself" anything. But something is indeed coming, and we really do need to be ready.
