COMMENTARY
Artificial intelligence (AI) is rapidly altering nearly every aspect of our daily lives, from how we work to how we consume information to how we choose our leaders. As with any technology, AI is amoral: it can be used to advance society or to cause harm.
Data is the genetic material that powers AI applications: DNA and RNA wrapped into one. As is often said of software systems, "garbage in, garbage out." AI technology is only as accurate, secure, and functional as the data sources it relies on. The key to ensuring that AI fulfills its promise and avoids its nightmares lies in the ability to keep the garbage out and prevent it from proliferating and replicating across millions of AI applications.
This is called data provenance, and we cannot wait another day to implement controls that prevent our AI future from becoming a massive trash heap.
Bad data leads to AI models that can propagate cybersecurity vulnerabilities, misinformation, and other attacks globally in seconds. Today's generative AI (GenAI) models are incredibly complex, but at their core they simply predict the best next chunk of data to output, given the data that came before.
A Measurement of Accuracy
A ChatGPT-type model evaluates the words that make up the original question and all the words in its response so far to calculate the next best word to output. It does this repeatedly until it decides it has given a sufficient response. If you judge a model on its ability to string together well-formed, grammatically correct sentences that stay on topic and remain relevant to the conversation, today's models are amazingly good. That is a measure of accuracy.
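To make that loop concrete, here is a minimal, purely illustrative Python sketch. The next_word_distribution function is a hypothetical stand-in for a trained model, which in reality scores tens of thousands of candidate tokens using billions of learned parameters and typically samples from the distribution rather than always taking the top choice.

# Toy sketch of the autoregressive loop described above (not a real LLM).

def next_word_distribution(context: list[str]) -> dict[str, float]:
    """Hypothetical stand-in for a trained model: assigns a score to each
    candidate next word, given all of the words produced so far."""
    if context[-1] == "?":
        return {"Data": 0.7, "Garbage": 0.3}
    if context[-1] == "Data":
        return {"matters.": 0.6, "<end>": 0.4}
    return {"<end>": 1.0}

def generate(prompt: list[str], max_words: int = 50) -> list[str]:
    words = list(prompt)
    for _ in range(max_words):
        scores = next_word_distribution(words)
        best = max(scores, key=scores.get)   # pick the highest-scoring word
        if best == "<end>":                  # the model decides it has said enough
            break
        words.append(best)                   # the new word becomes part of the context
    return words

print(" ".join(generate(["Why", "does", "data", "provenance", "matter", "?"])))
# prints: Why does data provenance matter ? Data matters.

The key point for data provenance is the last line of the loop: every generated word is computed from, and then fed back into, whatever data the model has already absorbed.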
Dig deeper into whether the AI-produced text always conveys correct information and appropriately indicates how confident the model is in that information, and issues emerge: models predict very well on average but not so well on edge cases. That is a robustness problem, and it is compounded when poor data output from AI models is stored online and used as future training data for these and other models.
The poor outputs can replicate at a scale we have never seen, causing a downward AI doom loop.
If a bad actor wanted to help this process along, they could purposely encourage extra bad data to be produced, stored, and propagated. That could mean even more misinformation coming out of chatbots, or something as nefarious and scary as an automobile autopilot model deciding it needs to veer the car sharply to the right, despite objects being in the way, because it "sees" a specially crafted image in front of it (hypothetically, of course).
After decades, the software development industry, led by the Cybersecurity and Infrastructure Security Agency (CISA), is finally implementing a secure-by-design framework. Secure-by-design mandates that cybersecurity be at the foundation of the software development process, and one of its core tenets requires cataloging every software component in a software bill of materials (SBOM) to bolster security and resiliency. Finally, security is replacing speed as the most critical go-to-market factor.
Securing AI Designs
AI needs something similar. The AI feedback loop renders ineffective many of our past cybersecurity defense techniques, such as tracking malware signatures, building perimeters around network resources, or scanning human-written code for vulnerabilities. We must make secure AI design a requirement during the technology's infancy, so AI can be made secure long before Pandora's box is opened.
So, how do we solve this problem? We should take a page from academia. We train students on highly curated data, interpreted and conveyed to them by an industry of teachers. We continue this approach with adults, but adults are expected to do more of the data curation themselves.
AI model training needs to take a two-stage curated-data approach. To start, base AI models would be trained with current methodologies on massive amounts of less-curated data. These base large language models (LLMs) would be roughly analogous to a newborn baby. The base-level models would then be trained on highly curated data sets, similar to how children are taught and raised to become adults.
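As a toy sketch of the two stages, with hypothetical corpora and an arbitrary re-weighting scheme standing in for real pretraining and curated fine-tuning, the idea might look like this in Python:

# Stage one: learn broad statistics from a large, loosely curated corpus.
# Stage two: re-weight the "model" with a smaller, highly curated corpus.
# The "model" here is just weighted word counts, not a real LLM.

from collections import Counter

def train(model: Counter, corpus: list[str], weight: int = 1) -> None:
    """Accumulate weighted word counts into the toy model."""
    for document in corpus:
        for word in document.split():
            model[word] += weight

raw_web_corpus = ["the moon is made of cheese", "the moon orbits the earth"]
curated_corpus = ["the moon orbits the earth"]   # vetted, provenance-tracked data

model = Counter()
train(model, raw_web_corpus)             # stage one: broad, less-curated base training
train(model, curated_corpus, weight=10)  # stage two: highly curated "upbringing"

print(model.most_common(3))              # the curated facts now dominate the model

The design choice is the same one parents and teachers make: the raw world provides breadth, but the curated material determines what the model ultimately treats as reliable.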
The effort to build large, curated training data sets for all types of goals will not be small. It is analogous to the effort that parents, schools, and society put into providing a quality environment and quality information for children as they grow into (hopefully) functioning, value-added contributors to society. That is the level of effort required to build quality data sets that train quality, well-functioning, minimally corrupted AI models, and it could give rise to a whole industry of humans and AI working together to teach AI models to do their jobs well.
The state of today's AI training process shows some signs of this two-stage approach. But because GenAI technology and its industry are so young, too much training still relies on the less-curated, stage-one approach.
When it comes to AI security, we can't afford to wait an hour, let alone a decade. AI needs a 23andMe-style application that enables full review of its "algorithm genealogy," so developers can fully comprehend an AI system's "family" history and prevent chronic issues from replicating, infecting the critical systems we rely on every day, and creating economic and societal harm that may be irreversible.
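As a purely illustrative sketch (the record fields and lineage below are hypothetical), such a genealogy review could start with provenance records that name each model's and dataset's parents, plus a walk of the family tree that flags any untrusted ancestor before an artifact is deployed:

# Sketch of an "algorithm genealogy" check: every model and dataset carries
# a provenance record pointing at its parents, and a recursive walk surfaces
# untrusted ancestors anywhere in the family tree.

from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    kind: str                      # "dataset" or "model"
    trusted: bool = True           # outcome of curation and review
    parents: list["Artifact"] = field(default_factory=list)

def untrusted_ancestors(artifact: Artifact) -> list[str]:
    """Walk the full family tree and report every untrusted ancestor."""
    flagged = []
    for parent in artifact.parents:
        if not parent.trusted:
            flagged.append(parent.name)
        flagged.extend(untrusted_ancestors(parent))
    return flagged

scraped_web = Artifact("scraped-web-2024", "dataset", trusted=False)
curated_qa  = Artifact("curated-qa-set", "dataset")
base_model  = Artifact("base-llm", "model", parents=[scraped_web])
chat_model  = Artifact("chat-llm", "model", parents=[base_model, curated_qa])

print(untrusted_ancestors(chat_model))   # prints: ['scraped-web-2024']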
Our national security depends on it.