How much data does AI really need to learn something useful?

Data collection by businesses and governments has been creeping up on us for quite some time. From a cultural perspective, Australians generally expect personal privacy to be respected by governments, and our governments have typically argued on the world stage that democracy and “surveillance societies” are incompatible.

Alongside this cultural norm, many organisations doing business in Australia have already collected huge troves of data about their customers. Few businesses are transparent about the extent of their data collection. Given community unease towards surveillance, lack of transparency is not surprising.

While you don’t need AI to extract value from data, there are situations where it is particularly useful, such as when data is not adequately categorised or annotated. Machine learning and Natural Language Processing (NLP) are well suited to situations where you need to enrich a data record with extra information to make it more useful.

For example, a dataset may include tens of thousands of text articles and photos, with no “metadata” or logical structure to help navigate and find content. NLP can be used to analyse the topics covered in the articles, summarise content, and potentially deliver improved search results. Machine learning is well suited to tasks such as machine vision and the classification of information; it is already widely used to analyse photos and identify objects, locations, and individual people.
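As a rough, hypothetical illustration of the search use case (not a description of any specific product), the sketch below indexes a handful of placeholder article titles with TF-IDF and ranks them against a free-text query using scikit-learn:

```python
# A minimal sketch: making an unstructured pile of text articles searchable
# with TF-IDF and cosine similarity. Assumes scikit-learn is installed; the
# articles list is placeholder data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Annual report on customer complaints and service outcomes",
    "Photographs of regional infrastructure projects, 2018-2020",
    "Minutes of the quarterly data governance committee meeting",
]

# Build a term-weighted index over the article text.
vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(articles)

def search(query, top_n=2):
    """Return the articles most similar to a free-text query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, index).ravel()
    ranked = scores.argsort()[::-1][:top_n]
    return [(articles[i], float(scores[i])) for i in ranked]

print(search("customer service complaints"))
```

Even this crude term-matching approach adds navigable structure to a previously unindexed collection, which is the kind of incremental value the next paragraph describes.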

Adding “value” to existing data is clearly a simple incremental step, and very easy to justify for most organisations. Nobody wants to search for a needle in a data haystack. It is perfectly reasonable to use tools that make finding specific data records or documents more efficient. The question is really one of incrementalism.

But for most tasks, an AI system will need very significant amounts of training data to achieve an acceptable level of accuracy.

The tradeoff many leaders will face is one of accuracy: how accurate does the AI system truly need to be?

Fortunately, statistical analysis gives us well-established mathematical methods for measuring and answering these kinds of learning questions. It is possible to train machine learning and NLP systems to reach acceptable levels of accuracy when classifying and analysing an organisation’s data trove. This is already a reality.
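One well-established way to frame the “how much data?” question is a learning curve: held-out accuracy measured as a function of training set size. The sketch below uses scikit-learn and its bundled digits dataset purely as stand-in data, not as a claim about any particular organisation’s records:

```python
# A hedged sketch of a learning curve: how does held-out accuracy change as
# the amount of training data grows? The digits dataset is illustrative only.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train on 10%, 32.5%, 55%, 77.5% and 100% of the available data,
# cross-validating each time to estimate accuracy on unseen examples.
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> {score:.1%} mean held-out accuracy")
```

The point where the curve flattens out gives an evidence-based answer to how much data a given task actually needs, rather than a guess.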

Sourcing data suitable for training AI systems is a non-trivial issue. Each machine learning “black box” may potentially need a lot of carefully curated training data, and significant ongoing executive governance to achieve and maintain tolerable error rates.

Bias can easily be introduced by a poorly designed training data set. Budget is likely to dictate training data set size, and hence the accuracy of AI decisions based on that training data.

For example, if you need to train a system to categorise records based on particular attributes, your training data may need hundreds of examples for each possible value of each attribute. As the number of attributes to be considered grows, the volume of training data required rises dramatically, as the rough sketch below illustrates. Smaller training sets mean higher error rates.
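A back-of-the-envelope calculation makes the growth obvious. The figures below, five possible values per attribute and one hundred examples per combination, are illustrative assumptions only:

```python
# Illustrative only: how required training data balloons as attributes are
# added, assuming 5 possible values per attribute and 100 examples needed
# per combination of values.
values_per_attribute = 5
examples_per_combination = 100

for n_attributes in range(1, 7):
    combinations = values_per_attribute ** n_attributes
    required = combinations * examples_per_combination
    print(f"{n_attributes} attributes -> {required:,} training examples")
```

With these assumptions, one attribute needs 500 examples, but six attributes already need more than 1.5 million, which is where budget starts to dictate accuracy.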

In the real world, developing high-quality training data sets that represent a demographically accurate sample can easily be extremely costly, even for organisations with huge treasure troves of client data.

Categorisation can be imperfect and subjective. When a third party creates a training data set, you need to be confident in the consistency and accuracy of its labels.

Some systems, such as machine vision and NLP, are available as tools built on “pre-canned” third-party training, which does the heavy lifting for the bulk of the training task. But there are significant governance risks.

Many commercially developed AI systems rely at least in part on large-scale training data sets provided by third parties. Researchers working on machine vision have built large, openly available image classification datasets such as ImageNet, which provides a baseline set of some 14 million photos organised into roughly 20,000 categories.
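Much of that heavy lifting is available off the shelf: libraries such as torchvision ship image classifiers pre-trained on the 1,000-category ILSVRC subset of ImageNet. A hedged sketch of that “pre-canned” approach, assuming torchvision is installed and using a placeholder file path:

```python
# A sketch of classifying a photo with a model pre-trained on ImageNet via
# torchvision. "example_photo.jpg" is a placeholder path; the category labels
# are inherited from the third-party training data bundled with the weights.
import torch
from torchvision import models
from torchvision.io import read_image

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()          # the preset used during training
image = read_image("example_photo.jpg")    # placeholder input image

with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))

top = logits.softmax(dim=1).topk(3)
for score, idx in zip(top.values[0], top.indices[0]):
    print(f"{weights.meta['categories'][int(idx)]}: {float(score):.1%}")
```

Note that the categories, and any quirks or errors in how they were labelled, are inherited wholesale from the third-party training data, which is exactly the governance risk at issue.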

Unfortunately, industry reliance on ImageNet has raised some serious issues.

In 2019, ImageNet was found to contain a large number of images with bizarre and patently inaccurate category labels, and researchers showed that these labels skewed the outputs of AI systems trained on ImageNet data.

Apart from data quality, the attributes you might want the machine learning system to base its decisions on may also be difficult to isolate and label in the training data.

Consider, for example, determining a person’s emotional state or gender from a photo or voice recording. Access to context may be important to making such a decision, and if it is, that is likely to dramatically influence the structure and quantity of training data required.

Within the Australian public sector, a number of small-scale AI automation pilot projects have already been attempted. One project implemented an AI-powered system to analyse customer-service-style phone conversations. The system ran into two fundamental problems: a small volume of transactions to be analysed, and poor data quality.

In short, while the system was intended to replace multiple people doing a manual process, it was implemented in a situation where the volume of transactions was too low to allow for adequate AI training. Inadequate training data means higher error rates.

So how much data do you need to train an AI system? That will always be the key question.