How much data does AI really need to learn something useful?

Data collection by businesses and governments has been creeping up on us for quite some time. From a cultural perspective, Australians generally expect personal privacy to be respected by governments, and our governments have typically argued on the world stage that democracy and “surveillance societies” are incompatible.

Alongside this cultural norm, many organisations doing business in Australia have already collected huge troves of data about their customers. Few businesses are transparent about the extent of their data collection. Given community unease towards surveillance, lack of transparency is not surprising.

While you don’t need AI to extract value from data, AI is particularly useful in some situations, such as when data is not adequately categorised or annotated. Machine learning and Natural Language Processing (NLP) are well suited to situations where you need to add information to a data record to make it more useful.

For example, a dataset may include tens of thousands of text articles and photos, with no “metadata” or logical structure to help navigate and find content. NLP can be used to analyse topics covered in the articles, summarise content, and potentially deliver improved search results. Machine learning is well suited to tasks such as machine vision and classification of information. Machine learning is already widely used to analyse photos to identify objects, locations, and individual people.

Adding “value” to existing data is clearly a simple incremental step, and very easy to justify for most organisations. Nobody wants to search for a needle in a data haystack. It is perfectly reasonable to use tools that make finding specific data records or documents more efficient. The real question is how far those incremental steps should go.

But for most tasks, an AI system will need very significant amounts of training data to achieve an acceptable level of accuracy.

The tradeoff many leaders will face is between AI accuracy and the cost of achieving it. How accurate does the AI system truly need to be?

Fortunately, statistical analysis gives us well-established mathematical methods for measuring accuracy and answering these kinds of questions. It is possible to train machine learning and NLP systems to reach acceptable levels of accuracy when classifying and analysing an organisation’s data trove. This is already a reality.
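As a rough illustration of the statistical methods mentioned above, the sketch below estimates how many labelled test records are needed to measure accuracy within a chosen margin of error, and how much uncertainty surrounds a measured accuracy figure. It assumes a simple random sample and a 95% confidence level (z = 1.96). The function names are illustrative only and do not come from any particular product.

```python
from math import ceil, sqrt


def sample_size_for_margin(margin, expected_accuracy=0.5, z=1.96):
    """Records needed to estimate accuracy to within +/- margin.

    Uses the standard normal-approximation formula for a proportion.
    expected_accuracy=0.5 is the most conservative (largest) case.
    """
    p = expected_accuracy
    return ceil(z * z * p * (1 - p) / (margin * margin))


def wilson_interval(correct, total, z=1.96):
    """Wilson score confidence interval for a measured accuracy."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Records needed to measure accuracy to within 5 percentage points
print(sample_size_for_margin(0.05))

# The same 90% measured accuracy is far less certain on 100 records
# than on 1,000 records
print(wilson_interval(90, 100))
print(wilson_interval(900, 1000))
```

Under these assumptions, about 385 test records are needed for a margin of five percentage points. The interval around a measured accuracy narrows as the number of evaluated records grows, which is one way to decide whether a result is trustworthy enough to act on.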

Sourcing data suitable for training AI systems is a non-trivial issue. Each machine learning “black box” may need a lot of carefully curated training data, and significant ongoing executive governance, to achieve and maintain tolerable error rates.

Bias can easily be introduced with a poorly designed training data set. Budget is likely to dictate training data set size, and hence the accuracy of AI decisions based on that training data.

For example, if you need to be able to train a system to categorise something based on particular attributes, your training data may need hundreds of training examples for each possible value of those attributes. As the number of attributes that need to be considered increases, the volume of training data rises dramatically. Smaller training sets mean higher error rates.
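The growth described above can be sketched with some simple arithmetic. The example below assumes, purely for illustration, that the system needs a fixed number of examples for every combination of attribute values. Real requirements depend on the model and the data, so treat the figure of 200 examples and the function name as hypothetical.

```python
from math import prod


def examples_needed(values_per_attribute, examples_per_combination=200):
    """Estimate training set size if every combination of attribute
    values must be covered by a fixed number of examples.

    values_per_attribute: number of possible values for each attribute,
    e.g. [3, 4] means one attribute with 3 values and one with 4.
    examples_per_combination: an assumed figure, not an industry standard.
    """
    return examples_per_combination * prod(values_per_attribute)


# Each extra attribute multiplies the requirement rather than adding to it
print(examples_needed([3]))
print(examples_needed([3, 4]))
print(examples_needed([3, 4, 5]))
```

With these assumed numbers, one attribute needs 600 examples, two need 2,400 and three need 12,000. The multiplication, not the specific figures, is the point: every additional attribute multiplies the volume of data that has to be collected and labelled.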

In the real world, developing high-quality training data sets that are demographically representative can easily be extremely costly, even for organisations with huge treasure troves of client data.

Categorisation can be imperfect and subjective. When a third party creates a training data set, you obviously need to be confident in the consistency and accuracy of the information.
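One common way to check labelling consistency is to have two people label the same sample of records and measure how often they agree beyond what chance alone would produce. The sketch below computes Cohen's kappa for two annotators. It is offered as one possible check rather than a prescribed method, and the labels and function name are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance. 1.0 is perfect agreement and
    0.0 is no better than chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sample")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same label independently
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators reviewing the same four records
annotator_1 = ["complaint", "complaint", "enquiry", "enquiry"]
annotator_2 = ["complaint", "enquiry", "enquiry", "enquiry"]
print(cohens_kappa(annotator_1, annotator_2))
```

A low score on a sample like this suggests the categories themselves are ambiguous or the labelling instructions are unclear, which is worth resolving before the data is used for training.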

Some systems, such as machine vision and NLP, are available as tools based on “pre-canned” third-party training, which does the heavy lifting for the bulk of the training task. But there are significant governance risks.

Many commercially developed AI systems rely at least in part on large-scale training data sets provided by third parties. Researchers developing machine vision systems have built large, publicly available image classification data sets such as ImageNet, which provides a baseline set of more than 14 million photos organised into roughly 20,000 categories.

Unfortunately, industry reliance on ImageNet has raised some serious issues.

In 2019, ImageNet was found to contain a large number of images with bizarre and patently inaccurate categorisation labels. Some researchers showed that this inaccurate information skewed the outputs of AI systems trained on ImageNet data.

Apart from data quality, the attributes that you might want to be part of the machine learning decision may also be difficult to isolate and label in the training data.

Consider, for example, determining a person’s emotional state or gender from a photo or voice recording. Access to context may be important to making that decision, and if it is, that is likely to dramatically influence the structure and quantity of training data required.

Within the Australian public sector, some small-scale pilot projects using AI automation have already been attempted. One project attempted to implement an AI-powered system to analyse customer service phone conversations. The system ran into two fundamental problems: a small volume of transactions to be analysed, and poor data quality.

In short, while the system was intended to replace multiple people doing a manual process, it was implemented in a situation where the volume of transactions was too low to allow for adequate AI training. Inadequate training data means higher error rates.

So how much data do you need to train an AI system? That will always be the key question.
