Glossary
Anaphora
Anaphora refers to language that itself refers to another piece of language.
A canonical example of anaphora is the application of a pronoun, so for instance the “them” in “what did I spend with Davidson in April, and then what did I spend with them in June”.
Anaphora is often involved in coreference resolution. An effective conversational AI needs to be able to track the way that anaphora can be applied both within and between turns in a conversation.
ASR (Automatic Speech Recognition)
Automatic Speech Recognition (ASR) is the processing and understanding of speech. It involves functions such as producing a transcribed interpretation of what someone is saying, as well as telling when they’ve started or stopped speaking. It can also identify when and why someone wishes to interrupt a conversation, their mood and the identity of a speaker based on their voice.
Chatbot
Chatbots are platforms that allow customers to interact with a business on a website, app or over the phone. Chatbots often use combinations of click commands and keywords (such as asking a customer to choose a topic e.g. Money Transfer, Check Balance) and machine learning to help resolve problems or direct customers to a live agent for further troubleshooting and resolution.
Classifier
Classifiers are machine learning models that are taught to categorise information. A computer vision model that has been trained to look at different pictures of fruit and label the type of fruit in a picture is a fruit image classifier.
In the domain of natural language processing, classifiers map linguistic input to concrete interpretations of the language to identify what the user is trying to accomplish.
Semantic classifiers map from linguistic input to a range of classes that reflect interpretations of what’s being said. This can happen on the level of both intents and entities and helps understand the overall objective of the speaker even if they use ambiguous words or indirect phrases.
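As a rough illustration of mapping linguistic input to classes, the sketch below trains a deliberately crude word-counting classifier on a handful of labelled utterances. The intent labels, the training examples, and the function names are all hypothetical, and a production classifier would use a proper machine learning model rather than raw word counts.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(tokenize(text))
    return counts


def classify(text, counts):
    """Return the label whose training words overlap most with the input."""
    tokens = tokenize(text)
    scores = {label: sum(c[t] for t in tokens) for label, c in counts.items()}
    return max(scores, key=scores.get)


# Hypothetical training data: (utterance, intent label) pairs.
examples = [
    ("how much is in my account", "check_balance"),
    ("what is my balance", "check_balance"),
    ("send money to Davidson", "money_transfer"),
    ("transfer cash to Hamilton", "money_transfer"),
]
model = train(examples)
transfer_label = classify("could you transfer some money", model)
balance_label = classify("what's my balance", model)
```

Even this toy model generalises a little beyond its training data, since neither test utterance appears verbatim in the examples.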
Colloquialism
Colloquialisms are words and phrases that are used in conversational communication. They can include features such as slang and idiomatic expressions.
The way that humans chat to one another in day-to-day conversation is often somewhat different from the type of language used in more formal situations, such as giving a presentation.
Effective conversational AI must be able to respond successfully to both formal and colloquial language.
Competence/Performance
There is a dichotomy between the way language should be used (think grammar and punctuation) and how we actually use it.
Competence is our understanding of how words and phrases should be used to create sentences and performance is how we actually use those words and phrases when we communicate.
Everyday communication, whether speaking or writing, accommodates a range of different levels of performance.
Computational Linguistics
Computational linguistics applies information technology to the analysis and support of language. It includes spoken and written language as well as the perception, understanding and production of language.
The term computational linguistics is often used synonymously with natural language processing.
Conversational AI
Conversational AI refers to the technology that enables humans and information processing machines to interact using text or voice. Key to this interaction is that the speaker and the machine can understand each other and hold a conversation on topics the AI has learned about.
The key to successful conversational AI is that the technology understands and responds to the way people talk when they are talking naturally and conversationally. In order for this to happen, the AI must understand complex language such as ambiguity, implicature, and nuance, as well as features of spoken language such as accent and other variations in speaking styles.
Conversational Interface (CI)
A conversational interface is a platform that enables a person to speak or write normally whilst interacting with AI. Conversational Interfaces can also be referred to as virtual assistants (such as Alexa).
Coreference
Coreference is where different words and phrases refer to the same thing in a conversation.
For example, if someone says, “tell me John’s expenses in May, then give me those for Mary, too”, the “those” in the second part of the sentence refers back to “expenses in May”.
Coreference often spans multiple turns in a conversation. If the example above received a response and then was followed up with a statement like, “OK, and how about for Dave?”, we need to understand how to add “Dave” into the context of the conversation.
Cross Evaluation
Cross evaluation is a technique for assessing the performance of data driven models. It involves apportioning data into carefully measured training and testing sets to understand what’s working with a particular model and how the model can be improved through adjustments to itself and to the training data. Cross evaluation provides the basis for qualitative and quantitative evaluation.
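One common way of apportioning data into training and testing sets is k-fold splitting, where each portion of the data is held out for testing exactly once. The sketch below assumes that approach; the function name and sample utterances are hypothetical, and the glossary does not prescribe any particular splitting scheme.

```python
def k_fold_splits(samples, k):
    """Yield (training, testing) pairs, holding out each fold once."""
    # Deal the samples into k folds in round-robin order.
    folds = [samples[i::k] for i in range(k)]
    for held_out in range(k):
        testing = folds[held_out]
        training = [
            sample
            for index, fold in enumerate(folds)
            if index != held_out
            for sample in fold
        ]
        yield training, testing


# Hypothetical data set of six utterances, split three ways.
utterances = [
    "check balance",
    "send money",
    "lost card",
    "overdraft limit",
    "pay Davidson",
    "spend in June",
]
splits = list(k_fold_splits(utterances, 3))
```

Each split can then be used to train a model on the training portion and measure it on the testing portion, so that every sample contributes to the evaluation without ever being tested by a model that saw it during training.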
Data Driven Modelling
Data driven processes and models depend on data to achieve their objectives. The power of data driven models is that they can inductively generalise about situations that they haven’t seen.
For example, data driven models can accommodate common features of human language, such as people’s tendency to not adhere absolutely to the rules of grammar or to leave something implied, rather than said.
Disfluency
Disfluency describes when speech is interrupted, for instance with a filler like “uh” or “er”, by repeating a word or part of a word, or with a noticeable pause. Disfluency can be associated with neurologically rooted conditions such as stuttering, but it is typical for any human to occasionally be disfluent in the course of conversational communication.
Because disfluency is a normal part of speech, an effective conversational AI needs to be able to hear and understand disfluent input, just as a human would.
Dialogue Management
During a conversation, we tend to speak and then respond, taking turns to listen to one another as well as maintaining the coherence of a conversation across turns. Dialogue management is the process of handling and responding to text or speech on a turn-by-turn basis.
The dialogue management component of a conversational interface maintains the state of a conversation as it updates each turn.
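A minimal sketch of state that updates each turn might look like the following. The class, its fields, and the slot names are hypothetical, and the sketch assumes that some upstream component has already extracted the slot values from each utterance.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Conversation state carried from one turn to the next."""

    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, utterance, new_slots):
        """Record the turn and overwrite only the slots it mentions."""
        self.history.append(utterance)
        self.slots.update(new_slots)
        return self


state = DialogueState()
# Turn 1: the (assumed) understanding component supplies both slots.
state.update("tell me John's expenses in May", {"person": "John", "month": "May"})
# Turn 2: only the person changes, so the month carries over.
state.update("OK, and how about for Dave?", {"person": "Dave"})
```

Because the second turn only overwrites the slot it mentions, the month from the first turn is still available when the follow-up question is answered.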
End-of-Speech (EOS)
An important part of Automatic Speech Recognition is being able to tell when someone has finished speaking.
End-of-speech detection involves identifying when someone has finished speaking, taking into account the sound of their voice (see prosody) as well as the semantic content of what they’re saying.
End-of-Speech detection filters out background noise and interruptions so that it can tell reliably when the speaker has completely finished speaking.
Entailment
Entailment refers to a relationship between two phrases or statements where the truth of one statement necessitates the truth of the other.
For instance the statement “Felix is a cat” entails the statement “Felix is an animal”, since all cats are animals.
In the context of goal-directed conversational AI, we’re interested in the conceptual entailments that people use when they speak. For example, a form-filling agent that helps people with budgeting might need to be able to determine that “money spent on food I make for the kids” should update a previous statement about “money I spend on groceries”.
Entities
Entities are words and phrases that are assigned particular labels in the course of speech. They are necessary for successfully responding to questions, such as, “this year what had I spent with Davidson through to March 31?”
‘Davidson’, ‘this year’, and ‘through to March 31’ are the entities needed to understand and respond to the question. Davidson might be assigned a label like ‘account’, and then ‘this year’ and ‘through to March 31’ are labelled as dates.
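The labelling described above can be represented as a list of labelled spans. The sketch below finds the entities by simple phrase lookup, which is an assumption made purely for illustration; the `Entity` type, the `label_entities` function, and the lexicon are hypothetical, and real systems typically use trained models for this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A labelled span of text within an utterance."""

    text: str
    label: str
    start: int
    end: int


def label_entities(utterance, lexicon):
    """Find each known phrase in the utterance and attach its label."""
    found = []
    for phrase, label in lexicon.items():
        start = utterance.find(phrase)
        if start != -1:
            found.append(Entity(phrase, label, start, start + len(phrase)))
    # Return the entities in the order they appear in the utterance.
    return sorted(found, key=lambda entity: entity.start)


question = "this year what had I spent with Davidson through to March 31?"
# Hypothetical lexicon mapping phrases to entity labels.
lexicon = {
    "Davidson": "account",
    "this year": "date",
    "through to March 31": "date",
}
entities = label_entities(question, lexicon)
```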
Implicature
Implicature is when a speaker conveys a meaning that they have not literally stated by saying something else.
An instance of implicature can rely on non-literal interpretation as well as an understanding of context, for example:
- Statement: “I think I lost my bank card yesterday”
- Implicature: “My account needs to be made secure”
Induction
Induction is the process of deriving conclusions from observations. Machine learning operates through induction, by learning to generalise based on observations of training data.
In the context of conversational AI, induction involves observing a small set of inputs and then using these observations to draw conclusions about interpreting a much broader range of inputs.
This allows for the interpretation of inputs that can sometimes be formed in unexpected and even ungrammatical ways. It also allows for the interpretation of inputs that don’t directly trigger things like keywords and instead rely on standard features of language such as implicature.
Intents
An intent is what someone hopes to achieve by speaking e.g. you intend to inform someone about a news item, event or subject of interest.
In a conversational AI setting, intents are projected as abstract classes that are the targets of classifiers trained using data driven machine learning techniques.
Intents can interface with entities in places where a piece of information, conveyed by a specific word or phrase, is necessary for satisfying the interpretation of the intent.
For example, in the sentence “I’d like to know what Anderson paid me last month”, there is an intent involving an “incoming transaction.” In order to offer a satisfactory resolution of this intent, we need to know that the intent is specifically about the phrases “Anderson” and “last month”.
Knowledge Graphs
Machine Learning
Machine learning is a branch of computer science based on the idea that machines can learn to make general conclusions about a topic based on observations of data about that topic.
A prime objective of machine learning is to use data to train models that can make human-like determinations about human-level problems. This is achieved by inherently learning to recognise the complex underlying patterns that humans (often unconsciously) also use to make decisions and take actions.
Multi-Turn Dialogue
Multi-turn dialogue describes a conversation where participants take turns speaking. Essentially all day-to-day conversations are multi-turn dialogue.
People involved in two-person conversation will have a reasonable expectation that the other person will maintain a broader sense of the way the conversation is evolving. In other words, conversations can’t be successful if understanding is only happening on a turn-by-turn basis.
In a goal orientated dialogue with an agent, a customer might want to make a request like “could you add £100 to the amount I gave you for groceries”, and also to use coreference across turns with statements like “could I have that same information but for 2020”.
Multi-Intent Utterances
Multi-intent utterances are statements with more than one goal. An example of a multi-intent utterance might be:
- “How much have I got in my account and what’s my overdraft limit”
Understanding multi-intent inputs can sometimes require complex coreference resolution, as well as bundling the right entities with the right intents.
For example:
- “I’d like to know what was my largest expense in January, and then also in February, and then could you please tell me what I did with Davidson and Hamilton in each of those months”
Natural Language Processing (NLP)
Natural language processing is the manipulation of natural language – the real-world language people use to communicate – for the purposes of understanding, analysis, and response.
Natural Language Processing takes in speech processing, natural language understanding and speech generation.
The term NLP can be used somewhat synonymously with the term computational linguistics.
Natural Language Understanding (NLU)
Natural language understanding (NLU) is the use of computational models to output representations of what is intended by a natural language input.
Natural language understanding can also include modelling the way different expressions can refer to a single concept. For instance someone using a conversational interface to help put together a budget might talk about “cash for physio for my bad back.” An appropriately configured NLU model would infer that this is an expense relating to “Healthcare”.
Negation
Negation is the cancelling out or denying of a particular meaning. Negations in English often involve the word “not”, as in “I’d like to know what I spent in June, not July”. However there are many other ways for a speaker to indicate what they don’t mean, beginning with phrases like “instead of” and “rather than”.
Negation is often a feature of repair, and so correctly identifying negation is essential in successfully processing user inputs like, “you gave me July, but I wanted June”.
Out-of-Vocabulary Words
It is typical for a natural language processing system to have a set of words for which it has representations, whether as spoken words (so combinations of sounds) or as symbolic representations (like entries in a dictionary). Out-of-vocabulary words are terms that are not part of the system’s lexicon.
Out-of-vocabulary words are inevitably encountered in the course of conversation (by both machines and humans). Having a good strategy for dealing with both hearing and interpreting these words, including a reaction acknowledging if something hasn’t been understood, is a key part of communication.
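As a small sketch of one such strategy, the code below checks each word of an input against a lexicon and, if any are unknown, produces a reply acknowledging what was not understood. The lexicon, function names, and reply wording are all hypothetical, and the tokenisation is deliberately simplistic.

```python
import re


def out_of_vocabulary(utterance, lexicon):
    """Return the words in the utterance that the lexicon does not contain."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return [token for token in tokens if token not in lexicon]


def acknowledge(utterance, lexicon):
    """Return a clarification request if any word is unknown, else None."""
    unknown = out_of_vocabulary(utterance, lexicon)
    if unknown:
        return "Sorry, I didn't understand: " + ", ".join(unknown)
    return None


# Hypothetical lexicon of words the system has representations for.
lexicon = {"what", "is", "my", "balance"}
unknown_words = out_of_vocabulary("What is my blockchain balance?", lexicon)
reply = acknowledge("What is my blockchain balance?", lexicon)
```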
Pragmatics
Pragmatics is the aspect of linguistics concerned with interpreting what people want to accomplish through communication. Pragmatics considers the difference between the ‘meaning’ of just what a person has said, and the thing the person is actually trying to accomplish with their utterance. As such, implicature is a primary object of the study of pragmatics.
Precision (statistic)
Precision is a statistic that calculates the rate at which the predictions that a model makes are correct.
High precision means the guess that a classifier makes about a particular class is very likely to be correct.
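For a single class, precision can be computed as the number of correct predictions of that class divided by the total number of predictions of that class. The sketch below assumes one predicted label per input; the function name, intent labels, and data are hypothetical.

```python
def precision(predicted, actual, target):
    """Share of the predictions of `target` that were correct."""
    actual_for_target = [a for p, a in zip(predicted, actual) if p == target]
    if not actual_for_target:
        # The model never predicted this class, so report 0 by convention.
        return 0.0
    correct = sum(a == target for a in actual_for_target)
    return correct / len(actual_for_target)


# Hypothetical intent labels for five inputs.
actual = ["balance", "transfer", "balance", "lost_card", "transfer"]
predicted = ["balance", "balance", "balance", "lost_card", "transfer"]

# "balance" was predicted three times and was right twice.
balance_precision = precision(predicted, actual, "balance")
# "transfer" was predicted once and was right once.
transfer_precision = precision(predicted, actual, "transfer")
```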
Prosody
Prosody refers to the patterns of rhythms and tones defining a sound that constitutes spoken language. Interpreting the way that a speaker pauses, emphasises, and intones their speech is an important part of Automatic Speech Recognition.
Qualitative Evaluation
Qualitative evaluation is a mode of evaluating a model. It involves looking critically at the output of the model and trying to understand why output for data not seen during training varies from our expectations.
This mode of evaluation involves careful manual consideration of models on an output-by-output basis and provides insight into particular aspects of model performance, for instance revealing unforeseen correlations between input and inference. It is the counterpart of quantitative evaluation.
Quantitative Evaluation
Quantitative evaluation is the evaluation of a model’s overall performance based on statistical measures encompassing a large set of testing samples. Quantitative evaluation is an important technique for measuring the way that data and models evolve to fulfil the communicative objectives for which they’re designed. It is the counterpart of qualitative evaluation.
Recall (statistic)
Recall is a statistic that measures the degree to which a model covers the targets it has been designed to cover.
High recall means that a model is very good at discovering correct interpretations. It is possible to achieve arbitrarily high recall, for instance by simply predicting that every possible interpretation of an input is correct.
There is often a tradeoff between high recall and high precision, however: a model that predicts that everything is always true will be very prone to false positives.
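The trade-off can be seen in a small sketch where a model predicts every possible intent for every input. Recall is taken here as correct predictions divided by the number of true labels, and precision as correct predictions divided by the number of predictions made; the function name, intent names, and data are hypothetical.

```python
def recall_and_precision(predicted, actual):
    """Compute recall and precision over sets of predicted and true labels."""
    true_positives = sum(len(p & a) for p, a in zip(predicted, actual))
    n_actual = sum(len(a) for a in actual)
    n_predicted = sum(len(p) for p in predicted)
    recall = true_positives / n_actual if n_actual else 0.0
    precision = true_positives / n_predicted if n_predicted else 0.0
    return recall, precision


# Hypothetical set of every intent the model knows about.
all_intents = {"balance", "transfer", "lost_card", "overdraft"}
# True intents for three inputs (the third is a multi-intent utterance).
actual = [{"balance"}, {"transfer"}, {"balance", "overdraft"}]

# A model that predicts every intent for every input.
predict_everything = [set(all_intents) for _ in actual]
greedy_recall, greedy_precision = recall_and_precision(predict_everything, actual)
```

Predicting everything finds all four true labels, so recall is perfect, but only four of the twelve predictions are correct, so precision falls to one third.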
Repair
Repair is the act of making a correction during a conversation. It can be used in two ways:
- Inter-turn repair is where a speaker amends semantic content from a previous turn in a conversation. This can be a result of either a misunderstanding between speakers (“no, I wanted Davidson, not Hamilton”) or a speaker correcting themselves (“oh, actually, I wanted Davidson, not Hamilton”).
- Intra-turn repair is where a speaker corrects themselves in the course of a single sentence. This is a feature of spoken language in particular, for example someone might say “I’d like to know about what I’ve been spending with the Hamilton Corporation – I mean, with the Davidson Corporation this week”.
Rules-Based Systems
Rules-based systems rely on rules (such as: if this information is input, respond with this reply) to process information. An example of a rules-based natural language processing system is one that applies a combination of a grammar formalisation and lists of the semantic roles words can play in order to understand speech.
Traditionally computational linguistics has been dominated by rules-based systems, but most of the powerful contemporary models for natural language understanding apply some degree of data driven machine learning.
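A very simple rules-based responder of the "if this information is input, respond with this reply" kind might be sketched as follows. The rules, replies, and function name are hypothetical, and the matching is plain keyword lookup rather than a grammar formalisation.

```python
import re

# Hypothetical rules: if all of the keywords appear, use the paired reply.
RULES = [
    (("lost", "card"), "I can help you make your account secure."),
    (("balance",), "Here is your current balance."),
]
FALLBACK = "Sorry, I didn't understand that."


def respond(utterance):
    """Return the reply for the first rule whose keywords all appear."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for keywords, reply in RULES:
        if all(keyword in words for keyword in keywords):
            return reply
    return FALLBACK


lost_card_reply = respond("I think I lost my bank card yesterday")
balance_reply = respond("What is my balance?")
unknown_reply = respond("hello there")
```

The limitation of this approach is visible in the fallback: any input that avoids the exact keywords is not understood, which is one reason data driven models are attractive.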
Semantic Ambiguity
Semantic ambiguity is the propensity for words to take on different meanings in different contexts.
For instance the word “balance” as applied in an utterance like “what was my account balance last week” has a different connotation than it would in an utterance like “what on balance was the company I spent the most with over the year”.
Flexible representations of the meaning of any particular word are required in order for a conversational AI to be able to resolve semantic ambiguity in context.
Semantics
Semantics is the branch of linguistics concerned with finding meaning in language. In natural language processing, semantic classification involves taking speech as input and mapping it to a range of classes representing interpretations of meaning.
Speech Synthesis
Speech synthesis is the process of generating speech via a computer that outputs natural sounding language in real time. It is the final component of a spoken-language conversation interface, and so is the part that a system user directly experiences.
A speech synthesiser needs to be able to dynamically react to what people say and the way they say it.
Speech to Text
Speech to text is an aspect of automatic speech recognition that turns the spoken word into the written word. In essence it is an automatic transcription.
Syntax
Syntax is the aspect of linguistics concerned with the way that words are put together to create meaningful speech. Syntax is generally concerned with the rules we know of as grammar and focuses on how those rules can be combined to create speech with meaning.
Text to Speech
This element of speech synthesis involves taking the written word and converting it into speech, spoken by the AI.
Utterance
An utterance is a single, complete act of linguistic expression. Any utterance is composed of at least one (and typically more than one) word. A conversation is composed of a series of utterances made by different speakers.
There is no strict rule about what an utterance comprises. It can be a sentence, but it does not need to be a complete sentence. It can also consist of multiple sentences. In practice people use utterances of various shapes and sizes during a conversation.
Vernacular
A vernacular language is one that is spoken by a specific, often geographically defined but also possibly socially specified group of people.
The term is often used to refer to a dialect of a particular language that has emerged in informal communication between a certain group of people. As such, a vernacular is often characterised by colloquialism.
Word Error Rate (WER)
Word Error Rate is a statistic used for quantitative evaluation of automatic speech recognition, and in particular of a speech to text model.
It measures the number of words in a predicted text that have to be changed (substituted, deleted, or inserted) to make it exactly match the text that should have been predicted, relative to the length of the text that should have been predicted.
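In its usual formulation, which is assumed here, Word Error Rate is the minimum number of word substitutions, deletions, and insertions needed to turn the predicted text into the reference text, divided by the number of words in the reference. The sketch below computes it with a word-level edit distance; the function name and example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # reference word missing from hypothesis
                    current[j - 1] + 1,  # extra word in hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


reference = "what did I spend with Davidson in April"
# One of the eight reference words is transcribed incorrectly.
one_substitution = word_error_rate(reference, "what did I spend with David in April")
# Two of the eight reference words are missing.
two_deletions = word_error_rate(reference, "what did I spend in April")
```

Because insertions are counted as errors, the result can exceed 1 when the predicted text is much longer than the reference.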