Data Science & AI Vocabulary
Machine learning, statistics, and data analysis terminology
beginner (4 terms)
Algorithm
/ˈælɡərɪðəm/
A finite set of well-defined instructions for solving a problem or performing a computation.
Model
/ˈmɒd(ə)l/
A mathematical or computational representation of a system, learned from data to make predictions or decisions.
Feature
/ˈfiːtʃə/
An individual measurable property or characteristic used as input to a machine learning model.
Pipeline
/ˈpaɪplaɪn/
A sequence of data processing steps chained together so the output of each step feeds directly into the next.
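As a minimal sketch of the chaining idea, the helper below composes ordinary functions so that each one receives the previous one's output. The names `make_pipeline`, `drop_missing`, `to_float` and `min_max_scale` are hypothetical and chosen only for illustration; they are not taken from any particular library.

```python
from functools import reduce
from typing import Callable


def make_pipeline(*steps: Callable) -> Callable:
    """Chain steps so that each step's output becomes the next step's input."""
    def run(data):
        return reduce(lambda value, step: step(value), steps, data)
    return run


# Hypothetical processing steps, for illustration only.
def drop_missing(rows):
    return [r for r in rows if r is not None]


def to_float(rows):
    return [float(r) for r in rows]


def min_max_scale(rows):
    # Assumes at least two distinct values, otherwise hi - lo would be zero.
    lo, hi = min(rows), max(rows)
    return [(r - lo) / (hi - lo) for r in rows]


clean = make_pipeline(drop_missing, to_float, min_max_scale)
result = clean(["2", None, "4", "6"])
print(result)  # [0.0, 0.5, 1.0]
```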
intermediate (8 terms)
Overfitting
/ˌəʊvəˈfɪtɪŋ/
When a model fits the training data too closely, noise included, and therefore generalises poorly to new data.
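One way to see the effect is to compare error on the training data with error on held-out data. The sketch below uses a small hand-made dataset (an assumption for illustration, not real measurements) and a "memoriser" that simply returns the label of the nearest training point.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function over paired data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: the underlying relation is y = 2x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0]  # 2x plus alternating noise of +1 / -1
val_x = [0.4, 1.4, 2.4, 3.4]
val_y = [0.8, 2.8, 4.8, 6.8]         # 2x without noise


def memoriser(x):
    """Return the label of the nearest training point (1-nearest-neighbour)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


train_error = mse(memoriser, train_x, train_y)
val_error = mse(memoriser, val_x, val_y)
print(train_error, val_error)
```

The memoriser reproduces every noisy training label exactly, so its training error is zero, while its error on the held-out points is about 1.64. A gap of this kind between training and validation error is the usual symptom.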
Cross-Validation
/krɒs ˌvælɪˈdeɪʃ(ə)n/
A resampling technique that evaluates model performance by repeatedly partitioning data into training and validation folds.
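The partitioning step can be sketched as an index generator for k-fold cross-validation, one common variant. The function name `k_fold_indices` is hypothetical, and the folds here are contiguous and unshuffled to keep the example deterministic; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(val_idx)
```

Each sample appears in exactly one validation fold, so averaging a metric over the k folds uses every sample for validation once.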
Neural Network
/ˈnjʊərəl ˈnetwɜːk/
A computational system loosely modelled on the brain, composed of layers of interconnected nodes that learn representations from data.
Gradient Descent
/ˈɡreɪdiənt dɪˈsent/
An optimisation algorithm that iteratively adjusts model parameters by stepping against the gradient of the loss function, the direction of steepest local decrease, in order to minimise the loss.
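A minimal one-parameter sketch of the update rule follows. The loss (x - 3)^2, the learning rate and the step count are assumptions picked for illustration; real models have many parameters and usually estimate the gradient from batches of data.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly move x a small step against the gradient."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Illustrative loss: loss(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # very close to 3, the minimiser of the loss
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor of 0.8. A rate that is too large can overshoot and diverge instead.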
Precision
/prɪˈsɪʒ(ə)n/
The proportion of predicted positive cases that are truly positive: TP / (TP + FP).
Recall
/rɪˈkɔːl/
The proportion of actual positive cases that are correctly identified: TP / (TP + FN).
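Precision and recall can both be computed from the same three counts. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative), and it returns 0.0 when a denominator is zero; that fallback is a convention chosen here, not a universal rule.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed fallback
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed fallback
    return precision, recall


# Toy labels: TP = 3, FP = 1, FN = 2.
y_true = [1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.6
```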
Embedding
/ɪmˈbedɪŋ/
A dense, relatively low-dimensional vector representation of discrete or high-dimensional items (e.g., words, entities), typically learned so that similar items lie close together.
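In code, an embedding often amounts to a lookup table from items to vectors, compared with a similarity measure such as cosine similarity. The three-dimensional vectors below are written by hand purely for illustration; real embeddings are learned and usually have far more dimensions.

```python
import math

# Hand-written toy vectors (an assumption for illustration, not learned values).
embedding_table = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_cat_dog = cosine_similarity(embedding_table["cat"], embedding_table["dog"])
sim_cat_car = cosine_similarity(embedding_table["cat"], embedding_table["car"])
print(sim_cat_dog > sim_cat_car)  # True
```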
Inference
/ˈɪnf(ə)r(ə)ns/
The process of using a trained model to make predictions on new, unseen data.
advanced (3 terms)
Transformer
/trænsˈfɔːmə/
A deep learning architecture based on self-attention mechanisms, widely used for natural language processing (NLP) and increasingly for other modalities such as images and audio.
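The self-attention step can be sketched in plain Python as scaled dot-product attention. This is a simplification under stated assumptions: the token vectors serve directly as queries, keys and values, whereas a real Transformer first applies learned projection matrices and uses several attention heads, and the toy token vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy token vectors; learned projections are omitted (assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
print(attended)
```

Each output row is a weighted average of all value vectors, which is how every position can draw on information from every other position.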
Bias–Variance Trade-off
/ˈbaɪəs ˈveəriəns/
The tension between a model's error due to simplifying assumptions (bias) and its sensitivity to fluctuations in training data (variance).
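For squared-error loss the trade-off has a standard decomposition. The notation is an assumption introduced here: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², f̂ is the model fitted on a random training set, and expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more flexible typically lowers the bias term and raises the variance term. The noise term does not depend on the model.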
Data Leakage
/ˈdeɪtə ˈliːkɪdʒ/
When information that would not be available at prediction time (for example, from the test set or from the target itself) is used to build the model, causing overly optimistic performance estimates.
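A common source of leakage is fitting a preprocessing step on all of the data before splitting it. The sketch below contrasts a leaky scaler with a clean one; the helper names `fit_scaler` and `transform` and the toy numbers are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from the given values."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    """Standardise values with previously learned statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # toy held-out value

# Leaky: the statistics are computed with the test value included.
leaky_scaler = fit_scaler(train + test)

# Clean: the statistics come from the training data only and are then reused.
clean_scaler = fit_scaler(train)
train_scaled = transform(train, clean_scaler)
test_scaled = transform(test, clean_scaler)

print(leaky_scaler[0], clean_scaler[0])  # 22.0 2.5
```

The leaky mean is pulled towards the held-out value, so anything trained on data scaled that way has indirectly seen the test set.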