Data Science & AI Vocabulary

Machine learning, statistics, and data analysis terminology

15 terms total · 4 beginner · 8 intermediate · 3 advanced

beginner (4 terms)

Algorithm

/ˈælɡərɪðəm/

A finite set of well-defined instructions for solving a problem or performing a computation.

Model

/ˈmɒd(ə)l/

A mathematical or computational representation of a system, learned from data to make predictions or decisions.

Feature

/ˈfiːtʃə/

An individual measurable property or characteristic used as input to a machine learning model.

Pipeline

/ˈpaɪplaɪn/

A sequence of data processing steps chained together so the output of each step feeds directly into the next.
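
As a rough illustration of that chaining, here is a minimal sketch in plain Python. The helper `make_pipeline` and the three steps (`drop_missing`, `scale_to_unit`, `round_values`) are hypothetical names invented for this example and do not belong to any particular library.

```python
from functools import reduce


def make_pipeline(*steps):
    """Chain steps so that each step's output becomes the next step's input."""
    def run(data):
        return reduce(lambda value, step: step(value), steps, data)
    return run


# Hypothetical processing steps, chosen only for illustration.
def drop_missing(rows):
    return [r for r in rows if r is not None]


def scale_to_unit(rows):
    top = max(rows)
    return [r / top for r in rows]


def round_values(rows):
    return [round(r, 2) for r in rows]


pipeline = make_pipeline(drop_missing, scale_to_unit, round_values)
result = pipeline([2.0, None, 4.0, 8.0])
print(result)  # [0.25, 0.5, 1.0]
```

Because each step only sees the previous step's output, steps can be reordered, swapped or tested in isolation.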

intermediate (8 terms)

Overfitting

/ˌəʊvəˈfɪtɪŋ/

When a model fits the training data too closely, including its noise, and therefore generalises poorly to new data.
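
The effect can be seen on a toy problem. The sketch below uses hand-made data (an assumption for illustration, not a real dataset) in which the underlying rule is "label 1 when x >= 5" and two training labels are deliberately flipped. A nearest-neighbour "memoriser" reproduces the noisy training labels perfectly but does worse on fresh points than a simple threshold rule, which is supplied by hand here rather than learned.

```python
def accuracy(predict, xs, ys):
    """Fraction of points for which the prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


# Toy data: the true rule is y = 1 when x >= 5, but the training labels
# at x = 3 and x = 7 are flipped on purpose to act as noise.
train_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
train_y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
test_x = [x + 0.2 for x in train_x]
test_y = [1 if x >= 5 else 0 for x in test_x]  # noise-free labels


def memoriser(x):
    # 1-nearest-neighbour: copy the label of the closest training point,
    # noise included.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def threshold(x):
    # A much simpler model that ignores the two noisy labels.
    return 1 if x >= 5 else 0


memoriser_train = accuracy(memoriser, train_x, train_y)  # 1.0
memoriser_test = accuracy(memoriser, test_x, test_y)     # 0.8
threshold_train = accuracy(threshold, train_x, train_y)  # 0.8
threshold_test = accuracy(threshold, test_x, test_y)     # 1.0
print(memoriser_train, memoriser_test, threshold_train, threshold_test)
```

A perfect training score paired with a lower score on unseen points is the typical signature of overfitting.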

Cross-Validation

/krɒs ˌvælɪˈdeɪʃ(ə)n/

A resampling technique that evaluates model performance by repeatedly partitioning data into training and validation folds.
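
One common variant is k-fold cross-validation. The sketch below only shows the partitioning: `k_fold_indices` is a hypothetical helper written for this example, and it splits indices in order without shuffling, which real workflows usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=6, k=3))
for train, validation in folds:
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample lands in a validation fold exactly once; the model would be trained and scored once per fold and the scores averaged.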

Neural Network

/ˈnjʊərəl ˈnetwɜːk/

A computational system loosely modelled on the brain, composed of layers of interconnected nodes that learn representations from data.
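
To make "layers of interconnected nodes" concrete, here is a minimal forward pass through a two-layer network in plain Python. The weights and biases are hand-picked, hypothetical values; in a real network they would be learned from data, which this sketch does not show.

```python
def relu(values):
    """Rectified linear unit: negative activations become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus a bias per node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Hypothetical, untrained parameters chosen only for illustration.
hidden_weights = [[1.0, -1.0], [0.5, 0.5]]
hidden_biases = [0.0, -0.5]
output_weights = [[1.0, 2.0]]
output_biases = [0.1]

x = [2.0, 1.0]
hidden = relu(dense(x, hidden_weights, hidden_biases))  # [1.0, 1.0]
output = dense(hidden, output_weights, output_biases)   # one value, about 3.1
print(hidden, output)
```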

Gradient Descent

/ˈɡreɪdiənt dɪˈsent/

An optimisation algorithm that iteratively adjusts model parameters in the direction of the negative gradient of the loss function, taking small steps that reduce the loss.
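
A minimal sketch of the update rule on a one-parameter problem follows. The loss (w - 3)^2, the learning rate and the step count are all assumptions chosen so the example converges quickly.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Illustrative loss: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Its minimum is at w = 3.
w_final = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(w_final)  # very close to 3.0
```

With a learning rate that is too large the same loop can overshoot and diverge, which is why the step size is usually tuned.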

Precision

/prɪˈsɪʒ(ə)n/

The proportion of predicted positive cases that are truly positive: TP / (TP + FP).

Recall

/rɪˈkɔːl/

The proportion of actual positive cases that are correctly identified: TP / (TP + FN).
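
The precision and recall formulas can be checked on a small example. The labels below are made up for illustration, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions prescribe.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
# TP = 3, FP = 2, FN = 1
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.6 0.75
```

Here the classifier finds most of the positives (recall 0.75) but two of its five positive predictions are wrong (precision 0.6).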

Embedding

/ɪmˈbedɪŋ/

A dense, comparatively low-dimensional vector representation of discrete or high-dimensional data (e.g., words, entities), learned from data so that similar items lie close together.
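
In code, an embedding is often just a lookup from an item to a vector, compared with a similarity measure such as cosine similarity. The three vectors below are hypothetical and hand-picked for illustration; real embeddings are learned and typically have far more dimensions.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked rather than learned.
embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal = cosine_similarity(embeddings["king"], embeddings["queen"])
fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(royal > fruit)  # True: "king" is closer to "queen" than to "apple"
```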

Inference

/ˈɪnf(ə)r(ə)ns/

The process of using a trained model to make predictions on new, unseen data.

advanced (3 terms)

Transformer

/trænsˈfɔːmə/

A deep learning architecture based on self-attention mechanisms, widely used for NLP and increasingly in other modalities.
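
The core of self-attention can be sketched in a few lines. This is a deliberately simplified version: queries, keys and values are all the raw token vectors, whereas a real transformer applies learned projection matrices, multiple heads and further layers. The token vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * value[j] for w, value in zip(weights, tokens))
                 for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d token vectors
outputs, weights = self_attention(tokens)
print(weights[0])  # the first token attends least to the second token
```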

Bias–Variance Trade-off

/ˈbaɪəs ˈveəriəns/

The tension between a model's error due to simplifying assumptions (bias) and its sensitivity to fluctuations in training data (variance).
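
For squared-error loss this tension is usually written as a decomposition of expected error. The notation is an assumption introduced here: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fitted on a random training set, and expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Simpler models tend to raise the first term and lower the second; more flexible models tend to do the opposite.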

Data Leakage

/ˈdeɪtə ˈliːkɪdʒ/

When information that would not be available at prediction time (for example, from the validation or test set, or from the target itself) is used to build the model, causing overly optimistic performance estimates.
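
A common way this happens is fitting a preprocessing step on all the data before splitting. The sketch below uses made-up values and a hypothetical `centre` helper to contrast a mean computed on everything (leaky) with a mean computed on the training portion only.

```python
from statistics import mean

# Toy values; the last one plays the role of held-out test data.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]


def centre(rows, centre_value):
    """Subtract a reference value from every row."""
    return [r - centre_value for r in rows]


# Leaky: the statistic is computed on all data, so the test value
# influences how the training features are transformed.
leaky_mean = mean(values)  # 22.0
leaky_train = centre(train, leaky_mean)

# Clean: the statistic is computed on the training data only and then
# reused unchanged for the test data.
clean_mean = mean(train)   # 2.5
clean_train = centre(train, clean_mean)
clean_test = centre(test, clean_mean)

print(leaky_train)  # [-21.0, -20.0, -19.0, -18.0]
print(clean_train)  # [-1.5, -0.5, 0.5, 1.5]
```

The same rule applies to scaling, imputation and feature selection: fit on the training split, then apply to the rest.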