AI / ML Training Data

Subsurface Training Data for Machine Learning

Cleaned, labelled, depth-aligned wireline data from 12,000+ Australasian wells. The 57-year QC workflow that produces our archive is the refinery for AI-ready data: missing-curve imputation, lithology classification, and basin analogue search start with input data that doesn't need rebuilding.


Why this archive is different for ML

Most public well-log corpora are raw scans or unverified LAS dumps. The labels are noisy, the depths drift, and the curves overlap inconsistently. Our archive has been processed against the same QC workflow since 1969.

Depth-aligned

Composite logs spliced from multiple runs with explicit depth shifts. No silent gaps masquerading as zero values.

Labelled

Headers validated against state and territory registers. Formation tops, lithology calls, and zonations from the original well completion reports.

Documented

Issues flagged, not silently dropped. Full provenance per well so model training can filter or weight by data confidence.
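As a rough illustration of filtering or weighting by data confidence, the sketch below turns per-well QC flags into a training weight. The record layout, flag names, and penalty values are hypothetical, chosen only for the example; they are not the archive's actual provenance schema.

```python
# Hypothetical per-well provenance records; the field names and flag
# vocabulary are illustrative assumptions, not the archive's schema.
wells = [
    {"well": "A-1", "qc_flags": []},
    {"well": "B-2", "qc_flags": ["depth_shift_applied"]},
    {"well": "C-3", "qc_flags": ["header_mismatch", "curve_gap"]},
]

# Assumed penalty per flag: a multiplicative down-weight in (0, 1].
FLAG_PENALTY = {
    "depth_shift_applied": 0.9,
    "curve_gap": 0.7,
    "header_mismatch": 0.5,
}


def confidence_weight(flags, default_penalty=0.8):
    """Multiply the penalties of all flags; an unflagged well keeps weight 1.0."""
    weight = 1.0
    for flag in flags:
        weight *= FLAG_PENALTY.get(flag, default_penalty)
    return weight


def select_wells(records, min_weight=0.0):
    """Return (well, weight) pairs at or above the confidence threshold."""
    scored = [(r["well"], confidence_weight(r["qc_flags"])) for r in records]
    return [(name, w) for name, w in scored if w >= min_weight]


weights = dict(select_wells(wells))            # weight every well
trusted = select_wells(wells, min_weight=0.8)  # or drop low-confidence wells
```

The weights can be passed as per-sample weights to most training loops, while the threshold variant simply excludes wells below a chosen confidence.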

What's in the dataset

Bulk extracts in the format your pipeline expects (Parquet, HDF5, JSON Lines, or LAS) covering everything below.

7.2B+

Log data points

Gamma ray, resistivity, sonic, density, neutron, caliper, and SP curves, all depth-indexed and units-normalised.

12,000+

Wells

77 sedimentary basins across Australia, New Zealand, Timor-Leste, and Papua New Guinea. Single-stratigraphy and multi-basin training corpora available.

4,000+

Well data summaries

Geoscientist-compiled structured documents: formation tops, core analysis, geochemistry, biostratigraphy. High-quality labels for supervised tasks.

57

Years of curation

Data spanning the modern history of Australasian petroleum exploration. Long temporal coverage for time-series and drift modelling.

The 57-year QC refinery

Every well in the archive has been through the same pipeline. The output is what most ML teams spend the first three months of a project building.

1

Header validation

Cross-check well name, location, datum, depth range against state and territory government registers. Flag mismatches.

2

Curve QC

Each curve checked for spikes, gaps, null values (no -999.25 leaks), and unit consistency. Issues documented, not silently dropped.

3

Composite splicing

Multiple logging runs aligned to a single composite per well. Depth shifts applied where original runs disagree. Provenance retained.

4

Format normalisation

LAS 2.0-conformant output, standardised mnemonics, consistent unit conventions. Training pipelines don't need a per-well preamble.
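Steps 2 to 4 can be sketched roughly as follows. This is an illustrative outline, not the production pipeline: the function names, the alias table, the spike rule, the run labels, and the sample values are all assumptions made for the example.

```python
NULL_SENTINEL = -999.25  # the LAS null value named in step 2

# Hypothetical alias table; real mnemonic standardisation covers far more names.
MNEMONIC_ALIASES = {"GRC": "GR", "GAMMA": "GR", "DTC": "DT"}


def clean_nulls(values):
    """Replace the null sentinel with None so it cannot be read as data."""
    return [None if v == NULL_SENTINEL else v for v in values]


def flag_spikes(values, threshold):
    """Flag samples that jump away from BOTH neighbours, in the same
    direction, by more than threshold. Flags only; nothing is removed."""
    flagged = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if None in (prev, cur, nxt):
            continue
        d_prev, d_next = cur - prev, cur - nxt
        if d_prev * d_next > 0 and min(abs(d_prev), abs(d_next)) > threshold:
            flagged.append(i)
    return flagged


def splice_runs(primary, secondary, secondary_shift):
    """Merge two runs (depth -> value) into one composite.

    The secondary run is depth-shifted first; the primary run wins on overlap.
    Each sample keeps (value, source_run, shift_applied) as provenance.
    """
    composite = {}
    for depth, value in secondary.items():
        if value is not None:
            key = round(depth + secondary_shift, 2)
            composite[key] = (value, "run2", secondary_shift)
    for depth, value in primary.items():
        if value is not None:
            composite[round(depth, 2)] = (value, "run1", 0.0)
    return dict(sorted(composite.items()))


def normalise_mnemonic(name):
    """Map an alias to its standard mnemonic; unknown names pass through upper-cased."""
    key = name.strip().upper()
    return MNEMONIC_ALIASES.get(key, key)


# Synthetic gamma-ray samples: one spike and one null at the bottom of run 1.
depths1 = [100.0, 100.5, 101.0, 101.5, 102.0]
run1_values = clean_nulls([50.0, 52.0, 400.0, 51.0, NULL_SENTINEL])
spikes = flag_spikes(run1_values, threshold=100.0)

run1 = dict(zip(depths1, run1_values))
run2 = {101.0: 55.0, 101.5: 57.0, 102.0: 60.0}  # assumed to sit 0.5 m shallow
composite = splice_runs(run1, run2, secondary_shift=0.5)
```

In this toy case the spike at 101.0 m is flagged but stays in the composite, and the null at 102.0 m in the first run is filled from the shifted second run, with the source run and shift recorded against the sample.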

What teams use this for

Well-log imputation

Predicting missing curves from co-recorded ones. Train against composites where every curve is available, then deploy on partial logs.

Typical inputs: GR + resistivity + sonic. Typical target: density or neutron.
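A minimal sketch of the idea, using k-nearest-neighbour regression as a simple stand-in for whatever model a team would actually train. The feature layout (GR, log10 resistivity, sonic), the density targets, and every number below are synthetic toy values, not archive data.

```python
import math
import statistics


def fit_scaler(rows):
    """Per-feature mean and population standard deviation of the training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return means, stdevs


def scale(row, means, stdevs):
    """Standardise one feature row so no single curve dominates the distance."""
    return [(x - m) / s for x, m, s in zip(row, means, stdevs)]


def knn_impute(train_x, train_y, query, k=2):
    """Predict the missing curve as the mean target of the k nearest samples."""
    means, stdevs = fit_scaler(train_x)
    scaled_query = scale(query, means, stdevs)
    ranked = sorted(
        (math.dist(scale(row, means, stdevs), scaled_query), target)
        for row, target in zip(train_x, train_y)
    )
    return statistics.fmean([target for _, target in ranked[:k]])


# Synthetic training samples taken where every curve is present:
# [GR (API), log10 resistivity, sonic (us/ft)] -> density (g/cm3)
train_x = [[30.0, 1.5, 60.0], [35.0, 1.4, 62.0], [110.0, 0.3, 95.0], [120.0, 0.2, 100.0]]
train_y = [2.65, 2.63, 2.40, 2.35]

# A depth sample from a partial log with no density curve.
predicted_density = knn_impute(train_x, train_y, [32.0, 1.45, 61.0], k=2)
```

The same train-on-complete, predict-on-partial split applies unchanged if the regressor is swapped for gradient boosting or a sequence model.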

Lithology classification

Supervised lithology calls per depth from log signatures. Formation tops from the well data summaries (WDS) give labelled intervals; gamma ray, resistivity, and density curves give features.

Multi-basin generalisation is possible because labels and units are normalised.
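One way to turn formation tops into per-depth labels is an interval lookup, sketched below. The tops, formation names, and depths are hypothetical, and the tuple layout is an assumption rather than the WDS format.

```python
from bisect import bisect_right


def label_depths(depths, tops):
    """Label each depth with the deepest formation top at or above it.

    tops is a list of (top_depth, name) pairs. Depths shallower than the
    first top get None rather than a guessed label.
    """
    ordered = sorted(tops, key=lambda t: t[0])
    top_depths = [d for d, _ in ordered]
    labels = []
    for depth in depths:
        idx = bisect_right(top_depths, depth) - 1
        labels.append(ordered[idx][1] if idx >= 0 else None)
    return labels


# Hypothetical formation tops (metres) and log sample depths.
tops = [(1500.0, "Formation A"), (1620.0, "Formation B"), (1710.0, "Formation C")]
depths = [1490.0, 1500.0, 1619.5, 1620.0, 1800.0]
labels = label_depths(depths, tops)
```

Each labelled depth can then be paired with the curve values recorded at that depth to form one supervised training row.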

Basin analogue search

Embedding-based well similarity for analogue exploration. Find wells with similar log profiles to a target across basins.

Works well over the Australasian dataset because of consistent QC across decades.
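A toy sketch of similarity ranking, assuming a hand-built embedding (mean and standard deviation per curve) in place of a learned one. Well names and curve values are synthetic, and a real system would standardise features or learn the embedding rather than use raw statistics.

```python
import math
import statistics


def embed(curves):
    """Toy embedding: mean and population std dev of each curve,
    concatenated in sorted mnemonic order so vectors are comparable."""
    vector = []
    for name in sorted(curves):
        vector.append(statistics.fmean(curves[name]))
        vector.append(statistics.pstdev(curves[name]))
    return vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def rank_analogues(target, candidates):
    """Return candidate well names, most similar to the target first."""
    target_vec = embed(target)
    scores = {name: cosine(target_vec, embed(c)) for name, c in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic wells: W-1 resembles the target, W-2 is a much hotter gamma-ray profile.
target = {"GR": [40.0, 45.0, 50.0], "RHOB": [2.60, 2.62, 2.64]}
candidates = {
    "W-1": {"GR": [42.0, 46.0, 52.0], "RHOB": [2.59, 2.61, 2.65]},
    "W-2": {"GR": [110.0, 120.0, 130.0], "RHOB": [2.35, 2.40, 2.45]},
}
ranking = rank_analogues(target, candidates)
```

Consistent units and mnemonics across wells are what make the vectors comparable in the first place; without them the same curve would land in different slots of the embedding.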

Petrophysical foundation models

Pre-train on this corpus, fine-tune on a target asset. The 57-year time span gives long-tail coverage that production data alone can't provide.

Custom licence terms accommodate distribution of model weights.

Licensing for training data

Bulk extraction with explicit training-data licence terms. Per-curve-metre pricing for individual wells; custom licence for full-corpus training. Tell us what you're training and we'll prepare a quote and licence terms tailored to the project.
