Subsurface Training Data for Machine Learning
Cleaned, labelled, depth-aligned wireline data from 12,000+ Australasian wells. The 57-year QC workflow that produces our archive is the refinery for AI-ready data: missing-curve imputation, lithology classification, and basin analogue search start with input data that doesn't need rebuilding.
Why this archive is different for ML
Most public well-log corpora are raw scans or unverified LAS dumps. The labels are noisy, the depths drift, and the curves overlap inconsistently. Our archive has been processed against the same QC workflow since 1969.
Depth-aligned
Composite logs spliced from multiple runs with explicit depth shifts. No silent gaps masquerading as zero values.
Labelled
Headers validated against state and territory registers. Formation tops, lithology calls, and zonations from the original well completion reports.
Documented
Issues flagged, not silently dropped. Full provenance per well so model training can filter or weight by data confidence.
What's in the dataset
Bulk extracts in the format your pipeline expects (Parquet, HDF5, JSON Lines, or LAS) covering everything below.
Gamma ray, resistivity, sonic, density, neutron, caliper, SP — depth-indexed, units-normalised.
77 sedimentary basins across Australia, NZ, Timor-Leste, PNG. Single-stratigraphy and multi-basin training corpora available.
Geoscientist-compiled structured documents: formation tops, core analysis, geochemistry, biostratigraphy. High-quality labels for supervised tasks.
Data spanning the modern history of Australasian petroleum exploration. Long temporal coverage for time-series and drift modelling.
The 57-year QC refinery
Every well in the archive has been through the same pipeline. The output is what most ML teams spend the first three months of a project building.
Header validation
Cross-check well name, location, datum, depth range against state and territory government registers. Flag mismatches.
Curve QC
Each curve checked for spikes, gaps, null values (no -999.25 leaks), and unit consistency. Issues documented, not silently dropped.
Composite splicing
Multiple logging runs aligned to a single composite per well. Depth shifts applied where original runs disagree. Provenance retained.
Format normalisation
LAS 2.0 conformant output, standardised mnemonics, consistent unit conventions. Training pipelines don't need a per-well preamble.
What teams use this for
Well-log imputation
Predicting missing curves from co-recorded ones. Train against composites where every curve is available, then deploy on partial logs.
Typical inputs: GR + resistivity + sonic. Typical target: density or neutron.
Lithology classification
Supervised lithology calls per depth from log signatures. WDS formation tops give labelled intervals; gamma + resistivity + density give features.
Multi-basin generalisation possible because labels and units are normalised.
Basin analogue search
Embedding-based well similarity for analogue exploration. Find wells with similar log profiles to a target across basins.
Works well over the Australasian dataset because of consistent QC across decades.
Petrophysical foundation models
Pre-train on this corpus, fine-tune on a target asset. The 57-year time span gives long-tail coverage that production data alone can't.
Custom licence terms accommodate model weights distribution.
Licensing for training data
Bulk extraction with explicit training-data licence terms. Per-curve-metre pricing for individual wells; custom licence for full-corpus training. Tell us what you're training and we'll prepare a quote and licence terms tailored to the project.