Structured textbook-derived data

Recovered Knowledge Corpus

The Recovered Knowledge Corpus (RKC) turns textbook-derived sources into clean, structure-rich, machine-readable pages—preserving layout, equations, tables, figures, and metadata for AI training and evaluation across mathematics, STEM, languages, and the humanities.

RKC is built on the MonkAI restoration pipeline and a long-running commercial catalog of restored titles, and is designed for long-context model training and evaluation.

Catalog

3.1M Textbooks

1.3M digitized · 1.8M queued

Pages

~900M pages

Long-form books, not snippets

Languages

6 languages

EN, DE, FR, ES, PT, IT

Rights

Rights-cleared

Single-licensor · audit trails

What the RKC is

RKC is a large-scale corpus produced by the MonkAI restoration pipeline. Textbook-derived sources are converted into structured digital pages with reading order, equations, tables, figures, captions, and footnotes retained — ready for pretraining, long-context reasoning, layout-aware tasks, and document AI evaluation. The catalog spans core subjects such as mathematics and physics, engineering and computer science, medicine and life sciences, and the humanities and social sciences.

Pricing & Scale

Long-form quality, without the licensing friction

RKC delivers book-length, high-fidelity training data with structured exports designed for modern training and evaluation workflows.

RKC process examples

Each example pairs a pre-digitized or baseline-OCR page with its RKC counterpart: either a reconstruction with visually faithful, selectable text, or structured regions that keep figures and captions together. Both outputs are ready for subject-specific training and evaluation.

Mathematics

Mathematics: symbol-level accuracy where OCR fails

The first panel is a page processed with a baseline OCR engine; the red marks indicate symbol, index, and operator errors that change the meaning. The second panel is the RKC reconstruction: selectable text that matches the original layout closely enough to use as ground truth.

First panel · baseline OCR output

Second panel · RKC reconstruction

Complex figures & captions

Figures: images and captions kept together

Complicated page layouts often mix multiple figures, labels, and captions in a single block of print. The first panel shows the pre-digitized page; the second panel shows RKC structured regions, grouping each figure with its attached caption so they stay linked for training, retrieval, and evaluation.

First panel · pre-digitized

Second panel · RKC structured regions

Dataset coverage

RKC datasets are delivered as page-faithful textbook-derived data with consistent metadata and optional multimodal structure (text + image + layout) for training, retrieval, and evaluation.

Mathematics

17,917 textbooks

As of: Dec 2025

Arithmetic through graduate mathematics
Formulas, symbols, diagrams, tables, equation plates
Step-by-step reasoning traces (problem → method → solution)

Medicine

66,185 textbooks

As of: Dec 2025

General medicine, anatomy, physiology, pathology, pharmacology
Plates, charts, tables, diagnostic layouts
Clinical reasoning traces (symptom → diagnosis → treatment)

Science, Technology & Engineering

127,655 textbooks

As of: Dec 2025

Physics, chemistry, astronomy, engineering, applied sciences
Equations, schematics, graphs, diagrams, tables
Worked derivations and procedures (hypothesis → method → result)

Ancient Greek & Latin

30,000+ textbooks

As of: Dec 2025

Polytonic diacritics, scholia, apparatus criticus
Layout-preserved maps, plates, diagrams and tables
Citation-grounded retrieval (canonical references)

French

177,709 textbooks

As of: Dec 2025

STEM, law, medicine, philosophy, literature, arts
Layout-preserved tables, diagrams, illustrations
Normalized orthography for contemporary processing

German

183,412 textbooks

As of: Dec 2025

STEM, law, history, theology, literature, philosophy
Fraktur/Gothic and early orthography normalized
Layout-preserved diagrams, tables, illustrations

Italian

42,565 textbooks

As of: Dec 2025

Science, engineering, medicine, law, literature
Layout-preserved tables, diagrams, engravings
Variant mapping data for orthography and typography

Spanish

40,535 textbooks

As of: Dec 2025

Iberian and Latin American coverage
Normalized historic spellings; preserved layout and illustrations
Text-only or multimodal pipelines (text + image + layout)

Discovery surfaces

Hugging Face

Coming soon

GitHub

Coming soon

Who the RKC is for

RKC is designed for teams who need long-form, layout-aware, provenance-traceable textbook-derived data at scale across core subjects like math, science, medicine, and the humanities.

AI labs & model builders

Pretraining and continued pretraining on long-form, richly structured textbooks suited to long-context models, including math-heavy STEM titles, medical reference works, and multilingual corpora.

Research groups

Long-context benchmarks, math and table reasoning, document understanding, and error analysis across mathematics, STEM, languages, and more.

RAG & EdTech teams

Structured pedagogy, clean metadata, and reliable retrieval across core subjects such as math, science, and language learning.

Single-licensor, rights-cleared corpus

A single, straightforward license covers the entire corpus. All underlying rights are owned or controlled by us, so you don’t need separate permissions for individual works.

Definition: “Single-licensor, rights-cleared corpus” means the collection of works for which our organization owns or exclusively controls all rights necessary to grant the license described in the Agreement.

Pricing & Access

Request access or a sample pack

Tell us a little about your use case and we’ll follow up with access details, a sample bundle, or a short call.

Prefer email? Reach us at contact@recoveredcorpus.ai.