Structured textbook-derived data

Recovered Knowledge Corpus

The Recovered Knowledge Corpus (RKC) turns textbook-derived sources into clean, structure-rich, machine-readable pages—preserving layout, equations, tables, figures, and metadata for AI training and evaluation across mathematics, STEM, languages, and the humanities.

RKC is built on the MonkAI restoration pipeline and a long-running commercial catalog of restored titles, and is designed for long-context model training and evaluation.

Catalog
3.1M textbooks
1.3M digitized · 1.8M queued
Pages
~900M pages
Long-form books, not snippets
Languages
6 languages
EN, DE, FR, ES, PT, IT
Rights
Rights-cleared
Single-licensor · audit trails

What the RKC is

RKC is a large-scale corpus produced by the MonkAI restoration pipeline. Textbook-derived sources are converted into structured digital pages with reading order, equations, tables, figures, captions, and footnotes retained — ready for pretraining, long-context reasoning, layout-aware tasks, and document AI evaluation. The catalog spans core subjects such as mathematics and physics, engineering and computer science, medicine and life sciences, and the humanities and social sciences.

Pricing & Scale

Long-form quality, without the licensing friction

RKC delivers book-length, high-fidelity training data with structured exports designed for modern training and evaluation workflows.

Request Pricing / Access

RKC process examples

These examples compare a pre-digitized or baseline-OCR page with its RKC reconstruction. The reconstruction yields visually faithful, selectable text and figures, plus structured regions that keep each figure with its caption, ready for subject-specific training and evaluation.

Mathematics
Mathematics: symbol-level accuracy where OCR fails

The first panel is a page processed with a baseline OCR engine; the red marks indicate symbol, index, and operator errors that change the meaning. The second panel is the RKC reconstruction: selectable text that matches the original layout closely enough to use as ground truth.

First panel · baseline OCR output
Second panel · RKC reconstruction
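
One way to quantify the symbol-level differences marked in the first panel is a character-level error rate measured against a trusted reference string. The sketch below is illustrative only and is not part of any RKC tooling: the function names and the sample strings are assumptions, and the metric is a plain Levenshtein edit distance divided by the reference length.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i characters of ref
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def symbol_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical sample: an index misread (i -> 1) and an operator misread (= -> -).
reference = "x_i^2 + y_j = 0"
baseline = "x_1^2 + y_j - 0"

errors = edit_distance(reference, baseline)      # 2 substitutions
rate = symbol_error_rate(reference, baseline)    # 2 errors over 15 characters
print(errors, round(rate, 3))
```

A low rate can still hide meaning-changing errors, since a single wrong index or operator alters the formula; a per-symbol diff is often more informative than the aggregate number.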
Complex figures & captions
Figures: images and captions kept together

Complicated page layouts often mix multiple figures, labels, and captions in a single block of print. The first panel shows the pre-digitized page; the second panel shows RKC structured regions, grouping each figure with its attached caption so they stay linked for training, retrieval, and evaluation.

First panel · pre-digitized
Second panel · RKC structured regions
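
The figure-caption grouping described above can be sketched as a nearest-neighbor assignment over bounding boxes. This is a minimal illustration under stated assumptions, not the RKC method or export format: the `Region` and `FigureGroup` types, the `"figure"` and `"caption"` labels, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Region:
    kind: str                                  # "figure" or "caption" (hypothetical labels)
    bbox: tuple[float, float, float, float]    # (x0, y0, x1, y1) in page coordinates
    text: str = ""

    def center(self) -> tuple[float, float]:
        x0, y0, x1, y1 = self.bbox
        return ((x0 + x1) / 2, (y0 + y1) / 2)


@dataclass
class FigureGroup:
    figure: Region
    captions: list[Region] = field(default_factory=list)


def group_captions(regions: list[Region]) -> list[FigureGroup]:
    """Attach each caption to the figure whose center is closest."""
    groups = [FigureGroup(r) for r in regions if r.kind == "figure"]
    if not groups:
        return groups
    for cap in (r for r in regions if r.kind == "caption"):
        cx, cy = cap.center()

        def distance(group: FigureGroup) -> float:
            fx, fy = group.figure.center()
            return hypot(fx - cx, fy - cy)

        min(groups, key=distance).captions.append(cap)
    return groups


# Hypothetical page: two figures side by side, each with a caption underneath.
page = [
    Region("figure", (0, 0, 100, 100)),
    Region("figure", (120, 0, 220, 100)),
    Region("caption", (0, 105, 100, 120), "Fig. 1"),
    Region("caption", (120, 105, 220, 120), "Fig. 2"),
]
groups = group_captions(page)
for g in groups:
    print(g.figure.bbox, [c.text for c in g.captions])
```

Center-to-center distance is the simplest possible heuristic; dense layouts with captions beside or above figures would likely need reading-order or edge-distance cues as well.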

Dataset coverage

RKC datasets are delivered as page-faithful textbook-derived data with consistent metadata and optional multimodal structure (text + image + layout) for training, retrieval, and evaluation.

Mathematics
17,917 textbooks
As of: Dec 2025
  • Arithmetic through graduate mathematics
  • Formulas, symbols, diagrams, tables, equation plates
  • Step-by-step reasoning traces (problem → method → solution)
Medicine
66,185 textbooks
As of: Dec 2025
  • General medicine, anatomy, physiology, pathology, pharmacology
  • Plates, charts, tables, diagnostic layouts
  • Clinical reasoning traces (symptom → diagnosis → treatment)
Science, Technology & Engineering
127,655 textbooks
As of: Dec 2025
  • Physics, chemistry, astronomy, engineering, applied sciences
  • Equations, schematics, graphs, diagrams, tables
  • Worked derivations and procedures (hypothesis → method → result)
Ancient Greek & Latin
30,000+ textbooks
As of: Dec 2025
  • Polytonic diacritics, scholia, apparatus criticus
  • Layout-preserved maps, plates, diagrams and tables
  • Citation-grounded retrieval (canonical references)
French
177,709 textbooks
As of: Dec 2025
  • STEM, law, medicine, philosophy, literature, arts
  • Layout-preserved tables, diagrams, illustrations
  • Normalized orthography for contemporary processing
German
183,412 textbooks
As of: Dec 2025
  • STEM, law, history, theology, literature, philosophy
  • Fraktur/Gothic and early orthography normalized
  • Layout-preserved diagrams, tables, illustrations
Italian
42,565 textbooks
As of: Dec 2025
  • Science, engineering, medicine, law, literature
  • Layout-preserved tables, diagrams, engravings
  • Variant mapping data for orthography and typography
Spanish
40,535 textbooks
As of: Dec 2025
  • Iberian and Latin American coverage
  • Normalized historic spellings; preserved layout and illustrations
  • Text-only or multimodal pipelines (text + image + layout)
Discovery surfaces
Hugging Face
Coming soon
GitHub
Coming soon

Who the RKC is for

RKC is designed for teams who need long-form, layout-aware, provenance-traceable textbook-derived data at scale across core subjects like math, science, medicine, and the humanities.

AI labs & model builders

Pretraining and continued pretraining on long-context, richly structured textbooks, including math-heavy STEM titles, medical reference works, and multilingual corpora.

Research groups

Long-context benchmarks, math and table reasoning, document understanding, and error analysis across mathematics, STEM, and languages.

RAG & EdTech teams

Structured pedagogy, clean metadata, and reliable retrieval across core subjects such as math, science, and language learning.

Single-licensor, rights-cleared corpus

A single, straightforward license covers the entire corpus. All underlying rights are owned or controlled by us, so you don’t need separate permissions for individual works.

Definition: “Single-licensor, rights-cleared corpus” means the collection of works for which our organization owns or exclusively controls all rights necessary to grant the license described in the Agreement.

Pricing & Access

Request access or a sample pack

Tell us a little about your use case and we’ll follow up with access details, a sample bundle, or a short call.

Prefer email? Reach us at contact@recoveredcorpus.ai.