The first panel is a page processed with a baseline OCR engine; the red marks indicate symbol, index, and operator errors that change the meaning. The second panel is the RKC reconstruction: selectable text that matches the original layout closely enough to use as ground truth.
Recovered Knowledge Corpus
The Recovered Knowledge Corpus (RKC) turns textbook-derived sources into clean, structure-rich, machine-readable pages—preserving layout, equations, tables, figures, and metadata for AI training and evaluation across mathematics, STEM, languages, and the humanities.
RKC is built on the MonkAI restoration pipeline and a long-running commercial catalog of restored titles, and is designed for long-context model training and evaluation.
What the RKC is
RKC is a large-scale corpus produced by the MonkAI restoration pipeline. Textbook-derived sources are converted into structured digital pages with reading order, equations, tables, figures, captions, and footnotes retained — ready for pretraining, long-context reasoning, layout-aware tasks, and document AI evaluation. The catalog spans core subjects such as mathematics and physics, engineering and computer science, medicine and life sciences, and the humanities and social sciences.
Long-form quality, without the licensing friction
RKC delivers book-length, high-fidelity training data with structured exports designed for modern training and evaluation workflows.
RKC process examples
Each example compares a pre-digitized or baseline-OCR page with its RKC reconstruction. The reconstruction yields visually faithful, selectable text and figures, with structured regions that keep figures and captions together, ready for subject-specific training and evaluation.
Complicated page layouts often mix multiple figures, labels, and captions in a single block of print. The first panel shows the pre-digitized page; the second panel shows RKC structured regions, grouping each figure with its attached caption so they stay linked for training, retrieval, and evaluation.
Dataset coverage
RKC datasets are delivered as page-faithful textbook-derived data with consistent metadata and optional multimodal structure (text + image + layout) for training, retrieval, and evaluation.
- Arithmetic through graduate mathematics
- Formulas, symbols, diagrams, tables, equation plates
- Step-by-step reasoning traces (problem → method → solution)
- General medicine, anatomy, physiology, pathology, pharmacology
- Plates, charts, tables, diagnostic layouts
- Clinical reasoning traces (symptom → diagnosis → treatment)
- Physics, chemistry, astronomy, engineering, applied sciences
- Equations, schematics, graphs, diagrams, tables
- Worked derivations and procedures (hypothesis → method → result)
- Polytonic diacritics, scholia, apparatus criticus
- Layout-preserved maps, plates, diagrams and tables
- Citation-grounded retrieval (canonical references)
- STEM, law, medicine, philosophy, literature, arts
- Layout-preserved tables, diagrams, illustrations
- Normalized orthography for contemporary processing
- STEM, law, history, theology, literature, philosophy
- Fraktur/Gothic and early orthography normalized
- Layout-preserved diagrams, tables, illustrations
- Science, engineering, medicine, law, literature
- Layout-preserved tables, diagrams, engravings
- Variant mapping data for orthography and typography
- Iberian and Latin American coverage
- Normalized historic spellings; preserved layout and illustrations
- Text-only or multimodal pipelines (text + image + layout)
Who the RKC is for
RKC is designed for teams who need long-form, layout-aware, provenance-traceable textbook-derived data at scale across core subjects like math, science, medicine, and the humanities.
AI labs & model builders
Pretraining and continued pretraining on long-context, richly structured textbooks, including math-heavy STEM titles, medical reference works, and multilingual corpora.
Research groups
Long-context benchmarks, math and table reasoning, document understanding, and error analysis across mathematics, STEM, languages, and more.
RAG & EdTech teams
Structured pedagogy, clean metadata, and reliable retrieval across core subjects such as math, science, and language learning.
Single-licensor, rights-cleared corpus
A single, straightforward license covers the entire corpus. All underlying rights are owned or controlled by us, so you don’t need separate permissions for individual works.
Definition: “Single-licensor, rights-cleared corpus” means the collection of works for which our organization owns or exclusively controls all rights necessary to grant the license described in the Agreement.
Pricing & Access
Request access or a sample pack
Tell us a little about your use case and we’ll follow up with access details, a sample bundle, or a short call.
Prefer email? Reach us at contact@recoveredcorpus.ai.