Dataset Method

RMR vs TQT

RMR and TQT stay separate because they serve different jobs: human-readable review on one side, metric-clean text on the other.

RMR

Review-friendly

Preserves structure for human review, provenance checks, and source-facing QA.

TQT

Measurement-clean

Normalizes text for statistical work so metric behavior is not driven by formatting artifacts.

Separation

Review vs stats

Keeps public claims tied to the right layer of the pipeline.

Evidence Frame

RMR

RMR is the review-friendly transform.

It preserves more readable structure so a reviewer can inspect flow, segmentation, and preparation choices inside a controlled environment.

RMR is not a public release of source material. It is the layer that helps explain how material moved through preparation without turning the site into a corpus dump.

TQT

TQT is the measurement-friendly transform.

It supports metric consistency, tokenizer analysis, and statistical comparison by reducing punctuation interference.

TQT makes the corpus easier to compare across tokenizers, slice sizes, and metric families by reducing formatting noise before quantitative analysis.
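As a rough illustration of what "reducing formatting noise" can look like, here is a minimal Python sketch. The function name `tqt_like_normalize` and every rule in it (Unicode folding, punctuation removal, whitespace collapsing, lowercasing) are assumptions made for this example. This page does not publish the project's actual TQT rules, and the sample string is placeholder text, not corpus material.

```python
import re
import unicodedata


def tqt_like_normalize(text: str) -> str:
    """Hypothetical measurement-style normalization (not the real TQT rules)."""
    # Fold compatibility characters (full-width forms, ligatures) to a common form.
    text = unicodedata.normalize("NFKC", text)
    # Replace every Unicode punctuation character with a space so that
    # punctuation cannot attach itself to neighbouring words.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse runs of whitespace (including line breaks) and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


if __name__ == "__main__":
    sample = "Hello,   world!\n\nNew line..."
    print(tqt_like_normalize(sample))  # hello world new line
```

A side effect of this particular rule set is that contractions split ("don't" becomes "don t"). Whether that is acceptable depends on the metric, which is one reason a measurement transform is kept apart from the review-facing one.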

Why both exist

RMR and TQT answer different questions.

RMR helps preserve review readability.

TQT helps stabilize quantitative analysis.

Together they let PRM be reviewed as both writing and measurable language data without merging those jobs into one unstable layer.

How they fit the pipeline

The pipeline uses transforms to keep review and measurement separate. RMR supports controlled human review. TQT supports repeatable metric work.
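The separation can be pictured as two derived views built from the same source, each read by a different consumer. The sketch below is a hypothetical illustration only: `TransformViews`, `rmr_like`, `tqt_like`, and `build_views` are names invented for this example, and the rules inside them are stand-ins, not the project's transforms.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformViews:
    rmr: str  # review-facing view: structure kept for human reading
    tqt: str  # measurement-facing view: flattened for metric work


def rmr_like(source: str) -> str:
    """Assumed review transform: tidy whitespace but keep paragraph breaks."""
    lines = [line.rstrip() for line in source.replace("\r\n", "\n").split("\n")]
    # Collapse three or more consecutive newlines into one blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def tqt_like(source: str) -> str:
    """Assumed measurement transform: drop punctuation and all line structure."""
    return " ".join(re.sub(r"[^\w\s]", " ", source).split())


def build_views(source: str) -> TransformViews:
    # Both views derive from the same input; neither is built from the other.
    return TransformViews(rmr=rmr_like(source), tqt=tqt_like(source))


if __name__ == "__main__":
    placeholder = "First line.  \r\nSecond line.\r\n\r\n\r\n\r\nNew paragraph.\n"
    views = build_views(placeholder)
    print(repr(views.rmr))
    print(repr(views.tqt))
```

Deriving both views directly from the source, instead of chaining one through the other, means a change to review formatting cannot silently shift metric inputs.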

That separation matters because a public-safe website cannot expose protected source text, private manifests, or source-level mappings. The transform layer lets the project describe the process while raw material remains inside the controlled review boundary.

What the comparison does not mean

RMR and TQT are not competing claims about which version is “real.” They are different working views of the same underlying material.

The public page does not ask readers to reconstruct the corpus from either transform. It explains why two transform layers exist and how they support reviewability, measurement consistency, and public-safe reporting.

How to read the charts

Start with deep metric coverage to see why transform discipline matters. The chart shows that RMR and TQT are not cosmetic versions of the text; they sit upstream of a large measurement surface involving workbooks, metric families, tokenizers, segments, and slice sizes.

Tokenizer robustness is the clearest TQT-facing check: it asks whether normalized measurement text still produces stable results across tokenizer systems. Raw trends, the metric-slice heatmap, and the 65536 macro lens then show whether the transform choices support large-scale aggregate behavior instead of only small-excerpt effects.
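A toy version of a tokenizer robustness check is sketched below, under stated assumptions: the two tokenizers are simple stand-ins for real tokenizer systems, type-token ratio stands in for whatever metric families the project uses, and `metric_spread` is a hypothetical helper. The strings are placeholder text.

```python
import re
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokens(text: str) -> List[str]:
    # Stand-in tokenizer 1: split on whitespace only.
    return text.split()


def word_punct_tokens(text: str) -> List[str]:
    # Stand-in tokenizer 2: words and punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def type_token_ratio(tokens: List[str]) -> float:
    # Illustrative metric: distinct tokens divided by total tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def metric_spread(text: str, tokenizers: Dict[str, Tokenizer]) -> float:
    """Max minus min of the metric across tokenizers; smaller is more stable."""
    values = [type_token_ratio(tokenize(text)) for tokenize in tokenizers.values()]
    return max(values) - min(values)


if __name__ == "__main__":
    tokenizers = {"whitespace": whitespace_tokens, "word_punct": word_punct_tokens}
    raw = "the cat, the cat. the cat!"
    normalized = "the cat the cat the cat"
    print(metric_spread(raw, tokenizers))         # nonzero: punctuation splits the tokenizers
    print(metric_spread(normalized, tokenizers))  # 0.0: both tokenizers agree
```

In this toy case the punctuated text gives a different ratio under each tokenizer, while the normalized text gives the same value under both. That is the kind of stability the robustness chart is meant to examine at scale.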

Public-safe limits

This page describes transform roles and aggregate outputs. It does not publish raw writing, protected excerpts, private transform files, source-level mappings, third-party source labels, artist names, album titles, or song titles.

Public-safe boundary

Public pages show aggregate evidence, metric behavior, method provenance, and corpus structure. Protected text, identities, source titles, and reconstructable mappings stay private.