Scope:
- Define schemas and data formats for supervised, reward, and rationale datasets.
- Build ingestion and normalization scripts for provided raw data.
- Set up a lightweight labeling or enrichment interface (e.g. Label Studio, Streamlit).
- Deliver documentation and simple QA tools for deduplication, sampling, and validation.
You’ll have:
- Access to domain experts and developer support.
- Clean data extracts (no data collection required).
- Flexible, outcome-based work (remote within the EU).
Ideal background:
- Proven experience with LLM dataset design (SFT, RLHF, or analytical corpora).
- Strong Python (pandas, pyarrow) and Hugging Face Datasets skills.
- Familiarity with labeling tools and dataset documentation best practices.
Deliverables: schema pack, ingestion pipeline, labeling prototype, documentation, and QA toolkit.
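To give a rough sense of scale, one schema-pack entry plus a simple QA check could look something like the sketch below. Everything in it is an assumption made for illustration: the `SupervisedRecord` fields, the `validate` rules, and the exact-match deduplication in `deduplicate` are hypothetical and not a prescribed design. It uses only the Python standard library.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SupervisedRecord:
    """Hypothetical schema for one supervised example (field names are assumptions)."""
    record_id: str
    prompt: str
    response: str
    source: str


def validate(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []
    for name, value in asdict(record).items():
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{name} must be a non-empty string")
    return errors


def _normalize(text):
    # Lowercase and collapse runs of whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())


def content_key(record):
    """Hash of the normalized prompt and response, used to spot exact duplicates."""
    joined = _normalize(record.prompt) + "\x1f" + _normalize(record.response)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each content key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = content_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

This only catches duplicates that differ in case or whitespace. Near-duplicate detection, sampling, and richer validation would be part of the actual QA toolkit.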