
Optimize dataset preprocessing before model training/evaluation

Alexander Chueshev requested to merge optimize-dataset-preprocessing into main

This MR optimizes the dataset preprocessing before model training/evaluation by applying the following changes:

  • Use HF map operations to preprocess the dataset on rank 0 only and cache the result; the other ranks then load the preprocessed dataset from that cache.
  • Support both full preprocessing and streaming. Streaming defaults to True, because caching the fully preprocessed dataset requires at least 4 TB of storage.
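
As a rough illustration of the two modes above (not the code in this MR), the sketch below shows one way to combine a "rank 0 first" barrier with the Hugging Face `datasets` cache, plus a lazy streaming path. The names `rank_zero_first`, `preprocess_batch`, and `load_preprocessed`, the JSON input format, and the `text` column are all assumptions made for the example; the `barrier` argument is expected to be something like `torch.distributed.barrier`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def rank_zero_first(rank: int, barrier: Callable[[], None]) -> Iterator[None]:
    """Run the body on rank 0 first; other ranks wait, then run it themselves."""
    if rank != 0:
        barrier()  # non-zero ranks block here until rank 0 reaches its barrier
    try:
        yield
    finally:
        if rank == 0:
            barrier()  # rank 0 is done (cache written): release the other ranks


def preprocess_batch(batch: dict) -> dict:
    """Hypothetical preprocessing step: add the character count of each text."""
    return {"n_chars": [len(text) for text in batch["text"]]}


def load_preprocessed(path: str, rank: int, barrier: Callable[[], None],
                      streaming: bool = True):
    """Return a preprocessed dataset, either streamed or fully cached."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    if streaming:
        # Streaming mode: map() is applied lazily while iterating,
        # so no preprocessed copy is written to disk.
        ds = load_dataset("json", data_files=path, split="train", streaming=True)
        return ds.map(preprocess_batch, batched=True)

    # Full mode: rank 0 runs map() and writes the cache files; the other
    # ranks run the same call afterwards and are expected to hit the cache.
    with rank_zero_first(rank, barrier):
        ds = load_dataset("json", data_files=path, split="train")
        return ds.map(preprocess_batch, batched=True, load_from_cache_file=True)
```

In a distributed job this might be called as `load_preprocessed("data.jsonl", dist.get_rank(), dist.barrier, streaming=False)` after the process group is initialized. The cache is only reused if every rank sees the same cache directory and applies an identical map function, which is an assumption of this sketch.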

Ref: ai-assist#22 (closed)

Edited by Alexander Chueshev
