TPC-AI/warc_processing

warc_processing

A pipeline for pre-processing WARC files from Common Crawl.

The initial inspiration came from the paper "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only" (https://arxiv.org/pdf/2306.01116.pdf).
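As a rough illustration of the first step in pre-processing a WARC file, the sketch below iterates over records using only the Python standard library. It is not taken from this repository: the names `iter_warc_records` and `open_warc` are hypothetical, and the parser assumes well-formed records with a `Content-Length` header and no folded header lines.

```python
import gzip
import io
from typing import IO, Dict, Iterator, Tuple


def iter_warc_records(stream: IO[bytes]) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in a decompressed WARC stream.

    Hypothetical helper for illustration; assumes every record carries a
    Content-Length header spelled exactly that way.
    """
    while True:
        # Skip the blank lines that separate records; stop at end of stream.
        line = stream.readline()
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines up to the empty line.
        headers: Dict[str, str] = {}
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break
            name, _, value = raw.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()

        # Content-Length gives the exact size of the record's content block.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def open_warc(path: str) -> IO[bytes]:
    """Open a .warc or .warc.gz file (hypothetical helper).

    gzip.open reads concatenated gzip members as one continuous stream.
    """
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


# In-memory sample with two identical records, so no download is needed.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(sample * 2)))
```

For a real crawl segment, `iter_warc_records(open_warc(path))` would be the entry point. A maintained library such as `warcio` is likely a better choice in practice, since it handles edge cases this sketch ignores.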
