It's been a few years, but around 2022/23 (I think) I spent about 6 months on a project where I downloaded many GB of data from Common Crawl (https://commoncrawl.org/latest-crawl) and did some machine learning analysis on it. At first I was going to work on the **content** of the HTML files, until I realized that would require serious big-data infrastructure, so I ended up working only on the **URLs** and never got into the HTML content at all. It was pretty interesting. I was able to train a binary text classification model, using a model like one of these (https://huggingface.co/models?pipeline_tag=text-classification&sort=trending), and it worked very well for my purposes.

As I remember, I looked at various options and ended up doing a lot of the data crunching with Athena (https://aws.amazon.com/athena/). I think I spent something like $1500. I vaguely remember looking at Hadoop and deciding it was too complicated for the project requirements. Any way you swing it, you need to get your data into a flat tabular format that a query engine can read, and then store the files somewhere. I used CSVs, although strictly speaking CSV is row-based; truly "columnar" formats are things like Parquet, which Athena also reads and which cost less to query because Athena bills by the amount of data scanned. In my case the files went on S3, because that is where Athena reads its data from.
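To make the "URLs only, flattened into CSV" step concrete, here is a minimal sketch using only the Python standard library. It is not the code from the project: the column names (`host`, `path_depth`, `tokens`) and the helpers `url_to_row` and `urls_to_csv` are hypothetical, chosen just to show one way a URL could be turned into a flat row before uploading the file to S3.

```python
import csv
import io
import re
from urllib.parse import urlsplit

# Hypothetical column names; the original post does not say which fields were used.
FIELDS = ["url", "host", "path_depth", "tokens"]


def url_to_row(url: str) -> dict:
    """Turn one URL into a flat row of simple, classifier-friendly fields."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Split path and query on any non-alphanumeric character to get word-like tokens.
    raw = parts.path + " " + parts.query
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", raw) if t]
    # Count the non-empty path segments, e.g. /blog/2023/post has depth 3.
    path_depth = len([seg for seg in parts.path.split("/") if seg])
    return {
        "url": url,
        "host": host,
        "path_depth": path_depth,
        "tokens": " ".join(tokens),
    }


def urls_to_csv(urls) -> str:
    """Serialize rows as CSV text; in practice this would be written to a file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for url in urls:
        writer.writerow(url_to_row(url))
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        "https://example.com/blog/2023/my-first-post?ref=home",
        "http://shop.example.org/cart",
    ]
    print(urls_to_csv(sample))
```

The `tokens` column is the kind of plain text a text classification model could be trained on, while `host` and `path_depth` are the sort of fields that are easy to filter or group by in SQL.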

Rizful.com on Nostr