Running t5x on macOS

Background

If you work with language models, you cannot avoid 🤗 transformers and the other libraries from that ecosystem. In my experience, those libraries are top quality, they provide an invaluable service to the community, and the whole 🤗 story is a great example of how open source should be done. Also, if you are into reading source code and want to get better with complex Python codebases, the transformers GitHub repo is the place to learn best practices.

Yet 🤗 transformers is not the only library in the space, and some others might be a better fit for more specialized cases. Prior to 🤗 transformers, you could stick to t2t, which seems to have been succeeded by trax these days; fairseq was also used to produce many papers and models. Several modern high-performing LMs, such as the flan-t5 family and ul2, were pretrained with t5x.

A weekend of ops'ing (a rant)

A wrap-up of my experience over the past two days: I tried to automate the deployment of Kubeflow onto a self-built Kubernetes cluster with Ansible. It was a 5/7 experience, with partial success so far.

I generally like the idea of Kubeflow, which, in contrast to SageMaker, lets me work and test small things out without being constantly connected; that feels great on a train or a plane. I wanted to play with it more extensively so that I could not only share my data-science experience, but also understand whether deploying it is easy enough to be handled by a product team without burdening dedicated in-company platform people.

Good pieces

Ansible has gotten much better since the last time I went through its docs to get a general feeling for how things work. The modules have become more idiomatic, and there is much less need to plug in raw commands. Some interfaces have become nicer (passing a list of packages to apt makes more sense than looping with with_items). Asserts seemed like a great idea, but after some time I found them more distracting than helpful. Inventory plugins are really cool: no more need for separate magic scripts to get the list of hosts. Yet if you’re not an AWS or GCP user, you might be unpleasantly surprised by their quality. The Scaleway plugin works, but it feels unpolished and needs experimentation to make things go as expected. In the end I managed to create a flow of playbooks that creates a set of machines and then provisions them, which looks like one step away from pretty nice scaling automation.

Getting Kubernetes to the point where it could run some pods was easy: kubeadm does all the work for you. I felt it was the only easy part of Kubernetes :)

Challenging pieces

I understand that most of this comes down to me not reading enough about some specific technology/library/component.

The Kubernetes documentation is visually nice. Content-wise, it needs a lot of improvement and more thorough explanations. I don’t like the parts where you are supposed to take some command and run it just because you were told to. Because of that I had to go through pieces of the cgroups documentation, which was not bad, since installing Kubernetes implies good knowledge of the Linux ecosystem. The part about network add-ons, however, is mind-breaking: you’re given 6 alternatives without a proper decision strategy for choosing one over the others, and next to it a link to 10+ more, with a one-line description for each. Yet going with this magic made things work.

Main assumptions

Validity
Additivity
Linearity
Independence and equal variance of errors (we’re talking about OLS)
Normality of errors (might not be present, but their distribution has to be understood)
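
A minimal sketch of how the last two assumptions could be eyeballed on the residuals of a one-predictor OLS fit. Everything here is an assumption made for illustration: the synthetic data, the helper names (fit_simple_ols, pearson, skewness) and the choice of diagnostics (correlation between fitted values and absolute residuals as a crude equal-variance check, skewness as a crude normality check). It uses only the Python standard library and is not a replacement for proper diagnostic plots or tests.

```python
import random
from statistics import fmean


def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    a_mean, b_mean = fmean(a), fmean(b)
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    var_a = sum((ai - a_mean) ** 2 for ai in a)
    var_b = sum((bi - b_mean) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def skewness(values):
    """Sample skewness: third central moment over the variance to the power 1.5."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Synthetic data (an assumption): a linear signal plus Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

intercept, slope = fit_simple_ols(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Equal variance: the size of the residuals should not trend with the fitted values.
spread_trend = pearson(fitted, [abs(r) for r in residuals])

# Normality: a symmetric error distribution has skewness close to zero.
residual_skew = skewness(residuals)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"spread trend={spread_trend:.3f} residual skewness={residual_skew:.3f}")
```

A spread trend far from zero hints at unequal error variance, and a large skewness hints at non-normal errors; in both cases the next step is to look at the residuals directly rather than trust a single number.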

Checklist before modeling

Distribution of the target variable
Distributions and correlations between exogenous variables
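
The two checklist items could be sketched roughly as follows. The column names (area, rooms, age), the synthetic data and the helper names (summarize, pearson, correlation_matrix) are all hypothetical and chosen only for illustration; the sketch sticks to the Python standard library.

```python
import random
from statistics import fmean, median, stdev


def summarize(values):
    """A few numbers that describe the shape of a distribution."""
    return {
        "mean": fmean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    a_mean, b_mean = fmean(a), fmean(b)
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    var_a = sum((ai - a_mean) ** 2 for ai in a)
    var_b = sum((bi - b_mean) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def correlation_matrix(columns):
    """Pairwise correlations for a dict that maps column name to values."""
    names = list(columns)
    return {(p, q): pearson(columns[p], columns[q]) for p in names for q in names}


# Synthetic data (an assumption): rooms is built from area, age is independent.
rng = random.Random(1)
area = [rng.uniform(30, 150) for _ in range(500)]
rooms = [a / 30 + rng.gauss(0, 0.3) for a in area]
age = [rng.uniform(0, 50) for _ in range(500)]
target = [rng.lognormvariate(0, 1) for _ in range(500)]  # right-skewed on purpose

# Distribution of the target variable: a mean well above the median
# suggests a right-skewed target.
target_summary = summarize(target)

# Correlations between exogenous variables: pairs close to +1 or -1
# carry nearly the same information.
corr = correlation_matrix({"area": area, "rooms": rooms, "age": age})

print(target_summary)
print(f"area~rooms: {corr[('area', 'rooms')]:.2f}, area~age: {corr[('area', 'age')]:.2f}")
```

In this made-up data, area and rooms are strongly correlated while age is not related to either, which is the kind of pattern worth knowing about before fitting a linear model.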

Alexey Kuntsevich