
15 hours?

  • Writer: Ramya Namuduri
  • Apr 19, 2021
  • 2 min read

I don't believe I got the memo.

Time is essential. Time is efficiency. Time is half of everything.

When I read the research paper "Attention Is All You Need," I did not believe the authors when they said it took them almost four days to train their model, even though it was running on several GPUs. I ignored that fact, and I'm not entirely sure why. Partly because their situation was different: they had an enormous amount of data, careful training practices, and a bulky transformer, all of which I wanted to avoid.



With transformers, I don't think that is possible. It is in their nature to be bulky: stacks of encoders, each with layers of its own, including multi-head attention, which I can only compare to the Hydra. The similarity ends quickly, though. Both are scary and both have multiple heads, each head with its own neurons and an entanglement of sorts, but that is where the analogy stops. Regardless, after these giant matrices pass through the encoders, they realize they have only made it through half the transformer and sigh deeply, and we realize it has been 15 hours since we began training and we are still on the first epoch, the first run-through of the data. Then we must brave the decoder side of the journey, the more thickety part, where we meet pythons and anacondas (no pun intended). We keep sending words back and forth through the decoder, output and input, future and past, until we lose track of time. Even after all this, we still have not completed the first epoch, because we must grade our results against the answer key and see how we did.
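For concreteness, here is a minimal sketch of that encoder-decoder shape using PyTorch's built-in nn.Transformer. Every size in it (embedding width, number of heads, number of layers) is an assumption chosen to be small, not the configuration from the paper or from my own model.

```python
# Minimal sketch of the encoder-decoder shape described above,
# using PyTorch's built-in nn.Transformer. All sizes are placeholder
# assumptions, far smaller than the paper's model.
import torch
import torch.nn as nn

d_model = 128      # embedding size (assumed; the paper uses 512)
n_heads = 4        # the "Hydra" heads of multi-head attention
n_layers = 2       # encoder layers and decoder layers

model = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=n_layers,
    num_decoder_layers=n_layers,
    batch_first=True,
)

# A fake batch: 8 source and 8 target sequences, each 20 tokens long,
# already embedded into d_model dimensions.
src = torch.rand(8, 20, d_model)
tgt = torch.rand(8, 20, d_model)

# The input passes through every encoder layer, then every decoder layer,
# which is why a single forward pass is already expensive.
out = model(src, tgt)
print(out.shape)  # torch.Size([8, 20, 128])
```

A single forward pass through a toy like this is quick; the 15-hour pain comes from repeating it over thousands of batches, on a CPU, for epoch after epoch.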


The issue is that the first time the model runs, the outputs are essentially random, because the model is untrained and has no idea what to do. What does summarization mean? What does context mean? The model simply shrugs at such questions. Yet even this random traversal through the mess of a model takes forever, and at the end we still have to back-propagate to see where the model went wrong. That is why it is a good time to abort the mission before the CPU cries out, collapses, and crumbles, unlike Atlas.
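To make the "grade against the key, then back-propagate" idea concrete, here is a hedged sketch of a single training step on a toy model. The vocabulary size, loss function, and optimizer are assumptions for illustration, not my actual setup.

```python
# A sketch of one training step: make (initially random) predictions,
# grade them against the answer key, then back-propagate.
# Sizes, loss, and optimizer are illustrative assumptions.
import torch
import torch.nn as nn

d_model, vocab_size = 128, 1000
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_logits = nn.Linear(d_model, vocab_size)   # decoder output -> word scores
loss_fn = nn.CrossEntropyLoss()              # the grading against the key
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(to_logits.parameters()), lr=1e-4)

# Fake batch: 8 already-embedded source/target sequences of 20 tokens,
# plus the correct word id at each target position (the answer key).
src = torch.rand(8, 20, d_model)
tgt = torch.rand(8, 20, d_model)
target_ids = torch.randint(0, vocab_size, (8, 20))

logits = to_logits(model(src, tgt))          # untrained, so essentially random guesses
loss = loss_fn(logits.reshape(-1, vocab_size), target_ids.reshape(-1))

optimizer.zero_grad()
loss.backward()                              # back-propagate to see where it went wrong
optimizer.step()
print(f"loss after one step: {loss.item():.3f}")
```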


So what do I do? First, there is no guarantee I will take four days to train. In fact, I know I won't, because I don't have 8 GPUs; I have none. Second, I have to cut down my training sample. Third, reducing the number of heads in the multi-head attention will certainly help. I learned that the more neurons a neural network has, the more it emulates our brain, and that is also why a GPU is so efficient for this work; without one, reducing the number of neurons reduces training time. Or I could perhaps invest in a GPU for a brief stretch, just for as long as my model needs, if training ends up taking an exorbitant amount of time and effort.
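Here is a rough sketch of what those ideas might look like in PyTorch: train on a slice of the data, shrink the model to fewer heads and layers, and use a GPU only if one is actually available. Every number in it is a made-up placeholder, not my final configuration.

```python
# A sketch of the cost-cutting ideas: use a GPU only if one exists,
# shrink the training sample, and shrink the model (fewer heads, fewer layers).
# All numbers are assumptions for illustration.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Train on a slice of the data instead of all of it
#    (a fake dataset of 10,000 already-embedded sequences here).
full_dataset = torch.rand(10_000, 20, 64)
small_dataset = full_dataset[:1_000]         # keep 10% of it

# 2. A slimmer transformer: fewer heads and layers than the paper's
#    8 heads and 6 layers per side.
small_model = nn.Transformer(
    d_model=64,
    nhead=2,
    num_encoder_layers=2,
    num_decoder_layers=2,
    batch_first=True,
).to(device)

print(f"training on {device} with "
      f"{sum(p.numel() for p in small_model.parameters()):,} parameters")
```

Even a sketch like this makes it easy to time a single epoch before committing to the full run, which is how I plan to decide whether renting a GPU is worth it.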

 
 
 
