While most deep learning engineers and enthusiasts focus on algorithms, they often overlook the hardware they use for training and inference. Ask them why a GPU or TPU is faster than a CPU, and you’ll often hear responses like “GPUs are optimized for convolutions” or “GPUs can run more threads”. While these statements are true, they merely scratch the surface of what goes on underneath. In this post, I try to dig a little deeper into the hardware to explain exactly what is going on. I’m not a hardware pro by any means, but I feel this information is critical for all AI enthusiasts.
- CPUs are meant to be the most flexible piece of hardware, capable of running any software, instruction by instruction.
- CPUs are not designed only to render graphics or multiply tensors; they also need to load databases, run a variety of applications, and handle multiple threads, where each thread executes a different stream of instructions.
- To accomplish this, CPUs read instructions one by one from memory, perform any computation needed, and write the result back into memory.
- GPUs are designed to execute a single instruction simultaneously across a large number of CUDA cores, each working on a different piece of data.
- Though each core runs at a lower clock speed than a CPU core, the sheer number of CUDA cores is enough to crush a CPU on highly parallel tasks like training deep learning models.
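- TPUs push the envelope as far as throughput goes by adding matrix multiply units (MMUs; Google's documentation abbreviates this as MXU), which are physically connected to each other.
- These MMUs are meant to avoid memory access during a chain of tensor product operations. This is accomplished by a systolic array architecture, in which each unit passes its operands and partial results directly to its neighbors instead of writing them back to memory.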
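
To make the systolic idea a bit more concrete, here is a minimal pure-Python sketch of a cycle-by-cycle simulation. It is only an illustration under some loud assumptions: the function name `systolic_matmul` is made up for this post, the grid uses a simplified "output-stationary" layout (each cell keeps one output value and forwards its inputs to its right and lower neighbors), and this is not necessarily the dataflow a real TPU uses. The point is just that every input value is read from memory once and then handed from cell to cell.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an n x m grid of multiply-accumulate cells.

    Hypothetical, simplified model: cell (i, j) accumulates C[i][j].
    Values of A enter from the left edge and move right one cell per cycle;
    values of B enter from the top edge and move down one cell per cycle.
    Returns (C, memory_reads), where memory_reads counts operand fetches.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # running sum held inside each cell
    a_reg = [[0] * m for _ in range(n)]  # A value currently sitting in each cell
    b_reg = [[0] * m for _ in range(n)]  # B value currently sitting in each cell
    memory_reads = 0

    # The last cell (n-1, m-1) sees its final operands at cycle n + m + k - 3.
    for t in range(n + m + k - 2):
        # Pass A values one cell to the right; feed row i with a delay of i cycles.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            if 0 <= s < k:
                a_reg[i][0] = A[i][s]
                memory_reads += 1
            else:
                a_reg[i][0] = 0

        # Pass B values one cell down; feed column j with a delay of j cycles.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            if 0 <= s < k:
                b_reg[0][j] = B[s][j]
                memory_reads += 1
            else:
                b_reg[0][j] = 0

        # Every cell multiplies the two values it currently holds and accumulates.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, memory_reads


if __name__ == "__main__":
    C, reads = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, reads)
```

In this toy model the number of operand fetches is `n*k + k*m`, one per input element, whereas a naive triple loop that fetches both operands for every multiply would perform `2*n*m*k` fetches. That gap is the intuition behind "avoiding memory access", though real hardware has caches, registers, and many other details this sketch ignores.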