Full-Stack Thinking of a Systems Researcher

Here are some thoughts after reading material related to my new project, index traversal with programmable NICs and RDMA, and to some other projects, including TVM.

A systems researcher should be able to think at all levels of a computer system, and innovative designs often come from combining new demands and new technologies with existing infrastructure. For example, the recent progress in machine learning (deep learning), along with advances in FPGAs and RDMA, has created a lot of interesting opportunities for new system designs.

To build a system from scratch, from a computer engineer’s view, the lowest level is the hardware architecture. (Does computer organization have some role to play here?) It does not necessarily have to be a CPU: people have designed specialized hardware such as the TPU (Tensor Processing Unit, by Google), and FPGAs have been widely explored as a high-performance replacement for the CPU on various tasks. In the past decades people have also used GPUs heavily to accelerate tasks.

The level above architecture is the operating system (and networks?). What new operating system designs take these new applications and technologies into account? It might be interesting to look at the research from Prof. Timothy Roscoe’s group.

One level up is the compiler. This is the level that TVM targets. It is interesting to know how a compiler can be designed for compiling ML programs, and I will add more after reading the paper. Last semester I took a course, System Construction, which introduced Oberon, a co-design of OS and compiler. This is also an interesting direction to explore in this new era.

Then comes the software framework. A lot of research has been done here, such as TensorFlow, which follows the dataflow model if I understand correctly. Much can be done at this level by taking into account the features of both the upper-level applications and the underlying infrastructure.
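
To make the dataflow idea a bit more concrete, here is a minimal sketch in Python. It is not TensorFlow's API; the `Node` class and the `evaluate` function are hypothetical names made up for illustration. The point is only that a computation is described as a graph of operations, and a node can be computed once the nodes it depends on have produced their values.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One operation in the graph; `inputs` names the nodes it depends on."""
    name: str
    op: Callable[..., float]
    inputs: List[str] = field(default_factory=list)


def evaluate(graph: Dict[str, Node], target: str,
             cache: Optional[Dict[str, float]] = None) -> float:
    """Compute `target` by first computing every node it depends on.

    Results are cached so that a node shared by several consumers
    is computed only once.
    """
    if cache is None:
        cache = {}
    if target not in cache:
        node = graph[target]
        args = [evaluate(graph, dep, cache) for dep in node.inputs]
        cache[target] = node.op(*args)
    return cache[target]


# Hypothetical example graph for y = (a + b) * b
graph = {
    "a": Node("a", lambda: 2.0),
    "b": Node("b", lambda: 3.0),
    "sum": Node("sum", lambda x, y: x + y, ["a", "b"]),
    "y": Node("y", lambda s, b: s * b, ["sum", "b"]),
}

result = evaluate(graph, "y")
print(result)
```

Because the whole computation is visible as a graph before it runs, a framework is free to decide where and in what order the nodes execute, which is presumably where the lower levels of the stack come in.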

With so many new application areas, not only machine learning but also others such as IoT and mobile computing, it is worth thinking about the opportunities at each level of the stack to improve the performance, availability, accessibility, reliability and security of these applications.

Hopefully this post will grow into a series, with each post surveying the work done at one level of the stack for various emerging workloads.