Preliminary thoughts on Big Data System Papers

Recently, I have been reading some well-known big data system papers whose systems have been in industry use for quite some time. Here is an (incomplete) list of them:

Spark RDD:
Delayed Scheduling:
Spark SQL:
Spark Streaming:

Reading them made me wonder what exactly constitutes influence for pure systems papers. Specifically, how are they different from the work a Software Engineer produces on the job?

First of all, it is worth noting that some of these papers were published by companies such as Yahoo and Databricks. Of those that were not, a significant share evaluate the performance of big data systems in real-world deployments at companies such as Facebook and Yahoo, and many were published after years of industry deployment, perhaps as proof of improvement.

So how are they different from software engineering? Judging purely from the contents of the papers, I don’t think there is much difference. However, research is about proposing ideas, not the engineering itself. Yet those ideas are mostly inspired by, and more importantly proven useful in, industry applications, which in turn requires strong collaboration between industry and academia. Therefore, if I were to pursue a PhD or a research-oriented role in these areas, there would need to be a great deal of collaboration between my school and industry, which unfortunately is usually dominated by the top schools 🙁

I will shift my focus to networking papers soon, but I will keep adding to the list above as I read more.
