We have finally finished the draft of our paper for OSDI 2020, a presumably top conference, and this is the first paper in which I have been extensively involved, from idea and implementation to evaluation and writing! While hoping that it gets accepted, I just want to save some notes here on what I have learned from this experience.
Presentation is perhaps the most important part of a paper. At the beginning, I was in fact a bit worried that the technical depth of the paper might not be sufficient and that the central idea might seem a bit simplistic. However, by emphasizing the novelty of the use cases and the extensive evaluation of the improvements, we were able to show the audience that we had done extensive work to verify the usefulness of the system.
Figuring out the main contributions is crucial to avoid getting dragged into endless thoughts of improvements and extensions. As a software engineer, I think a lot about engineering qualities such as maintainability, testability and scalability; however, that could, and did, distract me from focusing on the true contribution of the paper. Everything revolves around the main contributions.
Evaluations are in fact quite important. Well, it sounds obvious if your paper stands on a certain kind of evaluation such as case studies, but it is less obvious that the supporting features behind the central case studies also need extensive evaluation (in our case, the throughput and latency benchmarks). I was kind of opposed to this work because I felt it was somewhat obvious and unnecessary, as I was, ironically, too drawn in by the previous point about focusing on the main contributions. However, it later proved useful as important and handy supporting evidence that our system functions efficiently and properly under various types of workload that require a certain level of performance. So in conclusion, doing benchmarks is torture, but necessary 🙂
Hope I can stop myself from falling into some of those traps when I work on the next project, and again, fingers crossed for this one!
After reading them, I started wondering what exactly constitutes influence for pure systems papers; specifically, how are they different from the work a software engineer produces?
First of all, it is important to note that some of them are published by companies such as Yahoo and Databricks. Of those that are not, a significant chunk evaluate the performance of big data systems in real-world industry settings such as Facebook and Yahoo, and many were published after years of industry deployment, perhaps as proof of improvement.
So how are they different from software engineering? Judging purely from the contents of the papers, I don’t think there is much difference. However, research is about proposing ideas, not the engineering itself. Yet those ideas are mostly inspired by, and more importantly proven useful in, industry applications, which in turn requires strong collaboration between industry and academia. Therefore, if I were to pursue a PhD or a research-oriented role in these areas, there would need to be a great deal of collaboration between my school and industry, which unfortunately is usually dominated by the top schools 🙁
I will shift my focus to network papers soon, but I will continue populating the list above once I read more.
COVID-19 has forced me to stay home for perhaps the next 5 months 🙁 Besides daily research and other chores, I am going to pick up some skills!
Learn a new language: I have decided to pick French on Duolingo, mostly because I was heavily embarrassed during my travels in France, as a lot of people there don’t speak English well 🙁 (unlike people in Germany or Switzerland, or even Italy, who can at least understand the words that are coming out of my mouth)
Learn a new instrument: I have decided to pick up the ukulele because 1) it is portable, 2) you can play tabs or sing along and 3) I can carry over some of my guitar experience from a couple of years ago (shame that I didn’t keep it going…)
At the time of writing, I have been practicing a little bit of both every day, and hopefully by the end of this year I will be somewhat of an expert 🙂
Tunable consistency, traded off against availability in the CAP theorem. Provides eventual consistency. The highest consistency level does not guarantee perfect consistency due to the lack of the rollback or WAL (write-ahead log) mechanisms found in traditional RDBMSs.
Recently I was working on some data mining tasks using MapReduce, and I needed to set up Hadoop on my Mac. I attempted several tutorials, and I want to share my experience so that others can avoid the same mistakes.
Do not attempt this one! It was recommended by my instructor, but it caused so many different errors, including the NameNode failing to start, SSH failures, and being unable to install the app jar into Hadoop. It was very outdated as well.
This one mostly works, but it does have a few gotchas:
1. When you create a new SSH key, make sure to use ssh-add to register it with the SSH agent so you don’t have to type your password every time.
2. After installing, the file explorer web UI can’t create files or folders because of a permission issue. If it’s your personal computer, just run hdfs dfs -chmod 777 /YOUR_HADOOP_DIR to make it work.
3. Specifically on a Mac, whose file system is case-insensitive by default, simply running hadoop jar might give you an error like:
Exception in thread "main" java.io.IOException: Mkdirs failed to create //META-INF/license
because a case-insensitive file system cannot hold both LICENSE and license at the same path. To solve this, follow the instructions here.
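To check whether a jar is likely to hit this problem before running it, a small script can list the paths inside the archive that differ only by letter case. This is a minimal sketch, not part of the linked instructions: the function name `case_collisions` is hypothetical, and it only inspects the archive listing using Python's standard `zipfile` module, without touching Hadoop.

```python
import zipfile
from collections import defaultdict
from typing import Dict, List, Set


def case_collisions(jar_path: str) -> List[List[str]]:
    """Return groups of paths in the jar that differ only by letter case.

    Parent directories are included, so a file META-INF/LICENSE and a
    directory META-INF/license/ are reported as a collision.
    """
    paths: Set[str] = set()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            parts = name.rstrip("/").split("/")
            # Record the entry itself and every ancestor directory.
            for i in range(1, len(parts) + 1):
                paths.add("/".join(parts[:i]))

    groups: Dict[str, Set[str]] = defaultdict(set)
    for path in paths:
        groups[path.lower()].add(path)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Calling `case_collisions("YOUR_APP.jar")` (the jar name is a placeholder) returns an empty list when no paths collide; any group it does return is a candidate for the Mkdirs error above.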
Also, there is a quite useful link for debugging common issues here.
Recently I have been taking the Cloud Computing Specialization course series on Coursera. I was very intrigued by the different concepts in distributed systems. However, while working on the homework, I sometimes found it a bit tough to compute attributes like Lamport timestamps when I could not visualize the message flows between distributed processes. Thus, I will start working on a visualization tool and automatic solver for these types of problems.
It aims to provide the following features:
– Drag and drop message flows between processes
– (For unicast messages) Automatically solve Lamport and vector timestamps based on message flows
– (For multicast messages) Automatically solve orderings including FIFO and causal
– TBD
The base of the tool, e.g. its visualization canvas and graph data structures, should be extensible so that it can visualize and solve other types of problems such as the Global Snapshot Algorithm.
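The solver part for unicast messages could start from something like the sketch below. It is only an illustration of the standard Lamport and vector clock update rules, not the tool itself: the event format `(kind, process, message_id)` and the function name `solve_timestamps` are assumptions made up for this example, and events are expected in an order where every send appears before its matching receive.

```python
from typing import Dict, List, Tuple

# (kind, process index, message id); kind is "local", "send" or "recv".
# The message id is ignored for local events.
Event = Tuple[str, int, str]


def solve_timestamps(
    num_procs: int, events: List[Event]
) -> Tuple[List[int], List[List[int]]]:
    """Return the Lamport and vector timestamp of each event, in input order."""
    lamport = [0] * num_procs
    vector = [[0] * num_procs for _ in range(num_procs)]
    sent_lamport: Dict[str, int] = {}
    sent_vector: Dict[str, List[int]] = {}
    lamport_out: List[int] = []
    vector_out: List[List[int]] = []

    for kind, proc, msg in events:
        if kind == "recv":
            # Merge the timestamps piggybacked on the message.
            lamport[proc] = max(lamport[proc], sent_lamport[msg])
            vector[proc] = [max(a, b) for a, b in zip(vector[proc], sent_vector[msg])]
        # Every event (local, send, receive) ticks the local clock.
        lamport[proc] += 1
        vector[proc][proc] += 1
        if kind == "send":
            sent_lamport[msg] = lamport[proc]
            sent_vector[msg] = list(vector[proc])
        lamport_out.append(lamport[proc])
        vector_out.append(list(vector[proc]))
    return lamport_out, vector_out
```

For example, with two processes where P0 sends m1 to P1, P1 does a local step, receives m1, replies with m2, and P0 receives m2:

```python
events = [("send", 0, "m1"), ("local", 1, ""), ("recv", 1, "m1"),
          ("send", 1, "m2"), ("recv", 0, "m2")]
lamport, vector = solve_timestamps(2, events)
# lamport == [1, 1, 2, 3, 4]
# vector  == [[1, 0], [0, 1], [1, 2], [1, 3], [2, 3]]
```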
This post will be updated accordingly (Hopefully I will have enough time) …
(UPDATE 2020/05: Gosh, I still haven’t got started on this, need to catch up now lol)
One thing I recently learned about is Apache Storm, which is a distributed stream processing framework. At first glance, I wondered how it is different from the popular Apache Spark, so I did a little bit of research on this, and I found the two comparison charts from here to be quite useful.
So basically, the major difference seems to come from their fundamental architectures: Spark Streaming still uses RDDs, meaning it’s also suitable for batch processing. In fact, Spark Streaming itself is not exactly stream processing but micro-batch processing: it accumulates data over a short span of time (300ms to 10s etc.) and processes it just like a batch.
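The micro-batching idea can be illustrated in a few lines of plain Python. This is a toy sketch and not Spark's API: the function name `micro_batches` and the `(arrival_time_seconds, value)` record format are assumptions for the example. It simply groups records by fixed time interval, after which each group can be handled like an ordinary batch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(
    records: Iterable[Tuple[float, str]], interval_s: float
) -> List[List[str]]:
    """Group (arrival_time_seconds, value) records into fixed-length windows.

    Returns the non-empty windows in time order; each inner list is one
    micro-batch that a batch-style job could then process as a whole.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for arrival, value in records:
        # All records arriving within the same interval share a bucket.
        buckets[int(arrival // interval_s)].append(value)
    return [buckets[key] for key in sorted(buckets)]


stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
for batch in micro_batches(stream, interval_s=1.0):
    print(len(batch), batch)
```

A record-at-a-time system would instead handle "a", "b", "c" and "d" individually as they arrive, which is the distinction the comparison above is drawing.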
(Interesting fact: the Coursera course was developed in 2014 whereas Spark was released in 2015; not a coincidence that such a powerful framework was not mentioned 🙂 )