blog

Recent learning on (Computer) Networking from practice

Today I finished implementing a heterogeneous network simulation environment in Mininet for my latest work on a traffic migration framework. Although I have used Mininet before, these past few weeks have been a real roller coaster, and I have learnt far more than I knew before.

My latest Mininet environment consists of simulated geo-distributed networks with several Network Functions: OSPF routers, BGP routers, load balancers (L2 and L3/4), and programmable switches (powered by OpenFlow). They are implemented to show the generality of our framework.

Besides the usual Mininet APIs, I also got my hands dirty by tuning industry routing software, specifically the Quagga software package. I learnt to explore and configure OSPF and BGP rules from scratch, as well as to configure public and private IP namespaces. For the programmable switches, I dived deep into the OpenFlow protocol and implemented a Ryu app that can explore the topology, do path discovery, and actively install OpenFlow entries on switches for heterogeneous packet flows (mostly IPv4 and ARP).
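
To give a rough idea of the path discovery step, here is a minimal sketch in plain Python. It is not the actual Ryu app: it assumes the controller has already learned the topology as a map of switch ID to {neighbor: output port}, and the names find_path, path_to_hops and TOPOLOGY are made up for illustration.

```python
from collections import deque


def find_path(links, src, dst):
    """Breadth-first search for a shortest switch-level path from src to dst.

    links maps each switch ID to {neighbor switch ID: local output port}.
    Returns a list of switch IDs, or None when dst is unreachable.
    """
    if src == dst:
        return [src]
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, {}):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == dst:
                # Walk the parent pointers back to src to rebuild the path.
                path = [dst]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


def path_to_hops(links, path):
    """Turn a path into (switch, out_port) pairs, one per forwarding hop."""
    return [(a, links[a][b]) for a, b in zip(path, path[1:])]


# Hypothetical 4-switch topology: s1 - s2 - s4 and s1 - s3 - s4.
TOPOLOGY = {
    1: {2: 1, 3: 2},
    2: {1: 1, 4: 2},
    3: {1: 1, 4: 2},
    4: {2: 1, 3: 2},
}
```

In a controller app, each (switch, out_port) hop would then be turned into a flow entry installed on that switch; that part depends on the controller framework and is left out here.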

After all this, I think the biggest impact on me was truly understanding what Network Functions are. They are basically software-defined modules that sit in the control plane or even the data plane (now P4 is talking!) and manipulate switch tables using the intelligence of a general- or specific-purpose computing environment (e.g. Linux).

Though it was a lot of pain figuring out how so many new things work (I can’t forget that I spent half a day figuring out where the hell the command “show ip ospf” was …), I can finally say that it was worth it! No pain, no gain!

Thoughts on writing paper, as a novice

We have finally finished the draft of our paper for OSDI 2020, a presumably top conference, and this is the first paper in which I have been extensively involved, from idea and implementation to evaluation and writing! While hoping that it gets accepted, I just want to save some notes here on what I have learned from this experience.

  1. Presentation is perhaps the most important part of a paper. At the beginning, I was in fact a bit worried that the technical depth of the paper might not be sufficient and that the central idea might seem a bit simple. However, by emphasizing the novelty of the use cases and the extensive evaluation of the improvements, we were able to show the audience that we had done extensive work to verify the usefulness of the system.
  2. Figuring out the main contributions is crucial to avoid being tormented by endless thoughts of improvements and extensions. As a software engineer, I think a lot about engineering qualities such as maintainability, testability and scalability. However, that would, and did, distract me from the true contribution of the paper. Everything revolves around the main contributions.
  3. Evaluations are in fact quite important. This sounds obvious if your paper stands on a certain kind of evaluation such as case studies, but it is less obvious that supporting features for the central case studies also need extensive evaluation (in our case, the throughput and latency benchmarks). I was kind of opposed to this work because I felt it was somewhat obvious and unnecessary, as I was, ironically, too drawn in by 2). However, it later proved useful as handy supporting evidence that our system functions efficiently and properly under various types of workload that require a certain level of performance. So in conclusion, doing benchmarks is torture, but necessary 🙂

Hope I can stop myself from falling into some of those traps when I work on the next project, and again, fingers crossed for this one!

Preliminary thoughts on Big Data System Papers

Recently, I have been reading some famous big data system papers whose systems have been in industry use for quite some time. Here is an (incomplete) list of them:

Spark RDD: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
Delay Scheduling: http://elmeleegy.com/khaled/papers/delay_scheduling.pdf
Spark SQL: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Spark Streaming: https://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf
Mesos: https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf
Yarn: https://www.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/YARN.pdf
ZooKeeper: https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf

Upon reading them, I started wondering what exactly constitutes influence for pure system papers. Specifically, how are they different from the work produced by a Software Engineer?

First of all, it is important to note that some of them were published by companies such as Yahoo and Databricks. Of those that were not, a significant chunk evaluate the performance of big data systems in real-world industry settings such as Facebook and Yahoo, and many of them were published after years of industry deployment, perhaps as proof of improvement.

So how are they different from Software Engineering? Judging purely from the contents of the papers, I don’t think there is much difference. However, research is about proposing ideas, not engineering itself. Yet those ideas are mostly inspired by, and more importantly proven useful by, industry applications, which in turn require strong collaboration between industry and academia. Therefore, if I were to pursue a PhD or a research-oriented role in these areas, there would need to be a great deal of collaboration between my school and industry, which unfortunately is usually dominated by the top schools 🙁

I will shift my focus to network papers soon, but I will continue populating the list above as I read more.

Try out new things!

COVID-19 has forced me to stay home for perhaps the next 5 months 🙁 Besides daily research and other chores, I am going to pick up some skills!

  1. Learn a new language
    I have decided to pick French on Duolingo, mostly because I was heavily embarrassed during my travels in France, as a lot of people there don’t speak English well 🙁
    (unlike people in Germany or Switzerland, or even Italy, who can at least understand the words that are coming out of my mouth)
  2. Learn a new instrument
    I have decided to pick up the ukulele because 1) it is portable, 2) you can play tabs or sing along and 3) I can carry over some of my guitar experience from a couple of years ago (shame that I didn’t keep it going…)

At the time of writing, I have been practicing a little bit of both every day, and hopefully by the end of this year I will be somewhat of an expert 🙂

Cassandra Internals

Consistency level:

Tunable, trading off against availability as in the CAP theorem. Provides eventual consistency.
Even the highest consistency level does not guarantee perfect consistency due to the lack of a rollback or WAL (write-ahead log) mechanism like those found in traditional RDBMSs.

For more, read here.
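
The tuning knob is often summarized by the R + W > N rule: if the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N), every read set overlaps every write set. A small sketch of that arithmetic, with hypothetical function names (this is not Cassandra's API):

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority quorum."""
    return replication_factor // 2 + 1


def reads_overlap_writes(read_replicas, write_replicas, replication_factor):
    """True when any read set must share a replica with any write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of 3, quorum reads plus quorum writes (2 + 2) overlap, while reading and writing a single replica each (1 + 1) does not, which is where the extra availability and the weaker consistency come from.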

Partition/Hashing mechanism:

Similar to the Chord protocol.

For more, read here.
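
A minimal sketch of the Chord-style idea: nodes and keys are hashed onto the same ring, and a key belongs to the first node at or after its position, wrapping around. The HashRing class is hypothetical, and SHA-256 is only a stand-in hash here, not the partitioner Cassandra actually ships with.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hashing ring with one token per node."""

    def __init__(self, nodes):
        # Sort nodes by their position (token) on the ring.
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(key):
        # Stand-in hash: map any string to a large integer position.
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def owner(self, key):
        """Return the first node clockwise from the key's position."""
        idx = bisect.bisect_left(self._tokens, self._token(key))
        return self._ring[idx % len(self._ring)][1]
```

The useful property is that adding a node only moves the keys that fall into the new node's range; every other key keeps its owner.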

Membership Protocol (Failure detection):

Similar to the Gossip membership protocol.

For more, read here.
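
A minimal sketch of gossip-style membership, assuming a heartbeat counter per node and a fixed timeout. The Member class and its method names are hypothetical, and the fixed timeout is a simplification for illustration rather than how a production failure detector decides.

```python
class Member:
    """One node's view of the cluster in a toy gossip protocol."""

    def __init__(self, name):
        self.name = name
        # peer name -> (heartbeat counter, local time the entry last advanced)
        self.table = {name: (0, 0)}

    def tick(self, now):
        """Bump our own heartbeat once per gossip round."""
        beat, _ = self.table[self.name]
        self.table[self.name] = (beat + 1, now)

    def merge(self, remote_table, now):
        """Keep whichever heartbeat is higher; stamp updates with local time."""
        for peer, (beat, _) in remote_table.items():
            known = self.table.get(peer)
            if known is None or beat > known[0]:
                self.table[peer] = (beat, now)

    def suspected(self, now, timeout):
        """Peers whose heartbeat has not advanced within `timeout` time units."""
        return sorted(
            peer
            for peer, (_, seen) in self.table.items()
            if peer != self.name and now - seen > timeout
        )
```

Each round, a node would call tick, send its table to a few random peers, and those peers would call merge; a peer that stops advancing its heartbeat eventually shows up in suspected.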

Request ordering:

Does not use a causality-based method such as Lamport or vector timestamps.
Instead, it uses a “last write wins” conflict resolution strategy and relies on clock synchronisation.

For more, read here.
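
A minimal sketch of last-write-wins reconciliation, assuming each replica returns a (timestamp, value) pair. The reconcile function is hypothetical, and breaking ties by comparing values is an assumption made so that every replica picks the same winner.

```python
def reconcile(replica_responses):
    """Pick the version with the largest timestamp (ties broken by value)."""
    return max(replica_responses, key=lambda version: (version[0], version[1]))


# Three replicas answer with different versions of the same cell.
latest = reconcile([(100, "a"), (105, "b"), (103, "c")])

# A newer write stamped by a clock that runs behind loses to the older one,
# which is why this strategy depends on synchronised clocks.
skewed = reconcile([(100, "old"), (90, "new")])
```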

More concepts to be added….

Install Hadoop on Mac in Standalone Mode

Recently I was working on some data mining tasks using MapReduce, and I needed to set up Hadoop on my Mac. I tried several tutorials and want to share my experience so that others can avoid the same mistakes.

Do not attempt this one! It was recommended by my instructor, but it caused many different errors, including the NameNode failing to start, SSH failures, and being unable to load my app jar into Hadoop. It was also very outdated.

https://medium.com/@jayden.chua/installing-hadoop-on-macos-a334ab45bb3

This one mostly works, but it does have a few gotchas:
1. When you create a new SSH key, make sure to use ssh-add to make it available to the SSH agent so you don’t have to type your password every time.
2. After installing, the file explorer web UI can’t create files, folders, etc. because of a permission issue. If it’s your personal computer, just run hdfs dfs -chmod 777 /YOUR_HADOOP_DIR to make it work.
3. Specifically on Mac, whose file system is case-insensitive by default, simply running hadoop jar might give you an error like:

Exception in thread "main" java.io.IOException: Mkdirs failed to create //META-INF/license

because a case-insensitive file system cannot hold both LICENSE and license. To solve this, follow the instructions here.

Also, there is a quite useful link for debugging common issues here.

Distributed System Learning Tool (Updating)

Recently I have been taking the Cloud Computing Specialization course series on Coursera. I was very intrigued by the different concepts in distributed systems. However, while working on the homework, I sometimes found it tough to compute attributes like Lamport timestamps when I could not visualize the message flows between distributed processes. Thus, I will start working on a visualization tool and automatic solver for these types of problems.

It aims to provide the following features:
– Drag and drop message flows between processes
– (For unicast messages) Automatically solve Lamport and Vector timestamps based on message flows
– (For multicast messages) Automatically solve orderings including FIFO and Causal
– TBD
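
As a starting point for the unicast solver, here is a minimal sketch of how Lamport and vector timestamps could be computed from a message flow. The event format (kind, process, optional message ID), the function names and the sample EVENTS are all assumptions for illustration; events must be listed in an order where every receive comes after its send.

```python
def lamport_timestamps(events):
    """Assign a Lamport timestamp to each event, in the order given."""
    clock = {}   # process -> last timestamp it used
    sent = {}    # message ID -> timestamp carried by the message
    stamps = []
    for kind, proc, *rest in events:
        t = clock.get(proc, 0) + 1
        if kind == "recv":
            # Receive rule: max(local clock, message timestamp) + 1.
            t = max(t, sent[rest[0]] + 1)
        clock[proc] = t
        if kind == "send":
            sent[rest[0]] = t
        stamps.append(t)
    return stamps


def vector_timestamps(events, processes):
    """Assign a vector timestamp (one slot per process) to each event."""
    index = {p: i for i, p in enumerate(processes)}
    clock = {p: [0] * len(processes) for p in processes}
    sent = {}
    stamps = []
    for kind, proc, *rest in events:
        vec = list(clock[proc])
        if kind == "recv":
            # Element-wise max with the vector carried by the message.
            vec = [max(a, b) for a, b in zip(vec, sent[rest[0]])]
        vec[index[proc]] += 1  # every event advances the process's own slot
        clock[proc] = vec
        if kind == "send":
            sent[rest[0]] = list(vec)
        stamps.append(tuple(vec))
    return stamps


# Hypothetical flow: P1 sends m1 to P2, P2 does local work, then replies with m2.
EVENTS = [
    ("send", "P1", "m1"),
    ("local", "P2"),
    ("recv", "P2", "m1"),
    ("send", "P2", "m2"),
    ("recv", "P1", "m2"),
]
```

For this flow, lamport_timestamps(EVENTS) gives [1, 1, 2, 3, 4] and vector_timestamps(EVENTS, ["P1", "P2"]) gives [(1, 0), (0, 1), (1, 2), (1, 3), (2, 3)]. The drag-and-drop canvas would only need to produce such an event list.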

The base of the tool, e.g. its visualization canvas and graph data structures, should be extensible to visualize and solve other types of problems, such as the Global Snapshot Algorithm.

This post will be updated accordingly (Hopefully I will have enough time) …

(UPDATE 2020/05: Gosh, I still haven’t got started on this; I need to catch up now lol)

Apache Spark(-streaming) vs Apache Storm

Recently I have been taking the Cloud Computing Specialization MCS course on Coursera for fun and to gain breadth in distributed systems.

One thing I recently learned about is Apache Storm, a distributed stream processing framework. At first glance, I wondered how it differs from the popular Apache Spark, so I did a little research, and I found the two comparison charts from here to be quite useful.

[Chart: Comparison in different aspects]
[Chart: Choice of framework in specific scenarios]

So basically, the major difference seems to stem from their fundamental architectures: Spark Streaming still uses RDDs, meaning it is also suitable for batch processing. In fact, Spark Streaming itself is not exactly stream processing but micro-batch processing: it collects data over a short span of time (e.g. 300 ms to 10 s) and processes it just like a batch.
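
To make the micro-batching idea concrete, here is a minimal sketch in plain Python. It does not use the Spark API: the micro_batches function and the sample events are hypothetical, and events are assumed to be (timestamp in ms, value) pairs.

```python
from collections import defaultdict


def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into fixed, consecutive time windows."""
    windows = defaultdict(list)
    for timestamp_ms, value in events:
        # Integer division assigns each event to the window it falls into.
        windows[timestamp_ms // interval_ms].append(value)
    # Return the non-empty windows in time order.
    return [windows[w] for w in sorted(windows)]


# Four events grouped into 300 ms windows.
SAMPLE = [(0, "a"), (120, "b"), (310, "c"), (950, "d")]
batches = micro_batches(SAMPLE, 300)
```

Each resulting list can then be processed like a small batch job, which is why the latency of this approach is bounded below by the batch interval, whereas a per-event engine handles each record as it arrives.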

(Interesting fact: the Coursera course was developed in 2014 whereas Spark was released in 2015; not a coincidence that such a powerful framework was not mentioned 🙂 )

New Year, New Website

Just finished revamping my website.

I will try to be more active on sharing my thoughts and ideas 🙂

This year’s going to be a year of change!