Install Hadoop on Mac Standalone Mode

Recently I was working on some data mining tasks using MapReduce, and I needed to set up Hadoop on my Mac. I tried several tutorials, and I want to share my experience so that others can avoid the same mistakes.

Do not attempt this one! It was recommended by my instructor, but it caused many different errors, including the namenode failing to start, SSH failures, and being unable to install your app jar into Hadoop. It was very outdated as well.

This one mostly works, but it does have a few gotchas:
1. When you create a new SSH key, make sure to use ssh-add to load it into the SSH agent so you don’t have to type your passphrase every time.
2. After installing, the file explorer web UI can’t create files or folders because of a permission issue. If it’s your personal computer, just run hdfs dfs -chmod 777 /YOUR_HADOOP_DIR to make it work.
3. Specifically on Mac, whose file system is case-insensitive by default, simply running hadoop jar might give you an error like:

Exception in thread "main" Mkdirs failed to create //META-INF/license

because a case-insensitive file system cannot hold both LICENSE and license in the same directory. To solve this, follow instructions here.
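To check whether a jar is affected before running it, a small script can list the entries whose paths differ only by letter case. This is a minimal sketch using Python's standard zipfile module (a jar is a zip archive). The function name find_case_collisions is my own, and it assumes the failure really is caused by a case collision as described above.

```python
import zipfile
from collections import defaultdict


def find_case_collisions(jar_path):
    """Return groups of archive paths that differ only by letter case."""
    groups = defaultdict(set)
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            # Directory entries end with "/"; strip it so that a file and a
            # directory with the same spelling can be compared directly.
            parts = name.rstrip("/").split("/")
            # Also record every parent directory, because a directory that
            # is only implied by a longer path can collide with a file too.
            for i in range(1, len(parts) + 1):
                prefix = "/".join(parts[:i])
                groups[prefix.lower()].add(prefix)
    # Keep only the lowercase keys that map to more than one spelling.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

Calling find_case_collisions on your app jar returns a list of groups such as ["META-INF/LICENSE", "META-INF/license"]. Any group it reports is a candidate for the Mkdirs failure on a case-insensitive file system.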

Also, there is a quite useful link for debugging common issues here.

Distributed System Learning Tool (Updating)

Recently I have been taking the Cloud Computing Specialization course series on Coursera. I was very intrigued by the different concepts in distributed systems. However, while working on the homework, I sometimes found it a bit tough to compute attributes like Lamport timestamps when I could not visualize the message flows between distributed processes. Thus, I will start working on a visualization tool and automatic solver for these types of problems.

It aims to provide the following features:
– Drag and drop message flows between processes
– (For unicast messages) Automatically solve Lamport and Vector timestamps based on message flows
– (For multicast messages) Automatically solve orderings including FIFO and Causal
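
As a rough illustration of what the timestamp solver has to compute, here is a minimal sketch of the standard Lamport and vector clock update rules for unicast messages. It is not the tool's actual implementation. The event encoding (tuples of "local", "send", "recv") and the function names are assumptions made for this example.

```python
def timestamp_events(num_processes, events):
    """Assign Lamport and vector timestamps to a causally ordered event list.

    Each event is ("local", p), ("send", p, msg_id) or ("recv", p, msg_id),
    where p is a process index. A "recv" must come after its matching "send".
    """
    lamport = [0] * num_processes
    vector = [[0] * num_processes for _ in range(num_processes)]
    in_flight = {}  # msg_id -> (lamport, vector) carried by the message
    result = []
    for event in events:
        kind, p = event[0], event[1]
        if kind == "recv":
            sent_lamport, sent_vector = in_flight[event[2]]
            # Merge the sender's knowledge before counting this event.
            lamport[p] = max(lamport[p], sent_lamport)
            vector[p] = [max(a, b) for a, b in zip(vector[p], sent_vector)]
        # Every event, including a receive, ticks the local clock once.
        lamport[p] += 1
        vector[p][p] += 1
        if kind == "send":
            in_flight[event[2]] = (lamport[p], list(vector[p]))
        result.append((event, lamport[p], tuple(vector[p])))
    return result


def happened_before(v1, v2):
    """True if vector timestamp v1 causally precedes v2."""
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2


# Two processes exchanging two messages.
demo = timestamp_events(2, [
    ("send", 0, "m1"),
    ("local", 1),
    ("recv", 1, "m1"),
    ("send", 1, "m2"),
    ("recv", 0, "m2"),
])
for event, lamport_ts, vector_ts in demo:
    print(event, lamport_ts, vector_ts)
```

Two events are concurrent when happened_before is False in both directions, which is the information a Lamport timestamp alone cannot give you.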

The base of the tool, e.g. its visualization canvas and graph data structures, should be extensible to visualize and solve other types of problems such as the Global Snapshot Algorithm.

This post will be updated accordingly (Hopefully I will have enough time) …

Apache Spark vs Apache Storm

Recently I have been taking the Cloud Computing Specialization MCS course on Coursera for fun and to gain breadth in distributed systems.

One thing I recently learned about is Apache Storm, which is a distributed stream processing framework. At first glance, I wondered how it differs from the popular Apache Spark, so I did a little research and found the two comparison charts from here to be quite useful.

Comparison in different aspects
Choice of framework in specific scenarios

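So basically, the major difference seems to come from their fundamental architectures: Storm processes a stream one event at a time, while Spark is built around batch processing of data sets (commonly stored in HDFS) and treats a stream as a series of small batches, meaning it also works for batch processing.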

(Interesting fact: the Coursera course was developed in 2014, the same year Spark 1.0 was released; not a coincidence that such a powerful framework was not mentioned 🙂 )

New Year, New Website

Just finished revamping my website.

I will try to be more active in sharing my thoughts and ideas 🙂

This year’s going to be a year of change!