17Feb/11

Love ASCII Art

I saw this ASCII art in a comment on YouTube and thought it was so awesome that I wanted to share it:

[Image: love ASCII art]

I know it's probably all over the web and been around forever, but this is the first time I've seen it, and I LOVE it! Enjoy!

Filed under: Geekery
4Feb/11

Good Reads: High-Availability Storage Systems and Recommendation Algorithms

I have three papers today that are relevant to a project I'm working on. They're all under 10 pages and easy to read.

First is a white paper by Siddharth Anand, Netflix’s Transition to High-Availability Storage Systems. This paper makes many of the points I've been trying to coalesce in my mind and communicate recently regarding NoSQL, but far more eloquently. I particularly enjoyed the two sections at the end, best practices and challenges of SimpleDB. One really can't go wrong with clear, concise lists such as he has written, and some of them actually made me snicker out loud as I imagined the consternation they could cause in some DBAs I've known over the years.

I've been very interested in the changes Netflix has been making in moving to the Amazon AWS cloud. At the same time, I personally find their movie recommendation system frustrating and annoying, so the second paper I have is on the Amazon.com recommendation algorithm. This paper is older, written in 2003, but still very relevant today. Amazon uses item-to-item collaborative filtering, achieving scalability by pushing the expensive operations to off-line computations and thus simplifying the real-time recommendation look-up. An algorithm building on that was presented recently, in a paper examining the YouTube.com scalable video recommendation system that was adopted about a year ago. YouTube computes recommendations off-line with a series of MapReduce computations on the user graph of signals, building up a recommendation store in BigTable for fast real-time retrieval.
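To make the off-line/real-time split concrete, here is a toy sketch in Python. It is not the algorithm from any of the papers: the cosine scoring, the function names and the sample purchase data are all my own assumptions, chosen only to show the expensive pairwise work happening ahead of time and the look-up staying cheap.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_similar_items(purchases):
    """Off-line step: score every pair of items that share at least one buyer."""
    buyers = defaultdict(set)  # item -> users who bought it
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)

    shared = defaultdict(int)  # (item_a, item_b) -> number of common buyers
    for items in purchases.values():
        for a, b in combinations(sorted(set(items)), 2):
            shared[(a, b)] += 1

    table = defaultdict(dict)
    for (a, b), count in shared.items():
        # Cosine similarity between the two items' buyer sets (an assumption).
        score = count / sqrt(len(buyers[a]) * len(buyers[b]))
        table[a][b] = score
        table[b][a] = score
    return table


def recommend(table, owned, limit=3):
    """Real-time step: only table look-ups and a sort, no pass over all users."""
    scores = defaultdict(float)
    for item in owned:
        for other, score in table.get(item, {}).items():
            if other not in owned:
                scores[other] += score
    return sorted(scores, key=lambda name: (-scores[name], name))[:limit]


# Made-up sample data, purely for illustration.
purchases = {
    "ann": ["book", "lamp"],
    "bob": ["book", "lamp", "desk"],
    "cat": ["book", "desk"],
    "dan": ["lamp", "rug"],
}
similar = build_similar_items(purchases)
print(recommend(similar, {"book"}))
```

In a real system the table would be rebuilt periodically by a batch job and kept in a fast store, so the request path never touches the raw purchase history.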

[Credit to a post by Greg Linden providing good food for thought.]

29Jan/11

Two Tricks: Android Memory Leaks and Red Hat RPM Circular Dependencies

I learned two nifty tricks this week that are totally unrelated to each other.

One: if you get into a circular dependency when installing Red Hat RPMs, you can list them all in the install command at once. I think I knew this in the past, but it's been a long time since I had that problem, so I'd forgotten it.

rpm -ivh package1.rpm package2.rpm

Thanks to GeekGoesMeow for that. I was installing the gd-devel RPM, the exact situation mentioned in one of the comments. It was maddening!

Two: quickly find one type of memory leak in your Android app by rotating your phone between vertical and horizontal orientation repeatedly. If you're storing a Context, or anything derived from Context such as an Activity, you will quickly encounter a Java OutOfMemoryError, because each rotation creates a new Activity and the stored reference keeps the previous one from being garbage collected. Calling System.gc() won't help. Pass your Contexts and Activities from method to method instead of storing them. This tip is thanks to Yusuf Saib, speaking at a Silicon Valley Android Developers Meetup.

Filed under: Geekery
22Jan/11

Cassandra 0.7 and Hector for Noobs, Part 2: Amazon

This is part two of the series about bringing up a 3-node Cassandra cluster running on Amazon EC2 instances, with a modest Java program that uses the Hector client to talk to the cluster. The first part was an overview covering a variety of topics, including bringing up Cassandra for the first time, the Cassandra command-line interface, and creating a first keyspace and column family and putting some data into it. This second post will delve into the details of bringing up your first Cassandra cluster on Amazon EC2.

First caveat - this post is not advice for a production Cassandra cluster. It is just my experience playing around with Cassandra and Amazon EC2 to create a small 3-node cluster to tinker with for fun. It gives me an opportunity to record my experiences, the traps I fell into, and how you might avoid or work around them. Second caveat - just to be clear, Amazon EC2 is not free. It is a paid service, and you are responsible for any costs you may incur. You will incur costs if you try to replicate my steps. Third caveat - I am a noob at Cassandra and Hector, so please fact-check the material I'm presenting here. I will give links to point you to authoritative information wherever possible.

Documentation for Cassandra

Since my last post on Cassandra 0.7, new and very helpful documentation has cropped up that you might want to peruse:

Documentation for Amazon EC2

Here are the links to the main documentation for Amazon EC2 that I will reference throughout this article:

Basic Steps for an Experimental Amazon EC2 Cassandra Cluster

The basic steps I'll cover in this post are as follows:

  1. Set up your Amazon account, get familiar with EC2 and EBS
  2. Choose your base AMI
  3. Set up your base system
  4. Store your base system as an EBS-backed AMI
  5. Configure and launch instances
  6. Test your Cassandra cluster and try some basic interactions with it
  7. STOP your instances (you don't want to forget this step)

This is the method I've devised for my own personal investigation of Amazon and EC2, entirely for learning purposes. As I've worked my way through this process, I've discovered that there are many ways to achieve this same result. I link to some of them below, but keep in mind that there are lots of ways to do this, and mine is definitely not the most refined.

Amazon EC2 Setup

If you haven't already, head over to aws.amazon.com to set up your Amazon EC2 account and be sure to check out the pricing. Then work your way through the entire Getting Started with EC2 guide, including bringing up your first instance. At the point where you're given a choice between Linux and Windows, this post covers the Linux option, but of course you should choose whatever you prefer.

Now you should be up to speed on launching an Amazon EC2 instance from a provided AMI. EBS is Elastic Block Store, and we will use it to store our own AMI, from which we will launch our Cassandra instances. You do not need to set up an EBS account or create special keys for EBS.

Through the remainder of this post, you can find more detail on all of the Amazon topics covered by referencing the EC2 guide. It's loaded with information. You should use that guide as your main source of information and anything I write in the remainder of this post should be taken with a grain of salt and fact-checked against the guide.

Choose Your Base AMI

Now that you have an idea of the AMIs available by default from Amazon after running through the tutorial, it's time to see the plethora of AMIs that are public and choose one to start with. From within your management console you can view them by clicking the "Launch Instance" button and going to the "Community AMIs" tab. From there you can see thousands of AMIs. As a word of caution, you should research any AMI you are thinking of launching to be sure it's safe and reliable and not pre-infected with worms, rootkits, etc.

For my own use, I like and am familiar with Ubuntu, so I'm using an Ubuntu AMI as my base. I chose my AMI after perusing the Lucid Lynx list, because Lucid is the most recent long-term support release of Ubuntu. You can see the Ubuntu AMI list here. I chose a Small instance from the list since I wanted to keep the costs down while I was just experimenting.

Instance types come in many flavors: Standard, Micro, High-CPU, High-Memory, Cluster Compute and Cluster GPU. Within each flavor are many specific offerings at varying price levels. I'm using a Standard - Small, but you should note that Small is just barely sufficient for Cassandra and doesn't have a 64-bit option. Standard - Small has these characteristics:

  • 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
  • Memory: 1.7 GB
  • 160 GB instance storage (150 GB plus 10 GB root partition)
  • Platform: 32-bit
  • Moderate I/O

The Cassandra documentation says this about EC2:

On EC2, the best practice is to use L or XL instances with local storage. I/O performance is proportionately much worse on S and M sizes, and EBS essentially doubles your dependence on the already-overcrowded EC2 network...

For more thoughts on EC2 with Cassandra, I found this slideshare.

Armed with this information, give some thought to what interests you, go out and do a little research, decide on an AMI to use, and take note of its ID. You will want to choose an EBS-backed AMI rather than an S3-backed AMI, since this post goes down the EBS path. If you prefer S3-backed AMIs, then you'll need to make your own adjustments as you follow along with this post. To simplify matters, rather than hunting down the perfect AMI at this point, you can always base your system off of one of the AMIs in the tutorial you followed in the first step.

Set Up Your Base System

Now that you've chosen an AMI and know its ID, you can launch it in the EC2 Management Console. It appears to be possible to do everything via the command line that you can do in the console, but we'll stick to the console for now. I'm not going to cover the details of launching the instance from the AMI ID; just click the "Launch Instance" button and then follow the wizard. If you have questions, the EC2 guide should be able to answer them.

Alternatively, if you want a more streamlined approach, have a look at this tutorial on the Cassandra wiki (and here's yet another approach, though older). My approach is not nearly as streamlined, but my goal was more in the way of educating myself about each step, whereas the tutorial seems to be more oriented toward getting you up and running as quickly as possible. Choose the approach that works best for you. Please note that Debian packages are not always the most recent version of software and Debian has a tendency to store files all over the place (also true for Ubuntu), so you may need to do some hunting to locate your configuration files.

Assuming you're sticking with this guide and not the "quick start", we'll continue on with the experiment. After your instance is launched, you will want to do the following:

  • Log onto your instance and make sure it's what you wanted
  • Run any OS updates that might be needed. In Ubuntu you would do this using apt-get.
  • Install Java 6
  • Download and install Cassandra
  • Configure Cassandra - basic, not multi-node
  • Start Cassandra to verify it works
  • Clean up Cassandra files!

You can log onto a Linux instance using ssh, using a command similar to this:

ssh -i /home/myname/.gnupg/mynameEC2key.pem ubuntu@ec2-123-23-123-123.compute-1.amazonaws.com

Here "ec2-123-23-123-123.compute-1.amazonaws.com" is the host name given to your instance after launch. You can view the host name and IP address in the management console. "ubuntu" is the default user for the Ubuntu AMI. The "-i /home/myname/.gnupg/mynameEC2key.pem" option specifies the key you chose when launching your instance.

Log on, take a look around, and verify that the instance you've launched is what you were expecting from the AMI you chose.

Next, follow whatever procedure you would normally follow to be sure you have an up-to-date system. In Ubuntu and Debian Linux systems, you would run commands such as these, refreshing the package lists before upgrading:

sudo apt-get update
sudo apt-get upgrade

Next, install Java 6, if it's not already there, using your usual mechanism for installing Java on your chosen system. I would recommend not trying to use OpenJDK or GCJ. In Ubuntu, you would use a command like "sudo apt-get install sun-java6-jdk".

Load the Cassandra release onto your instance, using whatever method you prefer, and choose a location to install it. Be sure to verify your download using the PGP signature or the MD5 or SHA1 checksum. Then extract the tar.gz file.
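If you would rather script the MD5 or SHA1 check, here is a minimal Python sketch using only the standard hashlib module. The function names are mine, and the file name in the usage comment is only an example; substitute whatever you actually downloaded. Note that this covers the checksum only, not the PGP signature.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published, algorithm="sha1"):
    """Compare a file against a published checksum.

    Accepts either a bare hex string or a "<hash>  <filename>" line,
    and ignores case and surrounding whitespace.
    """
    expected = published.split()[0].lower()
    return file_digest(path, algorithm) == expected


# Example usage (the file name and checksum text are placeholders):
# ok = matches_published("apache-cassandra-0.7.0-bin.tar.gz", published_sha1_text)
```

A mismatch means the download is corrupt or has been tampered with, so fetch it again before extracting anything.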

The IP addresses you should use will come from the instance info for each instance you will launch later, so for this configuration, set Cassandra up as a single-node cluster. Please refer to Datastax instructions for setting up a single-node Cassandra cluster, making sure to set the token to 0.

  • Note that mx4j-tools.jar does not come with your Cassandra download, so download that separately and store it in the Cassandra lib folder. You will want it!

What we are doing is starting Cassandra up and verifying that it works, but then we will clean it up afterward so that we can launch multiple instances in a clean state. This is important because Cassandra stores information about nodes when it first starts up, and since we want to have a clean AMI that we can launch repeatedly, we don't want this data hanging around; see this discussion.

At this point, with Cassandra running in the foreground (bin/cassandra -f) so that its log is tailed in your first ssh connection, you can use the Cassandra client to try creating a keyspace, to verify that it works. Make a new ssh connection to do this.

bin/cassandra-cli -host localhost -port 9160
create keyspace Keyspace1 with replication_factor = 1 and placement_strategy = org.apache.cassandra.locator.RackUnawareStrategy;

View the log messages in your first window to observe that the keyspace is created.

Now, stop Cassandra and clean up the files:

sudo rm -rf /var/lib/cassandra/data/
sudo rm -rf /var/lib/cassandra/commitlog/
sudo rm -rf /var/lib/cassandra/saved_caches/

Now it will once again be as if Cassandra had never been started on your system. This is now your functional base instance.

Store EBS-Backed AMI

Now we're ready to store this baby, and then it will be available to launch instances that will be ready to run with a little bit of configuration. Go back to your Amazon EC2 management console and see that below Instances there is an Elastic Block Store section. My current understanding is that when we create an EBS-backed AMI, it will create a snapshot, and when we launch each instance from that AMI snapshot, a volume will be created and assigned to that instance. I'm new to this, so I recommend reading the guide yourself and following the instructions; see these sections:

  • AWS Documentation » Amazon EC2 » User Guide » Using Amazon EC2 » Using AMIs » Creating Your Own AMIs » Creating Amazon EBS-Backed AMIs
  • AWS Documentation » Amazon EC2 » User Guide » Using Amazon EC2 » Using Amazon EBS-Backed AMIs and Instances

Configuring and Launching Instances

Now that you have an EBS-backed AMI with Java and Cassandra installed on it, we will launch three instances. The first one will be the seed node.

In the management console, in the Instance interface, click the Launch Instance button. Go to the My AMIs tab and select your new AMI and follow the wizard. Some helpful notes:

  • Be sure to give a tag, such as name=Cassandra, to each instance, so they will be easily identifiable. For the seed node, you could have name=CassandraSeed to easily find the seed.
  • Be sure to assign the same security group to all three instances. After you launch an instance you cannot change which security group it belongs to, but you can change the details of the security group. If you want them to all be able to communicate with each other using the internal IP addresses (cheapest option), then having the same security group is important.
  • Be sure to launch them in the same availability zone, such as us-east-1a.

Launch 3 nodes, one as the seed and two as non-seed nodes.

Now connect to each node using ssh to configure Cassandra. For configuring Cassandra as a multi-node cluster, I want to point you to the Datastax configuration page for setting up a multi-node Cassandra cluster and add a few extra comments. See also this wiki page.

Configure the IP addresses and seeds as in the Datastax instructions, using the management console to get your internal IP addresses, and being careful to be consistent about which one you chose to be the seed.

The token for the seed should be 0. The tokens for the other nodes are calculated using this formula: i * (2**127 / N) for i = 0 .. N-1, where i is the node number starting from 0 and N is the total number of nodes (3 in this case). Using that formula, I calculated these values:

token node1 (seed) = 0
token node2 = 56713727820156410577229101238628035243
token node3 = 113427455640312821154458202477256070485
the value of 2**127 is 170141183460469231731687303715884105728
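If you want to recompute these values, or work them out for a different cluster size, a small Python sketch of the formula follows. The function name is my own, and I am assuming round-to-nearest, since that is what reproduces the values listed above (plain integer division would give results that differ by 1 in the last digit).

```python
RING_SIZE = 2 ** 127  # size of the token space used in the formula above


def initial_tokens(node_count):
    """Evenly spaced tokens: i * (2**127 / N), rounded to the nearest integer.

    Integer arithmetic is used throughout because these values are far too
    large to be represented exactly as floating-point numbers.
    """
    return [
        (2 * i * RING_SIZE + node_count) // (2 * node_count)
        for i in range(node_count)
    ]


for index, token in enumerate(initial_tokens(3), start=1):
    print(f"token node{index} = {token}")
```

Calling initial_tokens with a different node count gives the spacing for a larger cluster, but check the Cassandra documentation before moving tokens on nodes that already hold data.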

You can read more about the tokens at the Cassandra wiki.

Now you're ready to start Cassandra on your nodes. Start the seed node first, using the familiar command that allows you to tail the log: bin/cassandra -f

Testing the Cassandra Cluster

Now open a new ssh session to one of the three nodes that you've launched and change directory to your Cassandra install.

You can use nodetool to verify that your nodes are all working together (this is where that mx4j-tools.jar file starts to come in handy).

ubuntu@domU-123-123-123:/opt/apache-cassandra-0.7.0$ bin/nodetool -host localhost ring
Address         Status State   Load            Owns    Token
                                                       113427455640312821154458202477256070485
10.207.6.65     Up     Normal  10.59 KB        33.33%  0
10.207.3.113    Up     Normal  10.43 KB        33.33%  56713727820156410577229101238628035243
10.214.47.207   Up     Normal  10.63 KB        33.33%  113427455640312821154458202477256070485
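If you want to check the ring from a script rather than by eye, the output can be parsed with a few lines of Python. This is only a sketch: it assumes the column layout of the sample output above (other Cassandra versions may print different columns), and the function name and the SAMPLE text are mine.

```python
def parse_ring(output):
    """Parse `nodetool ring` text in the layout shown above into node dicts."""
    nodes = []
    for line in output.splitlines():
        parts = line.split()
        # Node rows have 7 fields: address, status, state, load, unit, owns, token.
        # The header row and the lone wrap-around token row are skipped.
        if len(parts) == 7 and parts[1] in ("Up", "Down"):
            nodes.append({
                "address": parts[0],
                "status": parts[1],
                "state": parts[2],
                "owns": parts[5],
                "token": int(parts[6]),
            })
    return nodes


# Sample text copied from the ring output above.
SAMPLE = """\
Address         Status State   Load            Owns    Token
                                                       113427455640312821154458202477256070485
10.207.6.65     Up     Normal  10.59 KB        33.33%  0
10.207.3.113    Up     Normal  10.43 KB        33.33%  56713727820156410577229101238628035243
10.214.47.207   Up     Normal  10.63 KB        33.33%  113427455640312821154458202477256070485
"""

nodes = parse_ring(SAMPLE)
down = [node["address"] for node in nodes if node["status"] != "Up"]
print(len(nodes), "nodes; down:", down or "none")
```

In practice you would capture the real command output (for example with the subprocess module) and pass it to parse_ring instead of the sample text.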

I've had some problems with nodetool that I'll list here. I don't know if I have a buggy version of it, or am missing some crucial bit of information as to how to use and configure it. I think the problems are in some way connected to the way EC2 instances are configured.

  • I've had problems getting nodetool to work using localhost as the hostname and had to resort to an IP address for one of the nodes, and not the node I am ssh'ed into
  • I've had problems with the nodetool not working after I shut the instances down and then restarted them later

You can now use the cassandra-cli to create a keyspace and column family and observe the log messages as you do that. The command for the keyspace is above.

Stop Your Instances

Be sure to stop your instances when you are done to stop the clock on charges.

Issues

  • When you stop your instances, the next time you start them up again, you will need to reconfigure all of the IP addresses. The workaround for this is to pay for static IPs, which is out of the scope of this article.
  • Don't forget to stop your instances, since you pay for the time when they are running.

Conclusion

This was the tedious part - getting the infrastructure ready. In the next post(s), I'll start working with Hector and the Hector example code to define my schema programmatically, and then the real fun will begin. I plan to eventually write a web application and Android app that are going to use this simple Cassandra cluster.

Filed under: Java
30Dec/10

A New Favorite Web Security Book

My mom, wonderful person that she is, sent me the Web Security Testing Cookbook by Paco Hope and Ben Walther, and it's just delicious. I've been endlessly fiddling with my computer installing new tools and trying out the recipes ever since. This is fun stuff!

Web Security Testing Cookbook with cats (to enhance the visual effect)

The book is packed with delightful recipes with titles such as "Creating Decompression Bombs", "Subverting AJAX With Injected Data", "Creating Overlays Using XSS", and many more. The book is targeted to software developers and testers who are interested in improving the security of their software and incorporating security tests into their test suite.

The recipes are short and easy to follow and implement, and give a great sense of satisfaction and accomplishment when completed.

One thing I found was that the CAL9000 tool is no longer available and OWASP now suggests using EnDe instead, so I'll need to make adjustments to the recipes that use the CAL9000 tool. If I find any other anomalies worth mentioning, I'll update this post.

Filed under: Security