6 Oct 2010

Cassandra 0.7 and Hector for Noobs

I've been fiddling around with Cassandra 0.7 Beta2 and the Hector Java client (because I must be a closet masochist messing with a beta [update: Cassandra 0.7 is no longer in beta, so please give it a try!]). Since documentation for these is seriously lacking [but getting better], I decided to write up my discoveries and observations in the hope of helping out other noobs like myself. My ultimate goal, by the end of this post, is to bring up a 3-instance cluster of Cassandra nodes running on Amazon EC2 and have a modest Java program talk to that cluster through the Hector client.

Before you read any further, I want to stress that the Cassandra documentation has very minimal detail and the same goes for Hector, and as of yet, I've found no online tutorials at other web sites for version 0.7. This adventure is not for the faint of heart!

If you've decided to plunge ahead anyway, then you should first study the following documents very carefully, because going forward, I'm assuming you're not a Java noob and you're up-to-speed with at least this much introductory material:

If you find newer and/or better documents online, please list them in the comments. Thanks!

After you've gotten up to speed with those, then the fun begins!

Download and install Cassandra 0.7 Beta2: apache-cassandra-0.7.0-beta2-bin.tar.gz and run it as a single node on your localhost if possible. The code examples we'll try first make the assumption you're running on localhost and using the default port of 9160.
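
If you want a quick way to confirm that something is actually listening on port 9160 before pointing the examples at it, a minimal sketch like the one below may help. It uses only the JDK. The class name ThriftPortCheck and the 2-second timeout are my own choices for illustration, and a successful connection only shows that some process accepts TCP connections there, not that it is Cassandra.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether anything accepts TCP connections on a given host and port. */
final class ThriftPortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host not resolvable.
            return false;
        }
    }

    public static void main(String[] args) {
        // 9160 is the default port mentioned above; the timeout is arbitrary.
        boolean up = isListening("localhost", 9160, 2000);
        System.out.println(up
                ? "Something is accepting connections on localhost:9160"
                : "Nothing is listening on localhost:9160");
    }
}
```

If it reports that nothing is listening, check that Cassandra finished starting before digging into client code.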

I recommend spending a little time in the cassandra-cli command line interface tool, experimenting with the commands and becoming a little more familiar with Cassandra. The help command inside the tool is your best bet at the moment for discovering the commands you can try. The cassandra.yaml configuration file gives you the name of the default keyspace, Keyspace1, and the column families to try querying. (When I did this, I found my queries didn't seem to be working as I'd expected, but I stubbornly moved along to the next step anyway.)

After getting Cassandra running, the first thing we can try is to get zznate's Hector example code from github and have a go at making those run. Either download the zip file or "git clone" the repository.

[Note... I use Eclipse and have the Maven 2 m2eclipse plugin installed, and am developing on Ubuntu Lucid Lynx. I have 4 GB of memory with a dual core CPU, running on a notebook computer. You may choose to use whatever hardware, tools and OS you prefer, but my observations are based on this environment. Cassandra wants to use a lot of memory, so please take that into account when configuring your development environment.]

I imported the mavenized project, hector-examples, into Eclipse. Because I'm using Maven and Riptano has graciously provided a maven repository for Hector [update: as of 1/11/2011 Hector is on Maven Central repository and the Riptano repository is deprecated], lots of magic happens at this point. Once Maven finished doing its thing, I immediately found 2 issues that needed to be resolved:

  1. I needed to update the hector dependency from 0.7.0-17 to 0.7.0-18 in pom.xml
  2. The DeleteBatchMutate class didn't compile due to using org.apache.cassandra.thrift.Clock, which has been removed from Thrift, so I needed to change the code to use long instead of Clock.
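
For the second fix, the idea is simply to pass a plain long timestamp wherever a Clock used to go. The sketch below shows one way to produce such a value with only the JDK. I'm assuming the usual Cassandra client convention of microseconds since the Unix epoch; the class and method names are hypothetical and are not taken from the example project.

```java
import java.util.concurrent.TimeUnit;

/** Produces long timestamps of the kind that replace the removed Clock type. */
final class TimestampSketch {

    /** Converts milliseconds since the epoch to microseconds since the epoch. */
    static long toMicros(long epochMillis) {
        return TimeUnit.MILLISECONDS.toMicros(epochMillis);
    }

    /** Current time in microseconds (only millisecond precision is available here). */
    static long nowMicros() {
        return toMicros(System.currentTimeMillis());
    }

    public static void main(String[] args) {
        // Assumption: the value is handed to the client call that used to take a Clock.
        long timestamp = nowMicros();
        System.out.println("timestamp to pass instead of a Clock: " + timestamp);
    }
}
```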

Hopefully, by the time you read this, those issues will already be fixed.

Next, I tried to get all of the examples to run, planning to then spend some time modifying them to learn more about how Cassandra and the client work. I quickly found that the examples did not run because no keyspace, column families or super column families were configured. How did I determine that? After observing the error messages when running the examples, I found the PID of Cassandra, ran jconsole against it, inspected the MBeans, and saw I had nothing configured.

[Note: Cassandra does not come with mx4j-tools.jar in the lib folder, and it happily notifies you of that when you start it up. That jar only adds MX4J's HTTP interface to JMX; jconsole ships with the JDK and can attach to Cassandra without it. If you want the MX4J interface, you'll have to download the jar yourself and drop it in the lib folder.]
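
The same MBean inspection can be done from code with the JMX classes in the JDK, which is handy if you'd rather not click through jconsole each time. This is only a sketch under assumptions: I'm assuming JMX is reachable on localhost:8080 (what I believe the 0.7 startup script uses; pass your own host:port as the first argument otherwise) and that Cassandra's MBeans live in domains starting with org.apache.cassandra. The class name MBeanLister is made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Lists the MBeans a JVM exposes, similar to browsing the MBeans tab in jconsole. */
final class MBeanLister {

    /** Returns the sorted canonical names of all MBeans matching the given pattern. */
    static Set<String> listNames(MBeanServerConnection conn, String pattern) throws Exception {
        Set<String> names = new TreeSet<>();
        for (ObjectName name : conn.queryNames(new ObjectName(pattern), null)) {
            names.add(name.getCanonicalName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: JMX host:port; 8080 is a guess at the 0.7 default.
        String hostPort = args.length > 0 ? args[0] : "localhost:8080";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Assumption: Cassandra registers its MBeans under org.apache.cassandra.* domains.
            for (String name : listNames(conn, "org.apache.cassandra.*:*")) {
                System.out.println(name);
            }
        }
    }
}
```

If the output shows no entries mentioning your keyspace or column families, that matches what I saw: nothing had been configured.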

My understanding was that everything in the cassandra.yaml configuration file should have been created the very first time my localhost Cassandra node was run, but any changes thereafter would need to be made via Thrift or JMX. [update: that was an incorrect assumption, the default keyspace will not be created as of Cassandra 0.7] So, I thought I should have seen my column families and super column families, but they weren't there. I don't know if I'm just misunderstanding or I missed a step somewhere. One possible explanation is that before Beta2 of Cassandra was released, I had installed a nightly build (post beta1) that matched up with a Hector build, and that nightly build might have had the initialization step broken. I didn't delete everything to start with a clean slate before upgrading to the 0.7 Beta2 version. I don't know if starting from a clean beta2 would have helped or not, but I suspect it would not have made a difference.

Anyway, I viewed this as an opportunity to try my hand at creating my own keyspace, column families and super column families. If you happen to already have these installed, then you're in good shape for running the example code, but sooner or later you're going to need to learn how to create these yourself, so let's walk through that now.

Start up cassandra-cli and configure the keyspace:

create keyspace Keyspace1 with replication_factor = 1

For a single node on your localhost, replication_factor has to be 1.

[Note: the documentation says to include the placement strategy in the command, such as "placement_strategy = org.apache.cassandra.locator.RackUnawareStrategy", but that doesn't work and the cli expects an integer. I haven't yet found a mapping of placement strategies to integers, but am still looking.]

Next configure the 2 column families following the rudimentary instructions given on this page of the Cassandra wiki.

[default@unknown] use Keyspace1
Authenticated to keyspace: Keyspace1
[default@Keyspace1] create column family Standard1 with column_type = 'Standard' and comparator = 'BytesType'
922a9664-bb01-11df-a919-e700f669bcfc
[default@Keyspace1] create column family Standard2 with column_type = 'Standard' and comparator = 'UTF8Type' and rows_cached = 10000
99ed2115-bb01-11df-a919-e700f669bcfc

Please take note of the identifier data that is spewed out after running a successful create command and save those values somewhere (example: 922a9664-bb01-11df-a919-e700f669bcfc). I have not yet found a way to list those values back out, and you may need them in the future.
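
Those identifiers have the shape of time-based (version 1) UUIDs, so Java can at least parse them and tell you when they were generated. The sketch below is a small JDK-only illustration. What Cassandra uses the value for internally is not something I've confirmed, and the class name SchemaIdInspector is hypothetical.

```java
import java.util.Date;
import java.util.UUID;

/** Parses an identifier printed by the cli and decodes its embedded timestamp. */
final class SchemaIdInspector {

    // Number of 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    /** Returns the creation time of a version 1 UUID as milliseconds since the Unix epoch. */
    static long toEpochMillis(UUID id) {
        if (id.version() != 1) {
            throw new IllegalArgumentException("not a time-based UUID: " + id);
        }
        return (id.timestamp() - UUID_EPOCH_OFFSET) / 10_000L;
    }

    public static void main(String[] args) {
        // The value printed after creating Standard1 in the session above.
        UUID id = UUID.fromString("922a9664-bb01-11df-a919-e700f669bcfc");
        System.out.println("version: " + id.version());
        System.out.println("generated: " + new Date(toEpochMillis(id)));
    }
}
```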

The "describe keyspace" command now shows the 2 column families:

[default@unknown] use Keyspace1
Authenticated to keyspace: Keyspace1
[default@Keyspace1] describe keyspace Keyspace1
Keyspace: Keyspace1
  Replication Factor: 1
  Column Families:
    Column Family Name: Standard2 {
      Column Family Type: Standard
      Column Sorted By: org.apache.cassandra.db.marshal.UTF8Type
    }
    Column Family Name: Standard1 {
      Column Family Type: Standard
      Column Sorted By: org.apache.cassandra.db.marshal.BytesType
    }

We also need super column families for the example code. After finding no documentation anywhere on how to create a super column family, trial and error led me to this command:

create column family Super1 with column_type=Super and comparator=BytesType

Now the describe command shows enough to get going on running the examples:

describe keyspace Keyspace1
Keyspace: Keyspace1
  Replication Factor: 1
  Column Families:
    Column Family Name: Super1 {
      Column Family Type: Super
      Column Sorted By: org.apache.cassandra.db.marshal.BytesType
    }
    Column Family Name: Standard2 {
      Column Family Type: Standard
      Column Sorted By: org.apache.cassandra.db.marshal.UTF8Type
    }
    Column Family Name: Standard1 {
      Column Family Type: Standard
      Column Sorted By: org.apache.cassandra.db.marshal.BytesType
    }
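
If the difference between the Standard and Super types in that output is still fuzzy, it may help to picture them as nested sorted maps. The sketch below is only a mental model in plain Java, not how Cassandra stores data and not the Hector API: a standard column family maps a row key to sorted columns, and a super column family adds one more level in between. Java's natural String ordering stands in for the comparator, and all names are invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of standard and super column families as nested sorted maps. */
final class ColumnFamilyModel {

    // Standard: row key -> (column name -> value), columns kept sorted by name.
    final Map<String, TreeMap<String, String>> standard = new TreeMap<>();

    // Super: row key -> (super column name -> (column name -> value)).
    final Map<String, TreeMap<String, TreeMap<String, String>>> superCf = new TreeMap<>();

    /** Stores a column in the standard column family. */
    void insert(String rowKey, String column, String value) {
        standard.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Stores a column inside a super column of the super column family. */
    void insertSuper(String rowKey, String superColumn, String column, String value) {
        superCf.computeIfAbsent(rowKey, k -> new TreeMap<>())
               .computeIfAbsent(superColumn, k -> new TreeMap<>())
               .put(column, value);
    }

    public static void main(String[] args) {
        ColumnFamilyModel model = new ColumnFamilyModel();
        model.insert("jsmith", "last", "Smith");
        model.insert("jsmith", "first", "John");
        model.insertSuper("jsmith", "address", "city", "Austin");
        // Columns come back sorted by name, regardless of insertion order.
        System.out.println(model.standard);
        System.out.println(model.superCf);
    }
}
```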

At this point I suggest running all of the Java examples and spending some time on each one, making little modifications until you feel comfortable with them. Also take some time to cross-reference them with the Thrift API, because that will aid your overall understanding. The example code closely resembles the examples in the PDF documentation file for Hector, so you can read the description of each operation as you try to run and understand it.

I see that this post is becoming really long, so I'm going to break it up into multiple posts. I think at this point we have enough information to get started and experiment with some working code examples. In a future post I'll walk through the steps I'm following to create a very simple application that uses a small 3-node Cassandra cluster.

Filed under: Java
Comments (6)
  1. never too late to add a comment I suppose ;)

    any change of sharing your experience in getting the 3-node cluster up and running?

  2. (that should’ve been chance, not change… )

  3. I’m just slow, keep getting distracted by the temptation of Android. I’ve put the new post up, but will add more details, screenshots, and corrections here & there as I find them.

  4. I’m a total newbie with this stuff.
    Snow Leopard 10.6.6
    Downloaded Cassandra: apache-cassandra-0.7.3-bin.tar
    Didn’t download Thrift.
    But Cassandra appears to hang when run “raw”.

    Problems arose- Should I use Hector?
    How to use it?

    [Ran Cassandra as directed]

    apache-cassandra-0.7.3: sudo ./bin/cassandra -f
    Password:
    INFO 21:10:01,918 Logging initialized
    INFO 21:10:01,930 Heap size: 1052770304/1052770304
    INFO 21:10:01,932 JNA not found. Native methods will be disabled.
    INFO 21:10:01,940 Loading settings from file:/Users/robertfutrelle/Research/Cassandra/apache-cassandra-0.7.3/conf/cassandra.yaml

    [many lines omitted]

    INFO 21:10:02,556 Binding thrift service to localhost/10.0.1.3:9160
    INFO 21:10:02,558 Using TFastFramedTransport with a max frame size of 15728640 bytes.
    INFO 21:10:02,560 Listening for thrift clients…

    [then, nothing]
    ??

  5. When Cassandra logs “Listening for thrift clients…”, that’s a good thing. It means it’s waiting for a client to connect to it and do something. Hector is a Java client, so you could use Hector, but a simpler way to do a sanity check and confirm that it’s responding without having to write code is to use the cassandra-cli command line client. There’s a cassandra-cli tutorial on the Cassandra wiki. Since I wrote this post, Datastax has written some great Cassandra documentation that might also be helpful to you.

  6. You can use the following command to create all of the keyspaces, column families and super column families that the examples use.

    bin/cassandra-cli -host localhost --file conf/schema-sample.txt

