My PhD's a go! - Graph database with the distributed nature of Cassandra and the Graph properties of Neo4J




Tags: legacy, archive, cassandra, chris-walshaw, distributed-database, distributed-graph-database, flockdb, graph, graph-database, graph-partitioning, graph-processing, hadoop, haskell, hbase, jvm, neo4j, phd, titan, general

Published 2013-06-28 on legacy zcourts.com. Estimated reading time: 3 min.


I've been thinking about this for a while now and I've finally made a solid decision. At some point later this year (probably October) I'll be starting a PhD.

I've been using Apache Cassandra for years now, since 2008, not long after Facebook open-sourced it. Since then I've played with most of the major NoSQL databases and frameworks (Neo4J, HBase, CouchDB, Hadoop, etc.), and in virtually all of my projects I've repeatedly needed to model graph or graph-like data. In some cases it's worked out great; in others it was a terrible idea, but luckily I've always recognised very early on when the data model is just wrong for that DB, so I haven't wasted time on it.

More to the point though, graph databases are extremely useful. There have been numerous attempts at implementing a graph interface on top of existing DBs: Titan, for example, runs on top of Cassandra, HBase or BerkeleyDB. There are also purpose-built graph DBs such as Twitter's FlockDB or Neo4J.

But I think this area is still under-developed and there's a lot of room for research and improvement.
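To make "a graph interface on top of an existing DB" a bit more concrete, here is a minimal sketch in Python. It is purely illustrative: the `KeyValueGraph` class, its method names and the row/column layout are all hypothetical, a plain dict stands in for the underlying store, and none of this is meant to reflect how Titan or any other real system lays out its data.

```python
from collections import defaultdict


class KeyValueGraph:
    """Toy graph layer over a key-value store (a dict stands in for the real DB)."""

    def __init__(self):
        # Row key: vertex id. Column key: (edge label, neighbour id).
        # Value: a dict of edge properties. This layout is an assumption.
        self.rows = defaultdict(dict)

    def add_edge(self, src, label, dst, **props):
        # One column per outgoing edge, so all edges of a vertex share a row.
        self.rows[src][(label, dst)] = props

    def neighbours(self, vertex, label):
        # A single row read, followed by a filter on the column keys.
        row = self.rows.get(vertex, {})
        return sorted(dst for (lbl, dst) in row if lbl == label)

    def two_hops(self, vertex, label):
        # Every extra hop costs one more row lookup per vertex reached so far.
        reached = set()
        for middle in self.neighbours(vertex, label):
            reached.update(self.neighbours(middle, label))
        return sorted(reached)


graph = KeyValueGraph()
graph.add_edge("alice", "follows", "bob")
graph.add_edge("bob", "follows", "carol")
graph.add_edge("alice", "likes", "carol")
print(graph.neighbours("alice", "follows"))
print(graph.two_hops("alice", "follows"))
```

Even in this toy version, each hop of a traversal turns into another round of row lookups, which hints at why layering a graph on a store that was not designed for it can get expensive.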

My first order of business is keeping Java away from the implementation! Don't get me wrong, I love Java as much as the next guy, but from my experience with Cassandra, HBase and Hadoop, Java adds unnecessary problems to an already complicated situation. I am especially disappointed that to this day Sun/Oracle has not sorted out the JVM's practical memory limit. Seriously, with all the cloud computing malarkey, memory is cheaper than ever and there are massive advantages to keeping as much information in memory as possible, but the JVM craps out at a heap of about 8GB on 64-bit machines. You can try to push it further, and people have been known to use bigger heap sizes, but you're asking for trouble doing that.

What I am considering, and have pretty much decided, is that I'll implement whatever results from my PhD in Haskell, for a few reasons:

1. I get automatic memory management
2. It's a beautiful language
3. No crappy 8GB memory limit
4. I want to learn it (this should probably have been number 1)
5. I still get cross-platform support
6. Better interfacing with native C/C++ libraries, because some things are just better left in C

I'm familiar with Haskell, so part of this will just be getting to know it as well as I know Java, PHP, Scala, etc. My other reasoning is simply that I am combining things I'm enthusiastic about: "Big Data" and the promise of learning a language that intrigues me. It'll be fun!

Ultimately, the more serious side of my research will focus on developing a completely distributed graph database. The real challenge, the big problem I'm going after, is automatic partitioning of graph data across a cluster. Imagine a database with the horizontal scalability of Cassandra (add machines as opposed to CPU/RAM) but with the graph processing capabilities of Neo4J.
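A small sketch of why partitioning is the hard part, in Python and purely for illustration. It compares placing vertices by hashing their ids (the way a key-value store would typically spread keys) with a hand-picked placement that keeps tightly connected vertices together, and counts the "edge cut": the edges whose endpoints land on different machines. The function names, the two-machine cluster and the tiny example graph are all assumptions made up for this sketch; real graph partitioners are far more involved.

```python
import zlib


def hash_partition(vertex, num_machines):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(vertex.encode("utf-8")) % num_machines


def edge_cut(edges, placement):
    # Count the edges whose two endpoints sit on different machines.
    return sum(1 for src, dst in edges if placement[src] != placement[dst])


# Hypothetical graph: two triangles (a, b, c) and (x, y, z) joined by one edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
]
vertices = sorted({v for edge in edges for v in edge})

# Placement 1: hash each vertex id, ignoring the graph structure entirely.
hashed = {v: hash_partition(v, 2) for v in vertices}

# Placement 2: hand-picked, keeping each triangle on its own machine.
clustered = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}

print("hash placement cut:", edge_cut(edges, hashed))
print("clustered placement cut:", edge_cut(edges, clustered))  # 1: only ("c", "x") crosses
```

Every cut edge is a network hop during a traversal, so the goal of automatic partitioning is to find placements like the second one without a human picking them, and to keep them good as the graph changes.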

I'm fortunate enough to know Chris Walshaw, who is well established in the graph partitioning/processing space and has agreed to be my project supervisor.

For once I'll be in education and enjoying it, because it'll be on my terms and doing things I like. Pretty exciting times ahead!
