LevelDB: SSTable and Log Structured Storage
Ilya Grigorik digs into LevelDB’s SSTable and log structured storage1:
If Protocol Buffers is the lingua franca of individual data record at Google, then the Sorted String Table (SSTable) is one of the most popular outputs for storing, processing, and exchanging datasets. As the name itself implies, an SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.
Even if not very talked about, LevelDB is making notable contributions to the NoSQL space: the leveled compaction strategy in Cassandra 1.0 is based on LevelDB and Riak ships with LevelDB since 1.0.
-
Make sure you are not missing Writes Performance: B+Tree, LSM Tree, Fractal Tree ↩
Original title and link: LevelDB: SSTable and Log Structured Storage (NoSQL database©myNoSQL)
The Design of 99designs - A Clean Tens of Millions Pageviews Architecture
By pure coincidence, General Chicken just published on High Scalability a bullet point summary of the 99designs architecture I’ve linked and commented on earlier.
Original title and link: The Design of 99designs - A Clean Tens of Millions Pageviews Architecture (NoSQL database©myNoSQL)
99designs: Powered by Amazon RDS, Redis, MongoDB, and Memcached
While the authoritative storage is Amazon RDS, 99designs is using Redis, MongoDB, and Memcached for transient data:
We log errors and statistics to capped collections in MongoDB, providing us with more insight into our system’s performance. Redis captures per-user information about which features are enabled at any given time; it supports our development stragegy around dark launches, soft launches and incremental feature rollouts.
It’s also worth noting the nice things they say about using Amazon RDS:
An RDS instance configured to use multiple availability zones provides master-master replication, providing crucial redundancy for our DB layer. This feature has already saved our bacon multiple times: the fail over has been smooth enough that by the time we realised anything was wrong, another master was correctly serving requests. Its rolling backups provide a means of disaster recovery. We load-balance reads across multiple slaves as a means of maintaining performance as the load on our database increases.
Original title and link: 99designs: Powered by Amazon RDS, Redis, MongoDB, and Memcached (NoSQL database©myNoSQL)
Data Grid or NoSQL? What are the common points? The main differences?
A great post by Olivier Mallassi on a topic that comes up very often: how do data grids and NoSQL databases compare?
- Data Grids enable you controlling the way data is stored. They all have default implementation (Gigaspaces offers RDBMS by default, Gemfire offers file and disk based storage by default….) but in all cases, you can choose the one that fits your needs: do you need to store data, do you need to relieve the existing databases….
- In order to minimize the latency, data grids enable you to store data synchronously (write-through) or asynchronously (write-behind) on disk. You can also define overflow strategies. In that case, data is store in memory up to a treshold where data is flushed on disk (following algorithms like LRU …). NoSQL solutions have not been designed to provide these features.
- Data grids enable you developing Event Driven Architecture.
- Querying is maybe the point on which pure NoSQL solutions and data grids are merging.
- Data grids enable near-cache topologies.
Taking a step back you’ll notice that there are actually more similarities than differences. While Oliver Mallasi lists the above points as features that prove data grids as being more configurable and so more adaptable, some of these do exist also in the NoSQL databases taking different forms:
- pluggable storage backends. Not many of the NoSQL databases have this feature,but Riak and Project Voldemort are offering different solutions that are optimized for specific scenarios.
- replicated and durable writes. Not the same as synchronous vs asynchronous writes, but a different perspective on writes.
- Notification mechanisms. Once again not all of the NoSQL databases support notification mechanisms, but a couple of them have offer some interesting approaches:
- CouchDB: _changes feed with filters
- Riak: pre-commit and post-commit hooks
- HBase coprocessors
- Most of the NoSQL database have local per-node caches.
With these, I’ve probably made things even blurrier. But let me try to draw a line between data grids and NoSQL databases. Data grids are optimized for handling data in memory. Everything that spills over is secondary. On the other hand, NoSQL databases are for storing data. Thus they focus on how they organize data (on disk or in memory) and optimize access to it. Data grids are a processing/architectural model. NoSQL databases are storage solutions.
Original title and link: Data Grid or NoSQL? What are the common points? The main differences? (NoSQL database©myNoSQL)
Redis and Python: Building a Markov-chain IRC bot
Charles Leifer:
As an IRC bot enthusiast and tinkerer, I would like to describe the most enduring and popular bot I’ve written, a markov-chain bot. Markov chains can be used to generate realistic text, and so are great fodder for IRC bots.
Redis acts, in many ways, like a big python dictionary that can store several types of useful data structures. For our purposes, we will use the set data type. The top-level keyspace will contain our “keys”, which will be encoded links in our markov chain. At each key there will be a set of words that have followed the words encoded in the key. To generate “random” messages, we’ll use the “SRANDMEMBER” command, which returns a random member from a set.
- You could use other NoSQL database for this, but you’d miss Redis’s support for sets and the O(1) SRANDMEMBER
- On the other hand imagine storing the corpus in a graph database where the nodes would represent the words and vertices would carry to pieces of information: the frequency of the connection in the corpus and the frequence of the connection used to generate the output.
Original title and link: Redis and Python: Building a Markov-chain IRC bot (NoSQL database©myNoSQL)
Calculating a Graph's Degree Distribution Using R MapReduce over Hadoop
Marko Rodriguez is experimenting with R on Hadoop and one of his exercises is calculating a graph’s degree distribution. I confess I had to use Wikipedia for reminding what’s the definition of a node degree:
- The degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes. The degree distribution P(k) of a network is then defined to be the fraction of nodes in the network with degree k.
- The degree distribution is very important in studying both real networks, such as the Internet and social networks, and theoretical networks.
As an imagination exercise think of a graph database that’s actively maintaining an internal degree distribution and uses it to suggest or partition the graph. Would that work?
Original title and link: Calculating a Graph’s Degree Distribution Using R MapReduce over Hadoop (NoSQL database©myNoSQL)
MongoDB vs MySQL: A DevOps point of view
Pierre Bailet and Mathieu Poumeyrol of fotopedia (a French photo site) share their experience of operating a small MongoDB cluster since Sep.2009 compared to a MySQL cluster.
Some details about fotopedia:
- fotopedia is 100% on AWS
- Amazon RDS for MySQL
- 4 nodes MongoDB cluster
- 150mil. photo views
MongoDB advantages:
- no alter table
- background index creation
- data backup & restoration
- note: as far as I can tell MySQL is able to do the same
- replica sets
- hardware migration
- note: the same procedure can be used for MySQL
Before leaving you with the slides, here is an interesting accepted trade-off:
Quietly losing seconds of writes is preferable to:
- weekly minutes-long maintenance periods
- minutes-long unscheduled downtime and manual failover in case of hardware failures
Original title and link: MongoDB vs MySQL: A DevOps point of view (NoSQL database©myNoSQL)
Whirr and Hadoop Quickstart Guide: Automating a Rackspace Hadoop Cluster
Even if most of the examples show Whirr in action on the Amazon cloud, Whirr it’s cloud-neutral. Bob Gourley uses Whirr to fire up a CDH1 cluster on Rackspace.
-
Cloudera Distribution of Hadoop. ↩
Original title and link: Whirr and Hadoop Quickstart Guide: Automating a Rackspace Hadoop Cluster (NoSQL database©myNoSQL)
Using Twitter Storm to analyze the Twitter Stream
Francisco Jordano introduces briefly the concepts and provides some good resources for learnign about Twitter Storm just to present his experiment of using Twitter Storm for analyzing the Twitter (nb: the project is on GitHub ):

That’s how the information will flow, and the kind of tasks that we will execute. Yes it’s more effective to group some of those tasks, but remember, we just wanted to give this a try ;P
Worth emphasizing is that nowhere in the post is any reference of him having any troubles finding documentation or getting Twitter Storm up and running.
Original title and link: Using Twitter Storm to analyze the Twitter Stream (NoSQL database©myNoSQL)
Research in the MapReduce Space
Over the weekend I’ve read two papers presenting products or research related to improving or adding new capabilities to the MapReduce data processing approach. The first of them comes from a team at Microsoft and is describing TiMR a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzin - a SQL implementation on the MapReduce framework. It’s great to learn that while the Hadoop community is eliminating some of the initial limitations and hardening the technical details of the platform, there are already ideas and systems out there that augment the capabilities of the MapReduce data processing model.
Original title and link: Research in the MapReduce Space (NoSQL database©myNoSQL)
Paper: Tenzing A SQL Implementation on the MapReduce Framework
This recent paper from a team at Google is presenting details about Tenzing a system that is currently in use at Google:
Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.
A couple of things I’ve highlighted when reading it:
- Tenzing is in production, but doesn’t serve yet a huge amount of queries
- the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, MySQL databases
- when compared with other similar solutions (Sawzall, Flume-Java, Pig, Hive„ HadoopDB), Tenzing’s advantage is low latency
- the paper acknowledges AsterData, GreenPlum, Paraccel, Vertica for using a MapReduce execution model in their engines
- to perform query optimizations, Tenzing is enhancing queries with information from a metadata server
- there is no information about what kind of metadata is needed in Tenzing. I assume it might refer to details about the data sources and data source metadata (indexes, access patterns, etc)
- to reduce query latency, processes are kept running
- Tenzing supports almost all SQL92 standard and some extensions from SQL99
- projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
- set operations (implemented in the reduce phase)
- nested queries and subqueries
- aggregation and statistical functions
- analytic functions (syntax similar to PostgreSQL/Oracle)
- OLAP extensions
-
JOINs:
Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.
Read and download the “Tenzing A SQL Implementation on the MapReduce framework” after the break.
Download PDFOriginal title and link: Paper: Tenzing A SQL Implementation on the MapReduce Framework (NoSQL database©myNoSQL)
Paper: TiMR is a Time-oriented data processing system in MapReduce
From the “Temporal Analytics on Big Data for Web Advertising” paper:
TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS1. Users express time-oriented analytics using a temporal (DSMS) query lan- guage such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than equivalent custom code for time-oriented applications. TiMR allows the temporal queries to transparently scale on offline temporal data in a cluster by leveraging existing M-R infrastructure.
Broadly speaking, TiMR’s architecture of compiling higher level queries into M-R stages is similar to that of Pig/SCOPE. However, TiMR specializes in time-oriented queries and data, with several new features such as: (1) the use of an unmodified DSMS as part of compilation, parallelization, and execution; and (2) the exploitation of new temporal parallelization opportunities unique to our setting. In addition, we leverage the temporal algebra underlying the DSMS in order to guarantee repeatability across runs in TiMR within M-R (when handling failures), as well as over live data.
According to the paper, DSMS work well for real-time data, but are not massively scalable. On the other hand, Map-Reduce is extremely scalable, but computation is performed on offline data. TiMR proposes a solution that is getting closer to a real-time map-reduce.
Read or download the paper after the break.
Download PDF-
Data Stream Management System ↩
Original title and link: Paper: TiMR is a Time-oriented data processing system in MapReduce (NoSQL database©myNoSQL)
Hadoop and NoSQL in a Big Data Environment with Ron Bodkin
Ron Bodkin interviewed by Michael Floyd over InfoQ describes the Hadoop growing addiction:
People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop are complementing traditional data warehouses I just mentioned, where the goal is to take some of the pressure of the data warehouse, start to be able to process less structured data more effectively and to be able to do transformations and build summaries and aggregates, but not have to have all that data loaded to the data warehouse. But then the next thing that happens is once people have started doing that level of processing they realize there is a power of being able to ask questions they never thought of before the data, they can store all the data in small samples and they can go back and have a powerful query engine, a cluster of commodity machines that lets them dig into that raw data and analyze it new ways ultimately leading to data science being able to do machine learning and being able to discover patterns in data and keep them improving and refining the data.
The interview is only 16 minutes long and you have the full transcript.
Original title and link: Hadoop and NoSQL in a Big Data Environment with Ron Bodkin (NoSQL database©myNoSQL)
Cassandra at SocialFlow with Drew Robb - Powered by NoSQL
To alternate a bit after yesterday’s educational CQL: SQL for Cassandra in the Cassandra NYC 2011 video series from DataStax, today’s video is Drew Robb covering Cassandra usage at SocialFlow for capturing real-time data from Twitter and Bit.ly.
For watching more videos from this event follow the Cassandra NYC 2011 tag.
Original title and link: Cassandra at SocialFlow with Drew Robb - Powered by NoSQL (NoSQL database©myNoSQL)
XFS: the filesystem of the future?
Jonathan Corbet summarizing a presentation about the present and future of XFS by Dave Chinner:
XFS is often seen as the filesystem for people with massive amounts of data. It serves that role well, Dave said, and it has traditionally performed well for a lot of workloads. Where things have tended to fall down is in the writing of metadata; support for workloads that generate a lot of metadata writes has been a longstanding weak point for the filesystem. In short, metadata writes were slow, and did not really scale past even a single CPU.
After the break the video of Dave Chinner’s presentation, “XFS: Recent and Future Adventures in Filesystem scalability”.
Even if it’s very long, make sure you check the comment thread.
Original title and link: XFS: the filesystem of the future? (NoSQL database©myNoSQL)
CQL: SQL for Cassandra with Eric Evans - NoSQL videos
The fine folks from DataStax have made available the presentations from their Cassandra NYC 2011 event.
The first video to post here is Eric Evans’s presentation on Cassandra Query Language.
CQL: SQL for CassandraFor watching more videos from this event follow the Cassandra NYC 2011 tag.
Original title and link: CQL: SQL for Cassandra with Eric Evans - NoSQL videos (NoSQL database©myNoSQL)
A bit of history around Hadoop Companies
Imagine this story 5 years from now… with all the scars from the battle competition with the other companies trying to monetize Hadoop and … the millions in the bank.
Original title and link: A bit of history around Hadoop Companies (NoSQL database©myNoSQL)
Designing HBase Schema to Best Support Specific Queries
Real scenario, very good analysis of different data access requirements, and three possible solutions. What’s your pick?
The problem is fairly simple - I am storing “notifications” in hbase, each of which has a status (“new”, “seen”, and “read”). Here are the API’s I need to provide:
- Get all notifications for a user
- Get all “new” notifications for a user
- Get the count of all “new” notifications for a user
- Update status for a notification
- Update status for all of a user’s notifications
- Get all “new” notifications accross the database
- Notifications should be scannable in reverse chronological order and allow pagination.
Original title and link: Designing HBase Schema to Best Support Specific Queries (NoSQL database©myNoSQL)
Modeling A/B Tests With Cassandra
A Cassandra data modeling session around a real-life scenario: tracking data for A/B tests:
With most things in life data modeling in Cassandra can be compared to learning to ride a bike. It can be scary, you might fall off, but in the end once you learn a few fundamental concepts everything will be easier to do. The goal of this article is to get you comfortable with a basic data modeling scenario that you will likely see in the real world.
Original title and link: Modeling A/B Tests With Cassandra (NoSQL database©myNoSQL)
What's the big deal about Big Data?
Roger Ehrenberg (Founder and Managing Partner of IA Ventures):
Every so often a term becomes so beloved by media that it moves from “instructive” to “hackneyed” to “worthless,” and Big Data is one of those terms. […] But since this time the term Big Data has become diluted. Very diluted. So much so that it is almost totally meaningless. Does Big Data mean new kinds of databases? Sure. Does it mean innovative ways to visualize data to create actionable intelligence? Absolutely. Can it be applied to the health care sector? Without question. Has it contributed to the rise of the Data Scientist? Mos def.
And I thought I was off.
Original title and link: What’s the big deal about Big Data? (NoSQL database©myNoSQL)