
NoSQL

Permissions with Django-nonrel

Django nonrel / NoSQL blog - Tue, 08/31/2010 - 11:44

Quick update: Florian Hahn has implemented a solution for permission handling with Django on non-relational databases. Django's own permission system unfortunately requires JOIN support and thus doesn't work. After his original announcement the code has been optimized, so a permission check can be done with just two database operations. Also, his backend now scales with the number of users. Florian has posted installation and usage instructions on his blog. Check it out if you need permission support in your project.

Categories: Blogs, NoSQL

CouchDB: The Definitive Guide — Redesigned Website, Up To Date Content and All Open Source And Forkable on GitHub

Couchio - About CouchDB - Sat, 08/28/2010 - 23:22

Last week we (Noah, Chris and Jan, with great design help from Kristina Schneider) released our latest work on CouchDB: The Definitive Guide.

The latest update includes:

  • Updated and edited content as it appears in the printed book.
  • All new styles for your reading pleasure.
  • All new open source setup using Github for contributions and translations.

It took quite a while to get it all out, but we couldn’t be more proud about what we can offer you now. O’Reilly’s great editorial team went through all our chapters and cleaned up all the nitty-gritty and we think it turned out great. It feels almost like a real book now ;)

The Long Haul

The bulk of the work went into making sure the book is super easy to work on in the future. We want to be able to improve the book and collaborate with the open source community as much and as efficiently as possible.

Minimal Technology

This includes switching the source format from Asciidoc to a subset of HTML. That lets us work with the native web, our primary publication platform, without having to deal with conversion issues left and right (trust us on that one).

The minimal nature of our markup might throw you off (we don’t even close our <p>’s!) but it is really great to write in and keeps everything lean and free from technological cruft that is not really needed. We encourage you to view source at least once.

A small XSLT (yuck, we know ;) transforms our minimal HTML into DocBook which O’Reilly in turn can take and produce a book from every once in a while.

Github Goodness

We’ve all been using Git and Github for quite some time in other projects and we finally migrated our book repository over (yes, Kristina is an avid Git user :). Half way through, Github introduced Organizations and it is just perfect for our needs.

You can now fork, edit, and contribute back to the book without much hassle. This is a good opportunity to thank the folks at O’Reilly again for their commitment to open source and for allowing us to publish our work under the Creative Commons license.

Aftermath

Getting everything together truly felt like shipping a 1.0 project. Once we agreed on a final date, we cut our todo list in half in favour of getting done in time. There is still work to do, but we managed to get all the rough edges out.

The result is amazing: in the few days we’ve been live, more work on the content has been done than in the past six months. We have already included contributions from third-party contributors, and the German translation is already two chapters in.

And nothing is stopping you from helping out :)

Here’s to the second edition!

— Chris, Noah, Jan & Kristina

Categories: Companies, NoSQL

What’s new in CouchDB 1.0 — Part 4: Security’n stuff: Users, Authentication, Authorisation and Permissions

Couchio - About CouchDB - Sat, 08/28/2010 - 22:22

Welcome to Part 4 of my little mini-series on new features in CouchDB 1.0. Do not miss parts one, two and three.

Today, I get a little help from Rebecca. She’s writing a CouchApp, an application that is served right out of CouchDB and that lives in the browser. It has no middle tier application server in Ruby or Java. The application and display logic is written in JavaScript, the user interface is HTML & CSS, the backend is CouchDB and uses Ajax to shove JSON back and forth.

Rebecca is writing a small todo list app for herself and her friends, but we’ll punt on the actual application for now, so we can concentrate on the security features. Parts two and three of our book CouchDB: The Definitive Guide explain how the rest of the application development works; make sure to read up on it!

Security

Security is a wide field. This article only discusses some of the things you need to know to add authenticated-user features to your CouchDB application (whether it is a CouchApp or a regular application). It does not discuss best practices for securing network servers or defending against cross-site scripting.

View Source is Open Source

The CouchDB security model is based on the premise that Rebecca can control who can create documents of what form in which database inside CouchDB. It does not try to make CouchDB, and all the data she and others put in, absolutely water-tight so that no information can leak. Although you can lock CouchDB down as much as you need, open and sharable databases are the default, and that is a good thing.

Traditional applications are built with a database in the back; an application is the only logical “user” of a database, the only entity that accesses the database directly (aside from maybe an administrator). CouchDB happily supports that model (there is nothing wrong with it either). The only case where security is relevant here is shared hosting where you have multiple mutually untrusting parties accessing a single CouchDB instance. The mechanics I describe can be used to make CouchDB useful here, but I won’t describe this specific scenario. Partly because it is a lot simpler to just give every user a separate full instance of CouchDB with “root” access, but mostly because I believe there is a much more interesting deployment scenario:

The idea of standalone CouchApps is that they travel with the data, since they are just some HTML, CSS & JavaScript that CouchDB tacks onto a _design document as attachments. Applications as data allow us to replicate them around just like we do with data. Data ultimately wants to be free and shareable with the people and applications we trust. Why shouldn’t applications do the same?

Since CouchApps run in the browser you can’t hide their implementation anywhere. In that, CouchApps are inherently Open Source and we believe that is a good thing because that is how the web works and that’s a crucial feature of the web. View source allows everyone curious to learn how a website was built.

Yadda, yadda, Open Source zealottery, you can’t hear it anymore, sorry :) — If you like your Rails, Django, PHP or Java, CouchDB won’t prevent you from using them. You can create private, closed source applications, but you’re losing the powerful attribute of native app-shareability.

All that said you can make CouchDB as closed as you need it, but with each barrier to entry you lose a layer of data and application flexibility. You might want to reconsider some of your previous ideas about how to lock down your app in the light of ultra-portable, peer-to-peer shareable applications.

Ok, CouchApps are a big deal, you get that now, I’ll shut up. Back to the nuts and bolts of CouchDB security :)

Terminology

Let’s make sure we talk about the same things.

  • Admin Party: CouchDB by default comes in the admin party mode. Each request made to CouchDB is considered to come from an admin. This means it is extremely easy to get started. Isn’t that terribly insecure, you ask? By default CouchDB will only listen on 127.0.0.1, your localhost IP address. Only users on your computer can access CouchDB. Most of the time that’s just you, so no biggie; but be aware of this when you are on a machine with multiple, possibly untrusting users.

  • Database: A database is a bucket that holds any number of documents in CouchDB. Each CouchDB server can have any number of databases. Each database is self contained and access to CouchDB can be defined on a per-database level. I’ll show you how below.

  • User: A user is identified by a username and matching password that is securely stored inside CouchDB. A user can have one or more roles assigned. A user with an empty name and password is the anonymous user. CouchDB further distinguishes admin users between server admins and database admins.

  • Authentication: The process of a user proving it’s her by providing the correct username / password combination in an authenticated HTTP request.

  • Authorisation: The process of determining whether an authenticated user is allowed to do what she wants to do.

  • Roles: Roles are associated with users; you could also call them “groups”. For example, in Admin Party mode, each request is implicitly authenticated as the anonymous user, which in turn is implicitly assigned the _admin role that allows each request to do anything.

  • Anonymous User: A user with an empty username and password. All unauthenticated requests are implicitly assigned to the anonymous user.

  • Access Control Lists: A list of usernames or roles for a database. CouchDB distinguishes reader-ACLs and admin-ACLs. A database admin can fully access the database. The reader-ACL list defines a list of users or roles that can read from the database. If no reader-ACLs are defined, everybody can read from the database. Note that there is no writer-ACL; see Validation Functions next.

  • Validation Functions: A JavaScript function stored in the validate_doc_update field of a _design document. It gets executed whenever a write request reaches the database. It can decide to allow or deny access to the database based on the document that is being written and the authenticated user or her roles.

  • Stateless HTTP: Each HTTP request stands on its own. Client and server do not assume that any previous requests have been made.

  • Basic Auth: An authentication mechanism for HTTP that uses base64-encoded headers to send a user’s credentials to the server. Most notably known for producing an ugly pop-up window in the browser (although this can be prevented). Base64 encoding is not a form of encryption. It is not a safe transport for user credentials. It is easy for third parties to spy out a user’s password. Basic auth can only reasonably be used in a trusted environment: a local LAN, a VPN or over SSL.

  • Secure Cookie Auth: Unlike Basic Auth, Secure Cookie Auth uses HMAC digests for transporting user credentials. It can be used to securely authenticate users over an untrusted connection.

  • OAuth: Lets users allow applications to authenticate as the user to a service. The canonical example is a web application that does something with a user’s private account data on another service. With OAuth, the web application does not have to know the user’s credentials to do its work. Access permissions can be managed and revoked on a per-application basis. OAuth is not limited to web applications though.
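
To make the Basic Auth entry above concrete, here is a minimal sketch, assuming a Node.js runtime (for Buffer). The two helper names are made up for illustration and are not part of CouchDB or jquery.couch.js. It builds the Authorization header and then reverses it without any key, which is why Basic Auth needs a trusted transport.

```javascript
// Hypothetical helpers for illustration only.
// Build the value of the HTTP "Authorization" header for Basic Auth.
function basicAuthHeader(username, password) {
  const encoded = Buffer.from(username + ":" + password, "utf8").toString("base64");
  return "Basic " + encoded;
}

// Anyone who sees the header can reverse it: base64 is an encoding, not encryption.
function decodeBasicAuthHeader(header) {
  const encoded = header.replace(/^Basic /, "");
  const decoded = Buffer.from(encoded, "base64").toString("utf8");
  const separator = decoded.indexOf(":"); // the username cannot contain ":"
  return {
    username: decoded.slice(0, separator),
    password: decoded.slice(separator + 1)
  };
}

const header = basicAuthHeader("rebecca", "12345");
const recovered = decodeBasicAuthHeader(header);
console.log(header);
console.log(recovered.username, recovered.password);
```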

Getting Started with Security in Futon

Let’s start with a blank slate. A fresh installation of CouchDB 0.11.0 or later, the admin party and a look at Futon:

In the lower right you should see that in fact, we are having an admin party.

Admin Party!

You should also see a link that asks you to “Fix This”. Click it and you should see a form that asks you to specify a username and password for the first server admin user.

Creating an Admin User

I put in rebecca and 12345, you can choose whatever else you like. Just remember it, otherwise, you’re locked out of the server (there are ways to get in again, but that’s beyond the scope of this article.)

The lower right should now greet you with your username and offer you to create more admins or to log out. Futon uses Secure Cookie Authentication to keep you logged in.

Logged In

At this point, CouchDB no longer runs in Admin Party mode and requires you to be logged in to perform certain actions: creating or deleting a database; creating, updating or deleting a _design document inside a database; reading or updating the _config API; reading _stats or _log; and requesting temporary views.

Under the Hood

Let’s see what goes on under the hood of Futon. We’re following the same steps as above, only we use curl on the command line instead of Futon.

First, see if CouchDB is running:

> curl http://127.0.0.1:5984/
{"couchdb":"Welcome","version":"1.0.0"}

Yes!

Next, let’s create an admin user:

> curl -X PUT http://127.0.0.1:5984/_config/admins/rebecca -d '"12345"'
""

Well that was easy. We created a config-level admin user and that takes CouchDB right out of admin party mode (intentionally left over production note: reference “partypooper” in some way).

Now all administrative requests to CouchDB (see above) need to be authenticated. To make your life a little easier Futon does a little dance for you and automatically logs you in as the newly created user.

What does “logs you in” mean? First, Futon creates a new document in the _users database. It has a special format that you have to follow if you are doing this on your own.

Luckily CouchDB’s built-in client libraries couch.js and jquery.couch.js do all the heavy lifting for you.

> cat rebecca.json 
{
  "_id":"org.couchdb.user:rebecca",
  "name": "rebecca",
  "salt": "68cf5946d9760d19759b5016d90f612c",
  "password_sha": "3588a9b2039e53b674d8da361e4be98f00637f5a",
  "type":"user",
  "roles":["admin"]
}
> curl -X PUT "http://127.0.0.1:5984/_users/org.couchdb.user%3Arebecca" \
   -d@rebecca.json
{
   "ok":true,
   "id":"org.couchdb.user:rebecca",
   "rev":"1-9aa9e9a2c855e81061d6d8553d6adbc5"
}

This is laying foundations for the future. Once you have a user document in the _users database, you can use the _session API to get an encrypted session cookie that authenticates you for the next few requests. By default, a session cookie is valid for 10 minutes.

Showing all the cookie business with curl would be a little tedious, so I’ll jump over to how to do all that in your own code.

Aside: What’s the deal with users or admins created with the _config API vs. the _users database? Admins created through the _config API are persisted to CouchDB’s configuration file ($prefix/etc/couchdb/local.ini by default). Users in the _users database are stored in that database. For some setups, it is required that some external tool is able to create a user for CouchDB without having a user or admin account on that CouchDB, but access to the ini file (think system setup software). In addition, _config users are always automatically server admins, so use them with care.

Using jquery.couch.js in Your Application

jquery.couch.js is the standard JavaScript API that ships with CouchDB. Futon uses it for its snazzy interface. And so can you, or should, really, unless you want to re-do all the work the CouchDB project put into it :)

Let me show you the methods in question. I’m quoting right out of the API docs:

$.couch.signup(user_doc, password, options)
  • Hashes the password
  • Adds an empty roles array to the user_doc when not specified
  • Adds an id, composed of “org.couchdb.user:” and name, to the userdoc when not specified
  • Saves the user_doc with options as parameters in the userDb
  • Performs the success callback on the saved user_doc
$.couch.login(options)
  • Does a POST request to “_session” with username and password, they have to be present in the options hash. Throws a 404 error when the password is wrong or there is no user with that username stored in the userDb.
$.couch.logout(options)
  • Does a DELETE request to “_session”.
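
The signup steps listed above can be sketched as follows. This is a hedged illustration for Node.js, not the jquery.couch.js source: prepareUserDoc is a hypothetical name, the salt is passed in explicitly, and the hashing scheme (hex SHA-1 of password followed by salt) is an assumption based on the salt and password_sha fields in rebecca.json.

```javascript
const crypto = require("crypto");

// Hypothetical helper mirroring the documented signup steps; the real
// jquery.couch.js code runs in the browser and may differ in detail.
function prepareUserDoc(userDoc, password, salt) {
  const doc = Object.assign({}, userDoc); // do not mutate the caller's object
  // Assumption: password_sha is the hex SHA-1 of password + salt.
  doc.salt = salt;
  doc.password_sha = crypto.createHash("sha1").update(password + salt).digest("hex");
  // Follow the user document format shown in rebecca.json.
  if (!doc.type) doc.type = "user";
  if (!doc.roles) doc.roles = [];
  if (!doc._id) doc._id = "org.couchdb.user:" + doc.name;
  return doc;
}

const salt = crypto.randomBytes(16).toString("hex");
const prepared = prepareUserDoc({ name: "rebecca" }, "12345", salt);
console.log(prepared._id, prepared.roles, prepared.password_sha.length);
```

Note that the plain-text password never ends up in the document; only the salt and the digest are stored.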

Concepts

The _session API provides you with a convenient endpoint to manage authenticated requests to CouchDB. A simple GET /_session returns a JSON object detailing your current session state.

> curl http://127.0.0.1:5984/_session | jsonpretty 
{
  "userCtx": {
    "name": null,
    "roles": [

    ]
  },
  "info": {
    "authentication_handlers": [
      "oauth",
      "cookie",
      "default"
    ],
    "authentication_db": "_users"
  },
  "ok": true
}

userCtx is where all the authentication information is stored. name is your login username and roles is the list of roles assigned to your user.

info has some server-wide information about the authentication system. authentication_handlers are the different ways CouchDB can do the actual authentication process for you. By default CouchDB ships with an OAuth handler, a cookie handler and the default handler (which does HTTP basic auth). The authentication_db is the database that user documents are stored in. The default is _users, but you can change it in the CouchDB configuration settings. Only do so if you have a very good reason.

ok just lets us know our request was a-ok.

We made an unauthenticated request to CouchDB, so we don’t see any values for userCtx.name or userCtx.roles. Let’s make one with user credentials:

> curl http://rebecca:12345@127.0.0.1:5984/_session | jsonpretty 
{
  "userCtx": {
    "name": "rebecca",
    "roles": [
      "_admin"
    ]
  },
  "info": {
    "authentication_handlers": [
      "oauth",
      "cookie",
      "default"
    ],
    "authenticated": "default",
    "authentication_db": "_users"
  },
  "ok": true
}

The result looks very similar, but this time the values inside userCtx are filled out. We used HTTP basic auth. The details of OAuth authentication are out of the scope of this article, but we sure should feature them at some point.

Cookie authentication, or to use its full name, Secure Cookie Authentication, works by granting access through credentials transported as HMAC digests and one-time tokens. To increase convenience, a one-time token is actually valid for 10 minutes by default, but you can adjust that as needed.

We showed you the login() method of jquery.couch.js earlier, use it to log a user into CouchDB with cookie authentication.
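
If you are not using jquery.couch.js, the login request is easy to put together yourself. The sketch below only describes the request and does not send it; buildSessionLoginRequest is a hypothetical helper, and it assumes a runtime with a global URLSearchParams (modern browsers, Node.js 10 or later).

```javascript
// Hypothetical helper: describes the cookie-auth login request without sending it.
function buildSessionLoginRequest(baseUrl, name, password) {
  // The _session endpoint accepts form-encoded "name" and "password" fields.
  const body = new URLSearchParams({ name: name, password: password }).toString();
  return {
    method: "POST",
    url: baseUrl.replace(/\/$/, "") + "/_session",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: body
  };
}

const loginRequest = buildSessionLoginRequest("http://127.0.0.1:5984/", "rebecca", "12345");
console.log(loginRequest.method, loginRequest.url);
console.log(loginRequest.body);
```

On success the server should answer with a Set-Cookie header; sending that cookie back on later requests keeps you logged in until it expires. A DELETE to the same URL logs you out again.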

Roles

With roles you can group multiple users. We’ll show you in a bit how roles allow you to define permissions on the CouchDB server and individual databases. A role is a simple string that doesn’t start with an underscore. Underscore-roles are reserved for CouchDB. Your roles can be anything, really. The only role that CouchDB prescribes is the _admin role (with an underscore, see?). It grants the user server-wide privileges to do anything.

ACLs, Database Admins & Validation Functions

To allow more fine-grained control over who can read from your databases, CouchDB comes with Access Control Lists (ACLs), Database Admins and Validation Functions.

Each database in CouchDB comes with its own security object. It is not a document, but simply a JSON structure associated with the database. On a newly created database, it looks like this:

{}

The empty object, duh :)

You can set two properties, admins and readers. Each is itself a JSON object with two properties, roles and names, which are lists of roles and usernames respectively.

Here is an example:

{
  "admins": {
    "roles": [],
    "names": ["rebecca"]
  }
}

For this database, our user rebecca is the admin. A database admin has full read access to the database as well as the ability to update the security object. You can add more users:

{
  "admins": {
    "roles": [],
    "names": ["rebecca", "pete"]
  }
}

Or, if that starts to get tedious, you can assign roles; any user who has one of those roles automatically inherits the right to administer the database. This assumes “rebecca” and “pete” each have the “local-heroes” role assigned:

{
  "admins": {
    "roles": ["local-heroes"],
    "names": []
  }
}

Readers

Now this is awesome :) — You can specify in the same way a list of usernames or roles to grant read-access to a database. If no readers are specified, everyone can read your database. This is cool, again, public databases make the world a better place.

In case you want only specific authenticated users to be able to read from your database, use the security object:

{
  "admins": {
    "roles": ["local-heroes"],
    "names": ["rebecca", "pete"]
  },
  "readers": {
    "roles": ["lolcat-heroes"],
    "names": ["simon", "ben", "james"]
  }
}

Now simon, ben and james are among your trusted readers as well as all users with the role “lolcat-heroes”.

There is no need to add names or roles from the admins section, since they automatically are also readers.
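
The read rules described above can be modelled in a few lines. This is a hypothetical sketch of the rules as stated in this article, not CouchDB’s own implementation; canRead, matches and isEmptySection are made-up names.

```javascript
// Does the user appear in a section ({names: [...], roles: [...]}) by name or role?
function matches(section, userCtx) {
  if (!section) return false;
  const names = section.names || [];
  const roles = section.roles || [];
  if (userCtx.name && names.indexOf(userCtx.name) !== -1) return true;
  return (userCtx.roles || []).some(function (role) {
    return roles.indexOf(role) !== -1;
  });
}

function isEmptySection(section) {
  return !section ||
    ((section.names || []).length === 0 && (section.roles || []).length === 0);
}

function canRead(security, userCtx) {
  // Server admins carry the reserved _admin role and may do anything.
  if ((userCtx.roles || []).indexOf("_admin") !== -1) return true;
  // No readers defined: the database is public.
  if (isEmptySection(security.readers)) return true;
  // Database admins are automatically readers.
  return matches(security.admins, userCtx) || matches(security.readers, userCtx);
}

const security = {
  admins: { roles: ["local-heroes"], names: ["rebecca", "pete"] },
  readers: { roles: ["lolcat-heroes"], names: ["simon", "ben", "james"] }
};
console.log(canRead(security, { name: "simon", roles: [] }));
console.log(canRead(security, { name: null, roles: [] }));
```

With the security object above, simon may read and the anonymous user may not; with an empty security object everybody may.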

Validation Functions or How to Control Write Access

What about restricting write access, can you just create a new property writers in the security object and do as before? — No, for this, you will be using a validation function.

CouchDB has had validation functions for quite some time, and they have always been the way of restricting write access to your database. The cool thing with validation functions is that they have full access to the document a user is trying to write as well as the user context, i.e. the username and any roles.

This allows a validation function to reject a document write because of both user-authentication (or the lack thereof) and document content or structure.

A validation function is invoked once for every document that is written to the database. It gets passed the document to be written, the previous revision of the document, if it exists, and the user context. To block a document write, the validation function needs to throw an exception. The return value is ignored. If no exception is thrown, the document write can proceed.

Here are a few examples.

Disallowing anonymous writes:

function(new_doc, old_doc, userCtx) {
  if(!userCtx.name) {
    // CouchDB sets userCtx.name only after a successful authentication
    throw({forbidden: "Please log in first."});
  }
}

Only allow writes to users with a certain role:

function(new_doc, old_doc, userCtx) {
  if(userCtx.roles.indexOf("editors") === -1) {
    // sure lovely that JavaScript doesn’t
    // have an Array.includes() method
    throw({unauthorized: "You are not an editor."});
  }
}

Only allow updates by the author (this assumes that the user sets his or her username as new_doc.name):

function(new_doc, old_doc, userCtx) {
  // the document being written is new_doc; there is no variable called doc
  if(new_doc.name != userCtx.name) {
    throw({unauthorized: "You are not the author"});
  }
}
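
To try such functions outside of CouchDB, a small harness is enough. The sketch below combines the first two checks into one function and calls it the way the text describes (new document, previous revision or null, user context). tryWrite is a hypothetical helper, not a CouchDB API; it only reports whether the function threw.

```javascript
// Combines the anonymous check and the role check from the examples above.
function validate(new_doc, old_doc, userCtx) {
  if (!userCtx.name) {
    throw({forbidden: "Please log in first."});
  }
  if (userCtx.roles.indexOf("editors") === -1) {
    throw({unauthorized: "You are not an editor."});
  }
}

// Hypothetical harness: a thrown object blocks the write, anything else lets it proceed.
function tryWrite(validateFn, newDoc, oldDoc, userCtx) {
  try {
    validateFn(newDoc, oldDoc, userCtx);
    return { ok: true };
  } catch (rejection) {
    return { ok: false, rejection: rejection };
  }
}

const anonymousWrite = tryWrite(validate, { title: "todo" }, null, { name: null, roles: [] });
const editorWrite = tryWrite(validate, { title: "todo" }, null, { name: "rebecca", roles: ["editors"] });
console.log(anonymousWrite);
console.log(editorWrite);
```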

Conclusion

Alright, this was a really long post and we should get it wrapped. We hope to have given you a good overview of the security concepts in CouchDB and enough pointers to keep you reading and experimenting.

Categories: Companies, NoSQL

Zookeeper experience

Project Voldemort Blog - Thu, 08/26/2010 - 20:16

While working on Kafka, a distributed pub/sub system (more on that later) at LinkedIn, I needed to use Zookeeper (ZK) to implement the load-balancing logic. I’d like to share my experience of using Zookeeper. First of all, for those of you who don’t know, Zookeeper is an Apache project that implements a consensus service based on a variant of Paxos (it’s similar to Google’s Chubby). ZK has a very simple, file-system-like API. One can create a path, set the value of a path, read the value of a path, delete a path, and list the children of a path. ZK does a couple of more interesting things, too: (a) one can register a watcher on a path and get notified when the children of a path or the value of a path changes, and (b) a path can be created as ephemeral, which means that if the client that created the path is gone, the path is automatically removed by the ZK server. However, don’t let the simple API fool you. One needs to understand a lot more than those APIs in order to use them properly. For me, this translated to weeks of asking the ZK mailing list (which is pretty responsive) and our local ZK experts.

To get started, it’s important to understand the state transitions and the associated watcher events inside a ZK client. A ZK client can be in one of three states: disconnected, connected, and closed. When a client is created, it’s in the disconnected state. Once a connection is established, the client is moved to the connected state. If the client loses its connection to a server, it switches back to the disconnected state. If it can’t connect to any server within some time limit, it’s eventually transitioned to the closed state. For each state transition, a state-changing event (disconnected, syncconnected and expired) is sent to the client’s watcher. As you will see, those events are critical to the client. Finally, if one performs an operation on ZK when the client is in the disconnected state, a ConnectionLossException (CLE) is thrown back to the caller. More detailed information can be found at the ZK site. Many of the subtleties of using ZK lie in dealing with those state-changing events.

The first tricky issue is related to CLE. The problem is that when a CLE happens, the requested operation may or may not have taken place on ZK. If the connection was lost before the request reached the server, the operation didn’t take place. On the other hand, it can happen that the request did reach the server and got executed there, but the connection was lost before the server could send a response back. If the request is a read or an update, one can just keep retrying until the operation succeeds. It becomes a problem if the request is a create. If you simply retry, you may get a NodeExistsException, and it’s not clear whether you or someone else created the path. What one can do is to set the value of the path to a client-specific value during creation. If a NodeExistsException is thrown, read the value back to check who actually created it. One can’t use this approach for sequential paths (a ZK feature that creates a path with a generated sequential id) though. If you retry, a different path will be created. You also can’t check who created the path, since if you get a CLE, you don’t know the name of the path that was created. For this reason, I think that sequential paths have very limited benefit, since it’s very hard to use them correctly.
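
The create-and-check-ownership trick can be sketched as follows. Everything here is hypothetical: FakeZk is an in-memory stand-in rather than the real ZooKeeper client API, and the error codes are made up. It only illustrates the retry logic for non-sequential paths.

```javascript
// In-memory stand-in for a ZooKeeper client; every name here is hypothetical.
class FakeZk {
  constructor() {
    this.nodes = new Map();
    // When true, the next create is applied but its response is "lost".
    this.dropNextResponse = false;
  }
  create(path, value) {
    if (this.nodes.has(path)) {
      throw Object.assign(new Error("node exists"), { code: "NODE_EXISTS" });
    }
    this.nodes.set(path, value);
    if (this.dropNextResponse) {
      this.dropNextResponse = false;
      throw Object.assign(new Error("connection loss"), { code: "CONNECTION_LOSS" });
    }
  }
  read(path) {
    return this.nodes.get(path);
  }
}

// Returns true if this client owns the path, false if someone else created it.
function createOwned(zk, path, clientId, maxAttempts) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      zk.create(path, clientId); // store a client-specific value
      return true;
    } catch (err) {
      if (err.code === "NODE_EXISTS") {
        // Either our earlier attempt went through or another client won:
        // the stored value tells us which.
        return zk.read(path) === clientId;
      }
      if (err.code !== "CONNECTION_LOSS") throw err;
      // Connection loss: the outcome is unknown, so retry.
    }
  }
  throw new Error("gave up after " + maxAttempts + " attempts");
}

const zk = new FakeZk();
zk.dropNextResponse = true;
const ownedByA = createOwned(zk, "/consumers/c1", "client-A", 3);
const ownedByB = createOwned(zk, "/consumers/c1", "client-B", 3);
console.log(ownedByA, ownedByB);
```

In a real client the read-back can itself fail with a connection loss, so it would need the same retry treatment.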

The second tricky issue is to distinguish between a disconnect and an expired event. The former happens when the ZK client can’t connect to the server. This is because either (1) the ZK server is down, or (2) the ZK server is up, but the ZK client is partitioned from the server or it is in a long GC pause and can’t send the heartbeat in time. In case (1), when the ZK server comes back, the client watcher will get a syncconnected event and everything is back to normal. Surprisingly, in this case, all the ephemeral paths and the watchers are still kept at the server and you don’t have to recreate them. In case (2), when the client finally reconnects to the server, it will get back an expired event. This implies that the server thinks the client is dead and has taken the liberty to delete all the ephemeral paths and watchers created by that client. It’s the responsibility of the client to start a new ZK session and to recreate the ephemeral paths and the watchers.

To deal with the above issues, one has to write additional code that keeps track of the ZK client state, starts a new session when the old one expires, and handles the CLE appropriately. For my application, I find the ZKClient package quite useful. ZKClient is a wrapper of the original ZK client. It maintains the current state of the ZK client, hides the CLE from the caller by retrying the request when the state is transitioned to connected again, and reconnects when necessary. ZKClient has an Apache license and has been used in Katta for quite some time. Even with the help of ZKClient, I still have to handle things like who actually created a path when a NodeExistsException occurs and re-registering after a session expires.
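
The bookkeeping described above boils down to a small state machine. The following is a hypothetical sketch, not ZKClient’s code: SessionTracker and its two callbacks are made-up names, and the event names follow the ones used in this post.

```javascript
// Tracks client state and recreates ephemeral paths after a session expiry.
class SessionTracker {
  constructor(startSession, createEphemeral) {
    this.startSession = startSession;       // callback: open a new ZK session
    this.createEphemeral = createEphemeral; // callback: (re)create one ephemeral path
    this.state = "disconnected";
    this.ephemeralPaths = [];
    this.needsRecreate = false;
  }
  registerEphemeral(path) {
    this.ephemeralPaths.push(path);
  }
  onEvent(event) {
    switch (event) {
      case "disconnected":
        this.state = "disconnected";
        break;
      case "expired":
        // The server has dropped our ephemeral paths and watchers.
        this.state = "closed";
        this.needsRecreate = true;
        this.startSession();
        this.state = "disconnected"; // the new session is not connected yet
        break;
      case "syncconnected":
        this.state = "connected";
        if (this.needsRecreate) {
          this.ephemeralPaths.forEach((path) => this.createEphemeral(path));
          this.needsRecreate = false;
        }
        // After a plain reconnect the server still holds our paths: nothing to redo.
        break;
      default:
        throw new Error("unknown event: " + event);
    }
  }
}

let sessionsStarted = 0;
const recreated = [];
const tracker = new SessionTracker(
  () => { sessionsStarted += 1; },
  (path) => { recreated.push(path); }
);
tracker.registerEphemeral("/brokers/1");
["syncconnected", "disconnected", "syncconnected"].forEach((e) => tracker.onEvent(e));
const recreatedAfterReconnect = recreated.length;
["disconnected", "expired", "syncconnected"].forEach((e) => tracker.onEvent(e));
console.log(recreatedAfterReconnect, recreated, sessionsStarted, tracker.state);
```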

Finally, how do you test your ZK application, especially the various failure scenarios? One can use utilities like “ifconfig down/up” to simulate network partitioning. Todd Lipcon’s Gremlins seems very useful too.

Categories: NoSQL

Final, official GSoC Django NoSQL status update

Django nonrel / NoSQL blog - Sun, 08/22/2010 - 11:15

Alex Gaynor has posted a final status update on his Google Summer of Code (GSoC) project which should bring official NoSQL support to Django. Basically, Django now has a working MongoDB backend (not to be confused with the MongoDB backend for Django-nonrel: django-mongodb-engine) and (after lots of skepticism) the ORM indeed needed only minor changes to support non-relational backends (surprise, surprise ;). There are still a few open design issues, but probably the ORM changes will be merged into trunk and the MongoDB backend will become a separate project.

The biggest design issue (in my opinion) is how to handle AutoField. In the GSoC branch, non-relational model code would always need a manually added NativeAutoField(primary_key=True) because many NoSQL DBs use string-based primary keys. As you can see in Django-nonrel, a NativeAutoField is unnecessary. The normal AutoField already works very well and it has the advantage that you can reuse existing Django apps unmodified and you don't need a special NativeAutoField definition in your model. Hopefully this issue will get fixed before official NoSQL support is merged into trunk.

Another issue is about efficiency: In the GSoC branch, save() first checks whether the entity already exists in the DB by doing ...filter(pk=self.pk).exists() and then it decides whether to do an insert() or update() on the DB. Since non-relational DBs normally don't need to distinguish between inserts and updates we could just always call insert(). That would remove an unnecessary query from every save().

The final issue primarily affects App Engine's transaction support: When you delete() an entity Django will also delete all entities that point to that entity (via ForeignKey). This won't work in an App Engine transaction because it would access multiple entity groups. Also, this operation can take very long when batch-deleting multiple entities (via QuerySet.delete()). In the worst case it will cause DeadlineExceededErrors. The solution would be to allow the backend to handle the deletion. This way the App Engine backend (djangoappengine) could delegate the deletion to a background task.

For Django 1.3 it's probably sufficient to only handle the AutoField issue. This doesn't affect App Engine, though, so independent of that we'll port our App Engine backend to Django trunk once the GSoC branch has been merged. This means you will only need Django-nonrel if you want to use App Engine transactions. In all other cases you can use djangoappengine with the official Django release! Isn't this exciting? Maybe some of you have waited for official NoSQL support before porting your model code, and now the time has come? What do you think? I'd love to hear your comments.

Categories: Blogs, NoSQL

The Kamikaze version 3.0.0 is released

Project Voldemort Blog - Sat, 08/21/2010 - 03:40

Kamikaze is a utility package wrapping set implementations on sorted integer arrays. Search indexes, graph algorithms and certain sparse matrix representations tend to make heavy use of sorted integer arrays.

For example, in search engines, for each term t, the index (also called the inverted index) contains an inverted list, which is typically a sequence of sorted integer document IDs (and other information which can also be considered as sequences of integers). Thus, inverted index compression techniques are concerned with compressing sequences of sorted integers.

A graph is often implemented as adjacency lists. In many cases, each list can be easily organized as a sorted integer array. For example, for the social graphs in large-scale social networks like LinkedIn or Facebook, each list is, for a particular member, a sequence of all his friends (represented as integer member IDs). The performance of many algorithms on such graphs is thus greatly affected by the efficiency of various operations on such lists. For example, in order to find all common friends of two members, we need to intersect the member IDs in their friend lists.
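
The common-friends example is a plain intersection of two sorted arrays. Here is a minimal sketch of the textbook two-pointer merge; it is not taken from Kamikaze, and intersectSorted is a hypothetical name. It assumes both inputs are sorted in ascending order without duplicates.

```javascript
// Walk both sorted arrays in lockstep; runs in O(a.length + b.length).
function intersectSorted(a, b) {
  const result = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] < b[j]) {
      i++;            // a[i] cannot be in b any more
    } else if (a[i] > b[j]) {
      j++;            // b[j] cannot be in a any more
    } else {
      result.push(a[i]);
      i++;
      j++;
    }
  }
  return result;
}

// Friend lists of two members as sorted member IDs.
const commonFriends = intersectSorted([2, 5, 8, 13, 21], [3, 5, 13, 34]);
console.log(commonFriends); // [ 5, 13 ]
```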

A matrix can be considered as an alternative implementation of a graph, especially when most nodes are directly connected with each other. However, when the matrix is sparse (which is very common for the first or second degree friends in social graphs), it is more efficient to first convert it into adjacency lists and then perform the various operations on the resulting lists.

In the above applications (large-scale search engines or social networks), we often need to process a huge amount of data (arrays of integers) within milliseconds. The data often needs to be compressed to be held in main memory. Compression also greatly reduces disk and network traffic, since much less data needs to be transferred. We also need to be able to decompress the data very efficiently to maximize, for example, the query throughput of search engines. To achieve these goals, large search engines have tried many methods. For example, Lucene uses variable-byte coding (please refer to Managing Gigabytes for various inverted index compression methods) to compress indexes. Google also used variable-byte coding to encode part of its indexes a long time ago and has lately switched to other compression methods (in my opinion, their new method is a variation of PForDelta, which is also implemented in Kamikaze and optimized in Kamikaze version 3.0.0). It is therefore very important to build Kamikaze on top of a compression method that achieves both a small compressed size and fast decompression.
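
As a rough illustration of variable-byte coding, the Python sketch below stores the gaps between consecutive sorted document IDs, using seven payload bits per byte and the high bit as a "more bytes follow" flag. The function names and the byte layout are assumptions chosen for this example; it is not claimed to match Lucene's or Google's exact format.

```python
def encode_varbyte(n):
    """Encode one non-negative integer as a variable-length byte string."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)                      # last byte has the high bit clear
    return bytes(out)


def compress(doc_ids):
    """Delta-encode a sorted list of IDs, then variable-byte encode the gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_varbyte(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decompress(data):
    """Invert compress(): rebuild the sorted IDs from the encoded gaps."""
    doc_ids = []
    current = 0          # running document ID
    value = 0            # gap being assembled
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7   # continuation byte, keep reading
        else:
            current += value
            doc_ids.append(current)
            value = 0
            shift = 0
    return doc_ids


doc_ids = [5, 130, 131, 1000, 70000]
packed = compress(doc_ids)
print(len(packed))                   # 8 bytes instead of 20 for 32-bit integers
print(decompress(packed) == doc_ids)
```

Small gaps cost a single byte, which is why sorting the IDs and storing differences pays off.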

Kamikaze implements the PForDelta compression algorithm (also called P4Delta), which was recently studied and has been shown by paper[1] and paper[2] to achieve the best trade-off between compression ratio and decompression speed for the inverted indexes of search engines. Many other techniques for inverted index compression have been studied in the literature; see Managing Gigabytes for a survey, and paper[2] and paper[3] for very recent work, especially the detailed performance comparison between most of those techniques and PForDelta. Unfortunately, Lucene does not currently support PForDelta, although paper[2] and paper[3] have shown that PForDelta can achieve much better performance than variable-byte coding in terms of both compressed size and decompression speed.
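
The core idea behind PForDelta can be sketched in a few lines of Python: pick a bit width b that is large enough for most gaps in a block, store the low b bits of every gap in a fixed-width slot, and patch the few values that do not fit from a separate exception list. Everything below is a simplified illustration under stated assumptions: the 90% threshold, the function names and the exception layout are made up for this example, the slots are plain Python integers rather than packed bits, and none of it reflects Kamikaze's actual implementation.

```python
def choose_bit_width(values, coverage_percent=90):
    """Smallest b such that roughly coverage_percent of the values fit in b bits.

    Assumes `values` is a non-empty list of non-negative integers.
    """
    ordered = sorted(values)
    index = max(0, len(ordered) * coverage_percent // 100 - 1)
    return max(1, ordered[index].bit_length())


def pfor_encode(values, coverage_percent=90):
    """Split values into b-bit slots plus a list of (position, high bits)."""
    b = choose_bit_width(values, coverage_percent)
    mask = (1 << b) - 1
    slots = [v & mask for v in values]   # low b bits of every value
    exceptions = [(i, v >> b) for i, v in enumerate(values) if v > mask]
    return b, slots, exceptions


def pfor_decode(b, slots, exceptions):
    """Rebuild the original values by patching the exceptions back in."""
    values = list(slots)
    for position, high_bits in exceptions:
        values[position] |= high_bits << b
    return values


gaps = [3, 1, 7, 2, 5, 4, 900, 6, 2, 1]   # hypothetical gaps between doc IDs
b, slots, exceptions = pfor_encode(gaps)
print(b, exceptions)                       # 3 [(6, 112)]
print(pfor_decode(b, slots, exceptions) == gaps)
```

Because almost every slot has the same small width, decoding the common case is a tight loop without per-value branching, which is where the decompression speed comes from; the rare exceptions are fixed up afterwards.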

Kamikaze builds a platform on top of PForDelta to perform efficient set operations and inverted list compression/decompression. Kamikaze version 3.0.0 inherits the architecture of the first two versions and supports the same APIs. In version 3.0.0, the PForDelta algorithm is highly optimized, so the performance of compression/decompression and of the corresponding set operations is significantly improved.

At LinkedIn, Kamikaze has been used by the distributed graph team and the search team. We are also looking forward to contributing to the Lucene community with Kamikaze, especially the optimized PForDelta compression algorithm.

Categories: NoSQL

MongoDB and Twilio Contest

The MongoDB NoSQL Database Blog - Wed, 08/18/2010 - 18:49

MongoDB and Twilio are teaming up this week to do a contest! You have until midnight on August 22nd to create an application using Twilio and MongoDB. The best application wins:

  • A netbook
  • $100 of Twilio credit
  • MongoDB Timbuk2 Laptop Bag (the only other ways to get one are to be a major contributor or write a MongoDB book)
  • MongoDB T-shirt
  • MongoDB Coffee Mug
  • MongoDB Stickers

It’s a very open ended contest: you can create any application, so long as it uses MongoDB to store some of its data and the Twilio API. When you’re done, submit it on the Twilio website. Some resources to get you started:

Twilio
MongoDB

Good luck!

Categories: NoSQL

Don't Reinvent The Wheel by Josh Berkus @ CouchCamp

Couchio - About CouchDB - Wed, 08/18/2010 - 18:20

The talk description that went out in some recent PR about Josh Berkus at CouchCamp wasn’t quite accurate, my bad.

Here is a description of the talk Josh will be giving at CouchCamp that we are all very much looking forward to.

Don’t Reinvent The Wheel

CouchDB, as a new database, is doing a lot of new cool stuff. But bucking the conventional wisdom doesn’t mean that you need to be ignorant of database history; the older databases like PostgreSQL have decades of experience which the Couch community can profitably steal from. This talk will touch on issues like scaling, security, complex queries, data architecture, optimization, upgrades, long-term maintenance, and standards that developers and users of Couch should be thinking of for the future of the database.

Categories: Companies, NoSQL

Using Sass with django-mediagenerator

Django nonrel / NoSQL blog - Tue, 08/17/2010 - 13:05

This is the second post in our django-mediagenerator series. If you haven't read it already, please read the first post before continuing: django-mediagenerator: total asset management

What is Sass?

Great that you ask. :) Sass is a high-level language for generating CSS. What? You still write CSS by hand? "That's so bourgeois." (Quick: Who said that in which TV series?) Totally. ;)

Sass to CSS is like Django templates to static HTML. Sass supports variables (e.g.: $mymargin: 10px), reusable code snippets, control statements (@if, etc.), and a more compact indentation-based syntax. You can even use selector inheritance to extend code that is defined in some other Sass file! Also, you can make computations like $mymargin / 2 which can come in very handy e.g. for building fluid grids. Let's see a very simple example of the base syntax:

.content
  padding: 0
  p
    margin-bottom: 2em
  .alert
    color: red

This produces the following CSS code:

.content {
  padding: 0;
}
.content p {
  margin-bottom: 2em;
}
.content .alert {
  color: red;
}

So, nesting can help reduce repetition and the cleaner syntax also makes Sass easier to type and read. Once you start using the advanced features you won't ever want to go back to CSS. But please do yourself a favor and don't use the ugly alternative SCSS syntax. :)

Do I have to convert my CSS by hand?

Nah, you don't have to do it by hand. One solution is to just go to the css2sass website and paste your CSS code there. The website will convert everything to nice Sass source. Alternatively, you can convert the CSS file using the sass-convert command line tool which comes pre-installed with Sass. Let's pretend you want to convert "style.css" to "style.sass":

sass-convert style.css style.sass

It's that easy.

Yes, I want to use it now!

Enough talk. Let's say you have a Sass file called "design.sass" and just for the fun of it you also have a CSS file from a jQuery plugin called "jquery.plugin.css". Let's combine everything into a "main.css" bundle:

# settings.py

MEDIA_BUNDLES = (
    ('main.css',
        'design.sass',
        'jquery.plugin.css',
    ),
)

It couldn't be easier. The media generator detects Sass files based on their file extension, so you just have to name the file. But here comes the best part: In your Sass files you can use the handy @import statement to import other files and the media generator will automatically keep track of all dependencies. So, whenever you change one of the dependencies the main Sass file gets recompiled on-the-fly. This is very much like the "sass --watch" development mode, but with the nice advantage that you don't even have to remember to start the Sass command. Barney Stinson would say: "It's legendary!" :)

So, if you want to feel legendary don't wait for it and quickly install Sass and give it a try with django-mediagenerator. It'll boost your CSS productivity to new heights.

Categories: Blogs, NoSQL

Guest blog post from Max Ogden, creator of PDX API We posted a case study today on PDX API, which is...

Couchio - About CouchDB - Tue, 08/17/2010 - 07:41

Guest blog post from Max Ogden, creator of PDX API

We posted a case study today on PDX API, which is a JSON API that provides access to data from CivicApps, the open data initiative by the City of Portland. Max worked with us to write a blog post that gives some background information about working with government geo data. We really appreciate Max taking the time to help share his use of CouchDB with the community.

Now for Max’s blog post


Portland, PDX API and GeoCouch


In the fall of 2009 the City of Portland graciously hosted the volunteer organized WhereCamp conference at Metro headquarters. Metro is a regional government organization that, among other things like operating many local parks and the zoo, acts as a data warehouse for the 45+ municipalities in the greater Portland area. WhereCamp is a geo un-conference, where instead of lectures there are group discussions on proposed topics. One of the highlights was a brainstorming session led by Metro employees regarding how Metro can release their datasets to the public.

Since Sam Adams, the City of Portland’s current ‘younger, tech savvy’ mayor, took office, the idea of a city wide open data initiative had been drifting closer to reality. Metro is predictably bureaucratic, and there was apprehension within Metro about releasing data directly onto the internet. They didn’t want to have to spend a significant amount of money to engineer some sort of web infrastructure for hosting their vast collection of geo data. I have found that Metro’s concerns are also echoed throughout the region at all levels of government.

The solution came in the form of a joint effort amongst all of the major civic entities in Portland (such as Portland public schools, Metro, public transportation, etc). They started hosting around 100 raw GIS files of various shapes and sizes from CivicApps (http://www.civicapps.org), a website that they created for the initiative. The datasets themselves range in size from lists of bicycle parking racks to outlines of city parks and neighborhood boundaries.

After downloading some datasets and attempting to interact with the data it quickly became evident that the workflow was incredibly clunky. Most of the datasets are created and released as a Shapefile, which is a proprietary desktop GIS format. Expensive desktop GIS software is capable of analyzing the geo data in a Shapefile in many amazing ways, but it isn’t easy to extract the data out for other use cases. Shapefiles are the GIS equivalent of a Word document. Sure, the useful data is in there somewhere, but it’s buried deep within years of vestigial formatting bloat. CivicApps provides raw data downloads of entire datasets, which is a good start, but the same data could be distributed in a more efficient and accessible manner. Public geographic data ought to live on a server, but be accessed à la carte in small, efficient and interesting chunks rather than the entire dataset at a time.

For example, one of the datasets available on CivicApps contains all restaurants in the Portland area. I wanted to see which restaurants were near my house, but in order to find the handful of restaurants in my neighborhood out of the 3300 listed it took countless open source data conversion utilities, hours of reading documentation and many cups of coffee. After going through this process a few times I decided that nobody else should have to dive that deep into GIS-land in order to get at the data in a meaningful way.

There is a definite disconnect between government and open source when it comes to understanding the term ‘accessible data’. Most non-GIS developers aren’t going to want to learn how to work with a Shapefile. A great strategy for gaining adoption in open data initiatives is to build distribution tools that work around the constraints of the existing government data. Portland’s regional government, along with most GIS users at the professional level, uses Shapefiles almost exclusively. You aren’t going to convince an entire region full of career GIS developers to convert their datasets to some random open source format. It is the responsibility of the community to develop tools to convert the government’s raw data into a more usable form.

Government-level open data initiatives are increasing in frequency for a variety of reasons. I believe that they will become successful when a symbiotic relationship forms between the regional government (data suppliers) and the developer community within the region (data consumers). When local developers create applications using government data, governments save money because they no longer have to hire contractors to create the applications themselves, and citizens get to reap the benefits of applications with rich data created from government maintained datasets.

This means that you need a platform for hosting open data that is built on formats that developers already know.

GeoCouch is CouchDB plus a set of geospatial extensions. A mostly rewritten version was recently released that features a super fast R-tree spatial indexing implementation. GeoCouch didn’t take long to set up and populate with GeoJSON. Of the many formats to describe geographic data, the most ubiquitous is perhaps GeoJSON. GeoJSON is a standardized way of representing geographic data in pure JSON, and therefore you can throw any GeoJSON object at any application with a JSON parser. CouchDB uses JSON for transferring data, so GeoCouch naturally has baked-in support for GeoJSON.
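
To show what "pure JSON" means in practice, here is a small Python sketch that builds a GeoJSON Feature and round-trips it through the standard json module. The bicycle rack, its coordinates and its properties are invented for illustration and do not come from the CivicApps datasets.

```python
import json

# A hypothetical bicycle rack as a GeoJSON Feature.
# GeoJSON positions are written as [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-122.6765, 45.5231]},
    "properties": {"name": "Example bike rack", "capacity": 4},
}

document = json.dumps(feature)   # the string that would travel over HTTP
parsed = json.loads(document)    # any JSON parser can read it back

print(parsed["geometry"]["coordinates"])  # [-122.6765, 45.5231]
```

No GIS library is involved at any point, which is the property that makes the format convenient for web and mobile developers.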

I created some utilities to facilitate the conversion workflow from the raw Shapefiles that come from the government to documents in a GeoCouch instance. The overall process involves dumping the Shapefiles into a PostGIS instance, exporting GeoJSON from PostGIS, and bulk importing the GeoJSON into GeoCouch. This is the process that I have found to be the most fault tolerant when performing coordinate transformations against large datasets.

Once the data has been loaded into GeoCouch it instantly turns into a nice, clean REST API for developers who want to run bounding box queries against municipal datasets from Portland. For example, anyone can ask my GeoCouch, PDXAPI (http://www.pdxapi.com), to return a list of bicycle friendly trails in any rectangular region.
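
What a bounding box query returns can be sketched in plain Python. The features and function names below are hypothetical, and the filter is a linear scan written for clarity; GeoCouch answers the same question from its spatial index on the server, and its actual request format is not shown here.

```python
def in_bbox(feature, west, south, east, north):
    """True if a GeoJSON Point feature lies inside the rectangle."""
    lon, lat = feature["geometry"]["coordinates"]
    return west <= lon <= east and south <= lat <= north


def bbox_query(features, west, south, east, north):
    """Return every feature whose point falls inside the bounding box."""
    return [f for f in features if in_bbox(f, west, south, east, north)]


def point(name, lon, lat):
    """Helper that builds a minimal GeoJSON Point feature."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }


# Made-up trail heads around Portland.
trails = [
    point("River trail", -122.670, 45.520),
    point("Hill trail", -122.720, 45.540),
    point("Far trail", -122.400, 45.700),
]

hits = bbox_query(trails, west=-122.75, south=45.50, east=-122.60, north=45.55)
print([f["properties"]["name"] for f in hits])  # ['River trail', 'Hill trail']
```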

=== Benefits ===

I initially wrote a simple proximity query server in Ruby that was able to do a proximity query and return objects from a dataset that were closest to a specified point. You could retrieve, for example, the closest 5 bus stops to your current location. The proximity lookup itself was very much brute force and didn’t use any type of spatial indexing. This was okay for a prototype, but definitely wouldn’t have scaled very far. GeoCouch’s spatial indexer, on the other hand, only has to index a particular dataset once and then subsequent lookups are incredibly snappy. GeoCouch’s R-tree is literally thousands of times faster than my own implementation. GeoCouch does the heavy lifting and lets me relax.
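
For comparison, a brute-force proximity lookup of the kind described above might look like the following Python sketch. It is not the original Ruby server; the stop data, the function names and the choice of the haversine formula are assumptions made for this example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def closest(points, lat, lon, k=5):
    """Return the k points nearest to (lat, lon) by scanning every point."""
    return sorted(
        points,
        key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"]),
    )[:k]


# Made-up bus stops.
stops = [
    {"name": "A", "lat": 45.5231, "lon": -122.6765},
    {"name": "B", "lat": 45.5300, "lon": -122.6800},
    {"name": "C", "lat": 45.6000, "lon": -122.7000},
    {"name": "D", "lat": 45.5235, "lon": -122.6760},
]

nearest = closest(stops, lat=45.5230, lon=-122.6765, k=2)
print([s["name"] for s in nearest])  # ['A', 'D']
```

Every query computes a distance to every point and sorts the whole dataset, which is exactly the cost that a spatial index such as an R-tree avoids.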

Being able to offer read access to large amounts of municipal data is great, but switching to GeoCouch also lets me accept upstream changes from users, or even create entirely new databases. This is a huge win. Being able to edit documents means that you can, with minimal effort, let users edit any data in Couch in a wiki-like fashion.

There are many projects in Portland that are dedicated to sharing quantitative information about local objects and places. Urban Edibles (http://www.urbanedibles.org) lets anyone contribute the locations of publicly accessible fruit bearing trees or other edible plants. TapLister (http://www.taplister.com) lets users contribute to lists of microbrews on tap at Portland bars. PC-PDX (http://www.pc-pdx.com) is a community calendar for live music.

These are all examples of applications embracing the principles of the civic web. Whereas the social web tries to reinvent and replace real life conversation, the civic web simply tries to augment the systems we already use by encouraging efficient and convenient participation in the happenings of your neighborhood. Your neighborhood is where you work, where you raise children, and where you invest time and emotion, so tools that let people more proactively engage in their communities have a more wholesome and positive long term impact on those communities than social media does.

Don Park (@donpdonp) and I have been working on an example CouchApp for manipulating data in GeoCouch. We’ve adapted the CouchApp to be a wiki for food cart and truck information in the Portland area, available at http://www.foodcartpages.com. If someone wants to start a community database of, say, publicly accessible rope swings in front yards around Portland, they can start a new database on my GeoCouch. They can then take the source code from Food Cart Pages (http://github.com/donpdonp/foodcartpages) and adapt it to work with their new rope swing dataset.

Going a step further, I’ve created an example iPhone native application (an Android app is in the works) for manipulating GeoCouch hosted data. The rope swing dataset developer can grab a copy of my iPhone source code (http://github.com/maxogden/pdx-food-carts-mobile) and adapt it to work with their application. Instead of writing in pure Objective-C, I chose to craft the application using Titanium, a JavaScript framework for cross platform native mobile development. This means that anyone who knows JavaScript can jump in and edit the iPhone source code and tailor the application to fit their use case.

When developing the mobile application, I didn’t need a special CouchDB client-side library in order to interact with the remote data stored on GeoCouch. I was able to use the vanilla AJAX library included with Titanium and fully interact with the data stored on GeoCouch. This is a great example of the importance of ubiquitous patterns like REST that are present in the design of CouchDB.

GeoCouch happily acts as the centralized data store for both the web based CouchApp version of Food Cart Pages and the iPhone application. When a user on an iPhone takes a photo of a food cart’s menu and uploads it, anyone else consuming data from GeoCouch will see the new photo. As a bonus for using CouchDB, I get free support for conflict resolution and an easy way to store old revisions of documents.

At the end of the day, GeoCouch has, in a few months’ time, helped me to create an ecosystem for regional government and open source developers to share information with citizens, and for those same citizens to share information back. Crafting architecture for sharing large amounts of information over the web is usually no easy task, but GeoCouch has let me focus on the big picture and not get bogged down in the details. I am hoping to enable other developers in Portland and other cities to also see the big picture and start creating applications that not only their communities enjoy, but that they themselves also enjoy.

Categories: Companies, NoSQL

CouchCamp Early Bird Registration Closes Tomorrow, August 17th

Couchio - About CouchDB - Mon, 08/16/2010 - 11:26

Hey all, this is just a quick reminder that our early bird sales for CouchCamp close tomorrow, August 17th. Trust us, you want to be there :)

CouchCamp is ideal for anyone interested in learning more about CouchDB, including developers, administrators and business users. The three-day camp will include speaking sessions from Damien Katz, creator of CouchDB, Selena Deckelman, founder of Open Source Bridge, Ted Leung, director of advanced technology at Disney and Stuart Langridge of Canonical, makers of Ubuntu Linux. There will also be unconference sessions led by conference participants.

For more information or to register, visit: http://www.couch.io/couchcamp.

See you there!

Categories: Companies, NoSQL

nonrel-search updates: auto-completion and separate indexing

Django nonrel / NoSQL blog - Wed, 08/11/2010 - 15:50

We had long planned to port some remaining features from gae-search to nonrel-search, and since we stopped developing gae-search we decided to make some of its premium features open source. So let's see what changed.

New Features

We basically changed two things in nonrel-search. First, it's possible to index a model via a separate definition, i.e. without having to modify the model's source itself. Second, you can use our auto-completion feature from the good old gae-search days. :)

Separate indexing

So let's say you want to index some of your models. With the old version of nonrel-search you had to add a SearchManager to each model you want to search for. With the latest version of nonrel-search you have to define these indexes separately from your model like this:

# post.models
from django.db import models

class Post(models.Model):
    title = models.CharField(max_length=500)
    content = models.TextField()
    author = models.CharField(max_length=500)
    category = models.CharField(max_length=500)

# post.search_indexes
import search
from search.core import porter_stemmer
from post.models import Post

# index used to retrieve posts using the title, content or the
# category.
search.register(Post, ('title', 'content','category', ),
    indexer=porter_stemmer)

As you can see we use the new register function to make posts searchable by title, content and category leaving the author aside. The first parameter defines the model you want to index. The remaining parameters are just the same as for the old SearchManager. The register function automatically adds an index called 'search_index'. Of course it's possible to add multiple such search indexes, just register more of them and pass in a name for the index:

# post.search_indexes
...
search.register(Post, ('category', 'title' ),
    indexer=porter_stemmer,
    search_index='second_search_index')

Here we define a new index called 'second_search_index'.

In addition to defining the indexes we have to make sure that the register function will be executed. Nonrel-search provides a function called autodiscover which automatically searches the INSTALLED_APPS for "search_indexes.py" modules and registers all search indexes.

# search for "search_indexes.py" in all installed apps
import search
search.autodiscover()

urlpatterns = patterns(...)
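
A discovery step of this kind can be sketched with the standard library alone. The code below is only an illustration of the general pattern, not nonrel-search's actual implementation: the function signature and its return value are assumptions, and the demo uses standard-library packages as stand-ins for Django apps so that it runs anywhere.

```python
import importlib


def autodiscover(installed_apps, module_name="search_indexes"):
    """Import `<app>.<module_name>` for every app that provides it.

    Importing the module is what runs the register() calls inside it.
    Returns the names of the apps whose module was found.
    """
    loaded = []
    for app in installed_apps:
        try:
            importlib.import_module("%s.%s" % (app, module_name))
        except ImportError:
            continue  # this app has no such module
        loaded.append(app)
    return loaded


# Stand-in demo: look for a "decoder" submodule in a few stdlib packages.
print(autodiscover(["json", "email", "os"], module_name="decoder"))  # ['json']
```

One caveat of this simple version is that it also hides import errors raised inside an existing search_indexes module; a more careful variant would only skip apps where the module is truly missing.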

You should call autodiscover in "urls.py" to make sure that your indexes get loaded. Now it's possible to search for models using the newly added search function:

from search.core import search
posts = search(Post, 'Hello world')

search takes two arguments: the first specifies the model to search and the second specifies the query. By default, search uses the index 'search_index'. If you want to use a different index, just pass in the name of the desired index:

from search.core import search
# use the second index registered above instead of the default one
posts = search(Post, 'Hello world', search_index='second_search_index')

Defining indexes separately from the model definition is especially useful for already existing Django apps so you can make them searchable without having to modify their source code. For example, it's possible to make users searchable via their first name and last name just by adding a search index in a separate module.

Auto-completion or "suggest-as-you-type"

Auto-completion is the first premium feature we make open-source. Let's say you want to add auto-completion for category names while creating posts. Nonrel-search makes this an easy job. In order to do so you first have to register a search index which uses the startswith indexer:

# post.search_indexes
...
# auto-completion index used to suggest categories
search.register(Post, ('category', ), indexer=startswith,
    search_index='autocomplete_index')
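
One plausible way a "starts with" index can support suggest-as-you-type is to store every prefix of each indexed word, so that a lookup becomes a plain equality match on what the user has typed so far. The Python sketch below illustrates that idea only; whether nonrel-search's startswith indexer works this way internally is an assumption, and all names here are hypothetical.

```python
def prefixes(value):
    """Return every prefix of each lower-cased word in `value`."""
    result = set()
    for word in value.lower().split():
        for end in range(1, len(word) + 1):
            result.add(word[:end])
    return result


def build_index(categories):
    """Map each prefix to the set of categories it can complete to."""
    index = {}
    for category in categories:
        for prefix in prefixes(category):
            index.setdefault(prefix, set()).add(category)
    return index


def suggest(index, typed):
    """Return the sorted categories matching what has been typed so far."""
    return sorted(index.get(typed.lower(), set()))


index = build_index(["Django", "Databases", "NoSQL"])
print(suggest(index, "d"))    # ['Databases', 'Django']
print(suggest(index, "no"))   # ['NoSQL']
```

The trade-off is index size: longer values produce more prefixes, in exchange for lookups that need no range or LIKE queries.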

Then you can use the LiveSearchField which can be integrated into forms to define an auto-completion form:

# post.forms
from django import forms
from post.models import Post
from search.forms import LiveSearchField

class CreatePostForm(forms.ModelForm):
    category = LiveSearchField('/post/live_search/')

    class Meta:
        model = Post

Just pass LiveSearchField the URL to the auto-complete view which retrieves your posts. You can also configure the auto-completion behavior with additional parameters. See the documentation for more information. You can then use this form in your templates to display an auto-completed input field. So, the only thing left is a view returning the data required for auto-completion and to include the necessary Javascript / CSS files into your html:

# post.views
from post.models import Post
from search.views import live_search_results

def live_search(request):
    return live_search_results(request, Post, search_index='autocomplete_index',
        result_item_formatting=
            lambda post: {'value': u'<div>%s</div>' % (post.category),
                          'result': post.category, })

Here, we use the function live_search_results and pass in the model class to search on, the name of the index to use for searching (default is 'search_index'), and a formatting function which specifies how your auto-completed posts will be displayed. Note that this function returns a dictionary having two items: 'value' specifies how to display your posts and 'result' specifies what to put into the input field when selecting a post from the auto-completed results list.

If you don't specify any result_item_formatting function, 'value' will be the escaped value of the first indexed property and 'result' will be the unescaped value of the same property.
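
The default behaviour described above can be pictured with the following Python sketch. It is a hypothetical stand-in written for illustration, not the library's code: the function name, its arguments and the use of html.escape are assumptions.

```python
from html import escape


def default_result_item(obj, first_indexed_field):
    """Build the default suggestion dict for one search result.

    'value' is the HTML-escaped text shown in the suggestion list,
    'result' is the raw text placed into the input field on selection.
    """
    raw = str(getattr(obj, first_indexed_field))
    return {"value": escape(raw), "result": raw}


class FakePost:
    """Minimal stand-in for a model instance."""
    category = "Q&A <beta>"


print(default_result_item(FakePost(), "category"))
# {'value': 'Q&amp;A &lt;beta&gt;', 'result': 'Q&A <beta>'}
```

Escaping only the displayed value keeps markup in user data from being rendered in the suggestion list while leaving the text in the input field unchanged.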

Note that request has to include a GET parameter 'query'.

Here is the remaining code you have to add so that all necessary Javascript / CSS files will be included using the django-mediagenerator:

# settings
# list your css and js data here
MEDIA_BUNDLES = (
    ('main.css',
        ...,
        'search/jquery.autocomplete.css',
        'search/search.css',
    ),
    ('main.js',
        ...,
        'jquery.js',
        'jquery.livequery.js',
        'search/jquery.autocomplete.js',
        'search/autocomplete_activator.js',
    ),
)

and in your templates add this:

...
{% block css %}
  {% include_media 'main.css' %}
{% endblock %}

{% block js %}
  {% include_media 'main.js' %}
{% endblock %}

You can download the nonrel-search-testapp to get started and play around with nonrel-search. If you create some nice app using nonrel-search please let us know.

Categories: Blogs, NoSQL

Keyspace 1.8: Now features key expiry

Keyspace 1.8 is out, featuring key expiry in all the client libs: Python, Java, .NET, PHP, Ruby, Perl, HTTP!

This enables Keyspace to be used instead of Memcached in some use-cases. Unlike Memcached, Keyspace stores key expiries safely on disk, so keys are expired even if servers restart.

Key expiry commands are implemented as an overlay feature. This means that when you set an expiry on a key, Keyspace does not check whether the key exists; it just remembers that it should expire (delete) that key at the given time. You can create the key at a later time, overwrite it, rename it, delete it, or re-create it; none of these operations affect the expiry.

There are three key expiry commands in Keyspace:

  • set_expiry(k, t): set the key-value pair k => v to expire in t seconds
  • remove_expiry(k): remove any outstanding expiries on the key-value pair k => v
  • clear_expiries(): remove all outstanding expiries in the database

For the actual commands check the section of the documentation corresponding to your programming language.

Key expiry is implemented as an overlay feature to enable developers to mix and match these commands with the regular Keyspace commands to match their desired semantics. For example, if a developer thinks that setting (changing) a key-value pair should automatically remove any expiries, he can create a wrapper library which issues remove_expiry(k) command before issuing set(k, v).
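
Such a wrapper could look roughly like the Python sketch below. The class names are hypothetical; the only client methods assumed are set and remove_expiry, and a small recording stand-in replaces a real Keyspace client so the example runs without a server.

```python
class ResetExpiryOnSet:
    """Wrap a client so that every set() first drops any pending expiry."""

    def __init__(self, client):
        self._client = client

    def set(self, key, value):
        self._client.remove_expiry(key)
        return self._client.set(key, value)

    def __getattr__(self, name):
        # Delegate every other attribute to the wrapped client.
        return getattr(self._client, name)


class RecordingClient:
    """Stand-in for a real client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def remove_expiry(self, key):
        self.calls.append(("remove_expiry", key))

    def set(self, key, value):
        self.calls.append(("set", key, value))


client = ResetExpiryOnSet(RecordingClient())
client.set("foo1", "bar1")
print(client.calls)  # [('remove_expiry', 'foo1'), ('set', 'foo1', 'bar1')]
```

Note that the wrapper issues two separate commands, so in this sketch the pair is not atomic.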

When using key expiry commands in replicated mode, you should use NTP (Network Time Protocol) to synchronize the servers' clocks. Note that other than key expiries, Keyspace does not require or assume clock synchrony. When the Keyspace master receives a set_expiry(k, t) command, it adds t seconds to the current timestamp and replicates that timestamp. Key expiry will occur at that time by the actual master's system clock. If the master fails and another node becomes the master, key expiry will still occur at that time, but by the new master's system clock.
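
The overlay semantics and the use of an absolute timestamp can be modelled with a toy in-memory store. This is a sketch for intuition only and not how Keyspace is implemented: the class, the explicit now arguments and the tick method are all invented for this example, with the clock passed in by hand so that the behaviour is deterministic.

```python
class ToyStore:
    """In-memory model of overlay key expiry (not the Keyspace implementation)."""

    def __init__(self):
        self.data = {}
        self.expiries = {}   # key -> absolute expiry time in seconds

    def set(self, key, value):
        self.data[key] = value   # never touches the expiries

    def set_expiry(self, key, seconds, now):
        # The key does not have to exist; only the deadline is remembered.
        # Storing now + seconds mirrors replicating an absolute timestamp.
        self.expiries[key] = now + seconds

    def remove_expiry(self, key):
        self.expiries.pop(key, None)

    def clear_expiries(self):
        self.expiries.clear()

    def tick(self, now):
        """Delete every key whose deadline has passed on this node's clock."""
        for key in [k for k, t in self.expiries.items() if t <= now]:
            del self.expiries[key]
            self.data.pop(key, None)


store = ToyStore()
store.set_expiry("foo1", 10, now=100)   # key does not exist yet
store.set("foo1", "bar1")               # creating it keeps the expiry
store.set("foo2", "bar2")
store.tick(now=105)
print(sorted(store.data))               # ['foo1', 'foo2']
store.tick(now=110)
print(sorted(store.data))               # ['foo2']
```

Because only the absolute deadline is stored, any node that takes over can apply it against its own clock, which is why the clocks need to agree.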

Finally, some Python sample code which illustrates key expiry in action:

import keyspace
import time

# connect to a single keyspace instance
client = keyspace.Client(["localhost:7080"])
# clear all keys
client.prune("")
# clear all expiries
client.clear_expiries()
# create some keys
client.set("foo1", "bar1")
client.set("foo2", "bar2")
client.set("foo3", "bar3")
# expire foo1 and foo2 in 10 seconds
client.set_expiry("foo1", 10)
client.set_expiry("foo2", 10)
# changed my mind about foo2
client.remove_expiry("foo2")
# sleep for 10 seconds
time.sleep(10)
# now list db
kv = client.list_key_values("")
for k in kv:
    print("%s => %s" % (k, kv[k]))
# foo2 and foo3 are listed

Categories: NoSQL, Open Source

Press Release: Apache CouchDB Now Available on Google Android

Couchio - About CouchDB - Tue, 08/10/2010 - 19:51

Developers can now build web or native mobile applications taking advantage of CouchDB’s peer-to-peer sync

Oakland, CALIF. – August 10, 2010 – Couchio (http://www.couch.io/), corporate sponsor of the CouchDB post-relational database, today announced that the first release of a CouchDB SDK for Android devices is now available for free download. Designed to take full advantage of CouchDB’s peer-to-peer sync facilities, CouchDB for Android allows developers to build web or native applications that work even if the Internet connection is slow, intermittent or completely down. With continuous access to a local copy of data, developers can leverage their existing knowledge about web technologies to quickly build collaborative business applications on mobile devices.

CouchDB for Android allows shared applications to work offline by automatically synchronizing between platforms, alleviating a common pain point for users. Developers no longer have to develop an application once for the web, once for each mobile platform and then synchronize between the two.

“Our goal is to provide users with a kick-ass SDK for Android devices to build web and native applications using CouchDB as the device-native data store,” said Damien Katz, creator of CouchDB and CEO of Couchio. “CouchDB now makes sync ubiquitous and part of the mobile computing fabric.”

With CouchDB on Android, developers can build applications and access their data freely across devices, desktops or in the cloud, regardless of the network. Palm has already announced that the next version of its webOS will include services for syncing local data with CouchDB.

For more information about CouchDB on Android, or to download it for free, visit http://www.couch.io/android. From an Android device, interested parties can directly install CouchDB through the Android Marketplace.


About Couchio

Couchio (http://www.couch.io/), co-founded by the creator of Apache CouchDB, is the commercial CouchDB company providing services, support, training and hosting. CouchDB is an open source database designed for the reporting and storage of large amounts of semi-structured, document oriented data, unlike SQL databases, which store and report on very structured and correlated data. CouchDB changes the way document-based applications are built, benefiting from the cloud while also keeping data available at the network edges via replication. Couchio has received venture funding from Redpoint Ventures.

Media Contact:

Ray George

Page One PR

650-922-3825

ray@pageonepr.com

Categories: Companies, NoSQL

Because we just couldn’t not do it.

Couchio - About CouchDB - Fri, 08/06/2010 - 19:22


Because we just couldn’t not do it.

Categories: Companies, NoSQL

RelaxBack for Thursday 8/5/2010

Couchio - About CouchDB - Fri, 08/06/2010 - 00:58

Upcoming Events

Tonight! 7pm socal.js Mikeal will be talking about CouchDB and node.js.

CouchCamp tickets are on sale for $500 (all inclusive) until August 17th.

RESTFest is going to be September 17th - 18th in South Carolina.

Recent Happenings

New Case study on Migrating to CouchDB from a Relational Database.

Enzo has another post in his series about using CouchDB with Rails about understanding map/reduce.

Damien and Aaron Miller have Erlang running on iOS :)

CouchCamp attendee spotlight on Max Ogden.

Lena Herrmann handed in her Thesis on Realisation of a Distributed Application Using the Document-Oriented Database CouchDB.

jchris wrote up a new description of CouchApps.

Jan started a page for everyone to add their local CouchDB meetups.

New expanded docs on installing CouchDB on Windows.

Jobs

Couchio is hiring a ton of roles! 6 week vacation. I gotta figure out somewhere to travel to.

Categories: Companies, NoSQL

MongoDB 1.6 Released

The MongoDB NoSQL Database Blog - Thu, 08/05/2010 - 17:28

MongoDB 1.6.0 is the fourth stable major release (even numbers are “stable”: 1.0, 1.2, 1.4) and is the culmination of the 1.5 development series.

Scale-out

The focus of the 1.6 release is scale-out.  Sharding is now production-ready.  The combination of sharding and replica sets allows one to build out horizontally scalable data storage clusters with no single points of failure.

A single instance of mongod can be upgraded to a distributed cluster with zero downtime when the need arises.

A big thanks to all the 1.5.x beta testers of sharding (including foursquare and bit.ly who have been using sharding in production for a while now).

Replica Sets 

Replica sets allow you to set up a high-availability cluster with automatic failover and recovery.  Replica pair users should, when convenient, migrate to replica sets.

Other Improvements in v1.6

  • acknowledged replication: The w option (and wtimeout) forces writes to be propagated to N servers before returning success (works well with replica sets).
  • $or queries
  • Up to 64 indexes/collection
  • Improved concurrency
  • $slice operator
  • Support for UNIX domain sockets and IPv6
  • Windows service improvements
  • The C++ client is a separate tarball from the binaries
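
To make a few of the items above concrete, here is a minimal sketch of what the corresponding query and command documents look like, written as plain Python dictionaries. It does not talk to a server or use any driver; the field names ("author", "tags", "comments") and the helper function names are made up for illustration.

```python
def or_query(*clauses):
    """Filter that matches documents satisfying at least one clause ($or)."""
    return {"$or": list(clauses)}


def slice_projection(field, count):
    """Projection that returns only the first `count` items of an array field."""
    return {field: {"$slice": count}}


def get_last_error(w, wtimeout_ms):
    """Command document asking the server to wait until a write has reached
    `w` servers, giving up after `wtimeout_ms` milliseconds.

    The command name has to be the first key; Python 3.7+ dictionaries keep
    insertion order, older code would use an ordered mapping instead.
    """
    return {"getlasterror": 1, "w": w, "wtimeout": wtimeout_ms}


if __name__ == "__main__":
    # Hypothetical blog-post collection: posts by "jan" or tagged "couchdb",
    # returning only the first five comments of each post.
    print(or_query({"author": "jan"}, {"tags": "couchdb"}))
    print(slice_projection("comments", 5))
    # Wait for two servers, for at most three seconds.
    print(get_last_error(w=2, wtimeout_ms=3000))
```

How these documents are passed to the server depends on the driver you use, so check its documentation for the exact calls.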

Downloads: http://www.mongodb.org/display/DOCS/Downloads

Release Notes: http://www.mongodb.org/display/DOCS/1.6+Release+Notes

Full change log

Please report any issues to http://groups.google.com/group/mongodb-user (support forums) or http://jira.mongodb.org/ (bug/feature db).

What’s Next

Now that 1.6 is out, we’re going to be focusing on 1.8.  Help us prioritize features for this release by voting for your key needs at jira.mongodb.org.  The #1 feature queued for v1.8 is single server durability.

More Information

Please join 10gen CEO and Co-Founder Dwight Merriman for the webinar What’s New in MongoDB v1.6 on Tuesday, August 10 at 12:30pm ET / 9:30am PT.

Categories: NoSQL

django-mediagenerator: total asset management

Django nonrel / NoSQL blog - Thu, 08/05/2010 - 14:45

We really weren't posting often enough recently. Now we'll make up for it with an awesome new asset manager called django-mediagenerator. Those of you who used app-engine-patch might still remember a media generator. This one is completely rewritten with a new backend-based architecture and muchos flexibility for the shiny new HTML5 web-apps world (see the feature comparison table). In this post I'll give you a quick intro and after that I'll make another post about some crazy stuff you can do with the media generator.

Why oh why?

What is an asset manager and why do you need one? Primarily, asset managers are tools for combining and compressing your JS and CSS files into bundles, so instead of many small files your website visitors only need to download a single big JS file and a single big CSS file. This is important because request latency has a much bigger impact on your site's load times than file size. You should definitely read Yahoo's Exceptional Performance and Google's Speed pages to learn more about how to improve your site's performance.

The second important task of an asset manager is to help you with handling HTTP caches. This is done by renaming your files, such that they contain a version tag. For example "main.css" could be renamed to "main-efe88bad66a.css". Whenever the file's contents change the version tag is updated, so the browser will not use the cached version of your file, but download the updated file.
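
As a rough sketch of the versioning idea (not necessarily how the media generator computes its tags), you can derive the version tag from a hash of the file's contents. The function name and the choice of SHA-1 truncated to 11 hex digits are assumptions made for this example.

```python
import hashlib
import os.path


def versioned_name(filename, content, length=11):
    """Insert a content-derived version tag before the file extension.

    `content` is the file's bytes. Any change to the bytes yields a
    different tag, so browsers fetch the new file instead of a cached one.
    """
    digest = hashlib.sha1(content).hexdigest()[:length]
    root, ext = os.path.splitext(filename)
    return "%s-%s%s" % (root, digest, ext)


if __name__ == "__main__":
    print(versioned_name("main.css", b"body { margin: 0; }"))
    print(versioned_name("main.css", b"body { margin: 1px; }"))
```

Because the tag depends only on the contents, an unchanged file keeps its name across deployments and stays in the browser cache.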

Django already has lots of existing solutions for managing your JS and CSS files and images, so why oh why do we make yet another one? Well, they don't provide the flexibility we need:

  • Integration with Sass, PyvaScript, pyjs (the Python->JS compiler used in Pyjamas), etc.
  • Flexible backend system for other converters (CleverCSS, etc.)
  • Versioning for everything (including images)
  • Support for image spriting
  • Uncompressed and uncombined output during development with "runserver" for easier debugging
  • Works in sandboxed environments like App Engine

Similar to django-compress, django-mediagenerator stores bundles in the file system. The bundles are defined in settings.py. Some people prefer to define bundles in their templates. Why don't we define bundles in templates?

  • It doesn't work in sandboxed hosting environments like App Engine because all files have to be statically pre-generated in advance
  • It can lead to unnecessary bundles and thus slow page loads if different pages have only slightly different scripts
  • The configuration is not very flexible (you can only list a few JS/CSS files)
  • It adds unnecessary checks to every request whether file contents have changed

Even if you'd say that these definitions belong in the templates, the disadvantages are much bigger than that little increase in "comfortability".

Let's use it!

We tried our best to make it easy to use the media generator. On the development server we provide a built-in view for serving files. If you want to generate files for production you just run manage.py generatemedia. We also provide two simple template tags for referencing media files in your templates.

Let's first install the media generator. Just download and extract the source code and run setup.py install and add "mediagenerator" to your INSTALLED_APPS. Then, define your bundles in settings.py:

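import os

MEDIA_BUNDLES = (
    # (bundle name, source file, source file, ...)
    ('main.css',
        'reset.css',
        'design.css',
    ),
    ('main.js',
        'jquery.js',
        'jquery.autocomplete.js',
    ),
)

# Look for media files in the "media" folder next to settings.py
GLOBAL_MEDIA_DIRS = (os.path.join(os.path.dirname(__file__), 'media'),)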

Bundles are defined as tuples where the first entry is the bundle name and the remaining entries list the file names that should be combined. Here, we have a "main.css" bundle which combines "reset.css" and "design.css". The second bundle is named "main.js" and it combines "jquery.js" and "jquery.autocomplete.js".

In the last line we've added the "media" folder in the project's root directory to the media search path which defines where to find media files. Additionally, all "media" folders in your apps specified via INSTALLED_APPS are added to the media search path. Only the admin app is removed, by default.
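
To illustrate what combining a bundle means, here is a minimal sketch under simple assumptions: each file is looked up in a list of search directories (first match wins) and the contents are joined in the listed order. The helper names find_file and combine_bundle are hypothetical and not part of the media generator's API.

```python
import os


def find_file(name, search_dirs):
    """Return the path of `name` in the first search directory containing it."""
    for directory in search_dirs:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    raise IOError("media file not found: %s" % name)


def combine_bundle(bundle, search_dirs):
    """Split a bundle tuple into its name and files, then concatenate the files.

    The first tuple entry is the bundle name; the rest are the source files.
    """
    name, files = bundle[0], bundle[1:]
    parts = []
    for filename in files:
        with open(find_file(filename, search_dirs)) as handle:
            parts.append(handle.read())
    return name, "\n".join(parts)
```

Calling combine_bundle(('main.css', 'reset.css', 'design.css'), ['media']) would return the bundle name together with the contents of both files, reset.css first.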

You can also tell the media generator to use YUICompressor in settings.py:

ROOT_MEDIA_FILTERS = {
    'js': 'mediagenerator.filters.yuicompressor.YUICompressor',
    'css': 'mediagenerator.filters.yuicompressor.YUICompressor',
}

YUICOMPRESSOR_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)),
                                  'yuicompressor.jar')

In your project you'll need a convention for where to find the YUICompressor jar file. In this case we assume that it's in the project's parent folder and called "yuicompressor.jar".

Finally, you need to define the media urls on the production and development server in settings.py:

MEDIA_DEV_MODE = DEBUG
PRODUCTION_MEDIA_URL = '/media/'
if MEDIA_DEV_MODE:
    MEDIA_URL = '/devmedia/'
else:
    MEDIA_URL = PRODUCTION_MEDIA_URL

The media generator needs a PRODUCTION_MEDIA_URL which is used when generating production media files. Additionally, MEDIA_DEV_MODE specifies whether to use development or production URLs in your templates (more on that in the templates section).

The last step is to add the development view to your urls.py:

from django.conf.urls.defaults import *
from django.conf import settings

urlpatterns = patterns('',
    # ...
)

if settings.MEDIA_DEV_MODE:
    # Generate media on-the-fly
    from mediagenerator.urls import urlpatterns as mediaurls
    urlpatterns += mediaurls

This will automatically handle requests at MEDIA_URL. If you want to generate the media files for production use you can just run manage.py generatemedia. This will store all generated files and copy all images into the "_generated_media" folder. Also, this creates a "_generated_media_names.py" module which stores the mapping from unversioned file names to versioned file names.
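
Conceptually, the name mapping is what lets templates keep using unversioned names. The following sketch shows the idea only; the NAMES dictionary and the media_url function are assumptions for illustration and may not match the actual contents of the generated module.

```python
PRODUCTION_MEDIA_URL = "/media/"
DEV_MEDIA_URL = "/devmedia/"

# Hypothetical mapping from unversioned to versioned file names,
# similar in spirit to what "_generated_media_names.py" stores.
NAMES = {
    "main.css": "main-efe88bad66a.css",
}


def media_url(name, dev_mode, names=NAMES):
    """Resolve an unversioned media name to the URL a template should use."""
    if dev_mode:
        # Development: files are generated on the fly, no version tag needed.
        return DEV_MEDIA_URL + name
    # Production: point at the pre-generated, versioned file.
    return PRODUCTION_MEDIA_URL + names[name]


if __name__ == "__main__":
    print(media_url("main.css", dev_mode=True))
    print(media_url("main.css", dev_mode=False))
```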

Let's add media to our templates

Now you know how to define bundles, but that's pretty useless if you can't reference those bundles from within your templates. Now this time I'll make it really simple. :)

In your template you first need to add the media generator template library via

{% load media %}

Then you can include JS and CSS directly using e.g.:

{% include_media 'main.css' %}

This will automatically generate <link> and <script> tags. In development mode (MEDIA_DEV_MODE = True) your files are not combined, so this will generate multiple <script> tags instead of a single one. This is very useful because your JS tracebacks will point directly to the file that caused an exception instead of a huge spaghetti code soup file (in that case only grep can save you).
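
The difference between the two modes can be sketched as follows. This is a simplified stand-in for the real template tag, written as a plain function; the name render_include_media, the exact tag markup, and the URL prefix parameter are assumptions for this example.

```python
MEDIA_BUNDLES = (
    ("main.css", "reset.css", "design.css"),
    ("main.js", "jquery.js", "jquery.autocomplete.js"),
)


def render_include_media(bundle_name, dev_mode, url_prefix="/media/"):
    """Return the HTML tags for a bundle.

    Development mode emits one tag per source file; production mode emits
    a single tag for the combined bundle.
    """
    files = dict((bundle[0], bundle[1:]) for bundle in MEDIA_BUNDLES)[bundle_name]
    names = files if dev_mode else (bundle_name,)
    # The bundle's extension decides which kind of tag is needed.
    if bundle_name.endswith(".css"):
        template = '<link rel="stylesheet" type="text/css" href="%s" />'
    else:
        template = '<script type="text/javascript" src="%s"></script>'
    return "\n".join(template % (url_prefix + name) for name in names)


if __name__ == "__main__":
    print(render_include_media("main.js", dev_mode=True, url_prefix="/devmedia/"))
    print(render_include_media("main.js", dev_mode=False))
```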

Image URLs can be generated using e.g.:

<img src="{% media_url 'some/image.png' %}" />

URLs in CSS files

Whenever you write a URL via url(some/relative/path...) in your CSS files the URL gets rewritten to the actual generated file name. This is only done with relative URLs. Absolute URLs stay untouched (e.g., those that start with "/" or "http(s)://").
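
Here is a minimal sketch of that kind of rewriting, assuming a dictionary that maps unversioned media paths to generated names. The regular expression, the function names, and the names mapping are illustrative assumptions, not the media generator's actual code.

```python
import posixpath
import re

# Matches url(...) with optional single or double quotes around the path.
URL_RE = re.compile(r"url\(\s*(['\"]?)([^'\")]+)\1\s*\)")


def is_absolute(url):
    """Absolute URLs start with "/" or "http(s)://" and are left untouched."""
    return url.startswith(("/", "http://", "https://"))


def rewrite_css_urls(css, css_path, names, base_url="/media/"):
    """Rewrite relative url(...) references to their generated file names.

    `css_path` is the stylesheet's path inside the media folder, so that
    relative references can be resolved against its directory.
    """
    css_dir = posixpath.dirname(css_path)

    def replace(match):
        quote, url = match.group(1), match.group(2)
        if is_absolute(url):
            return match.group(0)
        resolved = posixpath.normpath(posixpath.join(css_dir, url))
        generated = names.get(resolved, resolved)
        return "url(%s%s%s%s)" % (quote, base_url, generated, quote)

    return URL_RE.sub(replace, css)


if __name__ == "__main__":
    names = {"img/bg.png": "img/bg-abc123.png"}  # hypothetical mapping
    css = "body { background: url(../img/bg.png); }"
    print(rewrite_css_urls(css, "css/main.css", names))
```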

Installation on App Engine

Add the following handler to your app.yaml:

- url: /media
  static_dir: _generated_media/

If you use Django-nonrel that's all you need.

If you use some alternative Django setup (app-engine-patch, Django helper, etc.) you'll also need to add this at the top of your main.py handler (or whatever you've called your handler in app.yaml):

import os
import sys

# Only needed on the development server
if os.environ.get('SERVER_SOFTWARE', '').lower().startswith('devel'):
    try:
        # Make the SDK's copy of subprocess importable inside the sandbox
        from google.appengine.api.mail_stub import subprocess
        sys.modules['subprocess'] = subprocess
        import inspect
        frame = inspect.currentframe().f_back.f_back.f_back
        old_builtin = frame.f_locals['old_builtin']
        subprocess.buffer = old_builtin['buffer']
    except Exception, e:
        import logging
        logging.warn('Could not add the subprocess module to the sandbox: %s' % e)

This will enable Python's subprocess module which is needed by some media generator backends.

Now it's your turn

In the repository you can find a sample project with a little CSS example. If you have installed Django it should work out-of-the-box.

This post should get you started with the most common use-cases. There's a lot more that can be done with the media generator. We'll talk about the really exciting stuff in the next media generator post.

Update: The next post is Using Sass with django-mediagenerator

Categories: Blogs, NoSQL

CouchCamp Attendee Spotlight: Max Ogden

Couchio - About CouchDB - Wed, 08/04/2010 - 21:49

Max Ogden is a programmer from Portland, OR. Max is becoming quite well known in the open government applications community with his recent work PDXAPI which won the Civic Apps award for best overall utilization of data. Max was also the first CouchCamp ticket buyer and we’re incredibly excited to have him attending.

What was your first CouchDB project?

Trying to set up the old version of GeoCouch back when it had dependencies on Python and Spatialite. I spent more time trying to get the dependencies to compile than I did actually working with any actual geographic data. When the new GeoCouch came out and it didn’t have any external dependencies I was way excited.

What are you currently working on?

I’m working on PDX API (http://pdxapi.com), a developer interface to civic geo datasets in Portland, OR. It’s a big GeoCouch instance that has a bunch of geographic datasets that mostly come from Portland’s regional government agencies.

What is your favorite part of CouchDB?

The ubiquity of JSON and JavaScript. It’s easy to get people excited about working with Couch, since a lot of developers already think RESTfully and throw JSON objects around all the time. CouchApps are also really exciting.

What are you looking forward to at CouchCamp?

Kickball in Marin county in late summer, seeing what other people are working on.

What is your favorite color?

#FFB901

What drink(s) would you like to see at CouchCamp?

Some fancy draft root beer (I don’t drink drink)

Categories: Companies, NoSQL

Diplom Thesis: Realisation of a Distributed Application Using the Document-Oriented Database CouchDB by Lena Herrmann

Couchio - About CouchDB - Wed, 08/04/2010 - 16:19

Lena Herrmann & Thesis

Major Congrats to Lena Herrmann for handing in her Diplom Thesis on Realisation of a Distributed Application Using the Document-Oriented Database CouchDB.

It’s a whopping 163 pages containing all the nitty-gritty-researchy details on why CouchDB is the number one choice for writing distributed applications in both the small and large scale.

Its review is pending but we’ll give you a shout when the full text is available. The great folks at UPSTREAM where Lena wrote the thesis are contributing the text back to the wider community. Thank you Lena & UPSTREAM!

Categories: Companies, NoSQL