Using Python to Compare Document IDs in Two CouchDB Databases

I’m doing a bit of research into what may or may not be an issue with a specific database in our BigCouch cluster, but regardless of the outcome of that side of things I thought I’d share how I used Python and couchdb-python to dig into the problem.

In our six-server BigCouch cluster we noticed that on the database for one of our most heavily trafficked applications the document counts displayed in Futon for each of the cluster members don’t match. As I said above this may or may not be a problem (I’m waiting on further information on that particular point), but I was curious which documents were missing from the cluster member that has the lowest document count. (The interesting thing is the missing documents aren’t truly inaccessible from the server with the lower document count, but we’ll get to that in a moment.)

BigCouch is based on Apache CouchDB but adds true clustering as well as some other very cool features. For those of you not familiar with CouchDB, you communicate with it through a RESTful HTTP interface and all the data coming and going is JSON. The point here is that it’s very simple to interact with CouchDB with any tool that talks HTTP.

Dealing with raw HTTP and JSON may not be difficult but isn’t terribly Pythonic either, which is where couchdb-python comes in. couchdb-python lets you interact with CouchDB via simple Python objects and handles the marshaling of data between JSON and native Python datatypes for you. It’s very slick, very fast, and makes using CouchDB from Python a joy.

In order to get to the bottom of my problem, I wanted to connect to two different BigCouch cluster members, get a list of all the document IDs in a specific database on each server, and then generate a list of the document IDs that don’t exist on the server with the lower total document count.

Here’s what I came up with:

>>> import couchdb
>>> couch1 = couchdb.Server('http://couch1:5984/')
>>> couch2 = couchdb.Server('http://couch2:5984/')
>>> db1 = couch1['dbname']
>>> db2 = couch2['dbname']
>>> ids1 = []
>>> ids2 = []
>>> for id in db1:
...     ids1.append(id)
... 
>>> for id in db2:
...     ids2.append(id)
... 
>>> missing_ids = list(set(ids1) - set(ids2))

What that gives me, thanks to the awesomeness of Python and its ability to subtract one set from another (note that you can also use the difference() method on the set object to achieve the same result), is a list of the document IDs that are in the first list that aren’t in the second list.
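To make the set subtraction concrete, here is a minimal sketch that runs without a CouchDB server. The missing_ids helper and the sample IDs are made up for illustration; in practice the two lists would come from iterating over db1 and db2.

```python
def missing_ids(source_ids, target_ids):
    """Return the IDs found in source_ids but not in target_ids, sorted."""
    # set.difference() accepts any iterable, so only the first list
    # needs to be converted to a set explicitly.
    return sorted(set(source_ids).difference(target_ids))


# Hypothetical document IDs standing in for the contents of two databases.
ids1 = ["doc-a", "doc-b", "doc-c", "doc-d"]
ids2 = ["doc-a", "doc-c"]

missing = missing_ids(ids1, ids2)
print(missing)
```

Sorting the result is optional; it just makes the output stable when you want to compare runs.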

The interesting part came when I took one of the supposedly missing IDs and tried to pull up that document from the database in which it supposedly doesn’t exist:

>>> doc = db2['supposedly_missing_id_here']

I was surprised to see that it returned the document just fine, meaning it must be getting it from another member of the cluster, but I’m still digging into what the expected behavior is on all of this. (It’s entirely possible I’m obsessing over consistent document counts when I don’t need to be.)

So what did I learn through all of this?

  • The more I use Python the more I love it. Between little tasks like this and the fantastic experience I’m having working on our first full-blown Django project, I’m in geek heaven.
  • couchdb-python is awesome, and I’m looking forward to using it on a real project soon.
  • Even though we’ve been using CouchDB and BigCouch with great success for a couple of years now, I’m still learning what’s going on under the hood, which for me is a big part of the fun.

CouchDB Tip: When You Can’t Stop the Admin Party

I was setting up a new CouchDB 1.2 server today on Ubuntu Server, specifically following this excellent guide since sudo apt-get install couchdb still gets you CouchDB 0.10. Serious WTF on the fact that the apt installation method is years out of date — maybe I should figure out who to talk to about it and volunteer to maintain the packages if it’s just a matter of people not having time.

The installation went fine until I attempted to turn off the admin party: after I submitted the form containing the initial admin user’s name and password, things just spun indefinitely. Adding the admin user info manually to the [admins] section of the local.ini file didn’t work for me either, since the password I typed into the file wasn’t automatically hashed on a server restart like it used to be.

Long and short of it is if you see this happening, chances are there’s a permission problem with your config files, which are stored (if you compile from source) in /usr/local/etc/couchdb. In my case that directory and the files therein were owned by root and I’m not running CouchDB as root, so when I tried to fix the admin party the user that’s running CouchDB didn’t have permission to write to the files.

A quick chown on that directory structure and you’re back to being an admin party pooper.
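If you’d rather confirm the ownership problem before reaching for chown, something like the following sketch can list the offending paths. The files_not_owned_by helper is my own invention for illustration, and the user ID you pass in is whatever account runs CouchDB on your box (an assumption you’ll need to fill in yourself).

```python
import os


def files_not_owned_by(root, uid):
    """Return every path under root (root included) whose owner is not uid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then each file inside it.
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            if os.stat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)
```

Calling files_not_owned_by('/usr/local/etc/couchdb', couch_uid) with the numeric ID of the CouchDB user should return an empty list once the ownership is sorted out.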

CouchDB Resources List

Since I did quite a bit of research for my post on authentication and security in CouchDB I figured I’d share what I came across as a link dump. Enjoy!

Reference Material

General Info and Tutorials

CouchDB in Government

General Case Studies

Search

The Definitive Guide to CouchDB Authentication and Security

With a bold title like that I suppose I should clarify a bit. I finally got frustrated enough with all the disparate and seemingly incomplete information on this topic to want to gather everything I know about this topic into a single place, both so I have it for my own reference but also in the hopes that it will help others.

Since CouchDB is just an HTTP resource and can be secured at that level along the same lines as you’d secure any HTTP resource, I should also point out that I will not be covering things like putting a proxy in front of CouchDB, using SSL with CouchDB, or anything along those lines. This post is strictly limited to how authentication and security work within CouchDB itself.

CouchDB security is powerful and granular but frankly it’s also a bit quirky and counterintuitive. What I’m outlining here is my understanding of all of this after taking several runs at it, reading everything I could find on the Internet (yes, the whole Internet!), and a great deal of trial and error. That said if there’s something I’m missing or not stating accurately here I would LOVE to be corrected.

Basically the way security works in CouchDB is that users are stored in the _users database (or elsewhere if you like; this can be changed in the config file), and security revolves around three user roles:

  • Server admin
  • Database admin
  • Database reader

Notice one missing? That’s right, there is not a defined database reader/writer or database writer role. We’ll get to that in a minute. And of course you can define your own roles provided that you write the functionality to make them meaningful to your databases.

Here’s how the three basic roles play out:

  • Server admins can do anything across the entire server. This includes creating/deleting databases, managing users, and full admin access to all databases, i.e. full CRUD on all documents as well as the ability to create/modify/delete views, run compaction and replication, etc. In short, god mode.
  • Database admins have full read/write access (including design documents) on specific databases and can also modify security settings on a specific database. (I don’t know if database admins can manage replication because I did not test that specifically.)
  • Database readers can only read documents and views on a specific database, and have no other permissions.

Even given all of this, reading and writing in CouchDB needs more clarification so you know what is and isn’t allowed:

  • By default all databases are read/write enabled for anonymous users, even if you define database admins on a database. Note that this includes the ability to call design documents via GET, but does not include the ability to create/edit/delete design documents. Once you turn off admin party you have to be a server or database admin in order to manage design documents.
  • If you define any database readers on a database anonymous reads are disabled, but anonymous writes (of regular documents, not design documents) are still enabled.
  • In order to prohibit anonymous writes, you must create a design document containing a validation function in each database to handle this (much more on this below).
  • Regardless of any other settings server admins always have full access to everything, with the exception that if you create a validation function the admin user’s access is impacted by any rules in that validation function. More on this below, but basically if you create a validation function looking for a specific user or role and the admin user doesn’t match the criteria, they’ll be blocked just like anyone else.

Blocking Anonymous Writes

So now we come to the issue of blocking anonymous writes (meaning create/update/delete), and it’s simple enough but I have no idea why this isn’t done at the user level. Maybe there’s a logical reason that isn’t written down anywhere, but why you can’t create a reader/writer user or role is a mystery to me.

But enough whining. Here’s how you do it.

To block anonymous writes you have to create a design document in the database that contains what’s called a validation function. This basically means that your design document must contain a validate_doc_update field, and the ID for this document follows the standard pattern for design documents, e.g. something like _design/blockAnonymousWrites. The value of the validate_doc_update field is a function that will be run before all write operations, and it takes the new document, the old document (which would be null on create operations), and the user context as arguments. This gives you access to everything you need to do simple things like check for a valid user, or more complex things like seeing if specific fields exist in the document that’s about to be written or even if there are conflicts on an update operation with the old version of the document that you want to reject.

Here’s a sample validation function that simply checks for a specific user name, foo, and rejects the write operation if the user is not foo:

function(new_doc, old_doc, userCtx) {
  if (userCtx.name != 'foo') {
    throw({forbidden: "Not Authorized"});
  }
}

The userCtx object has properties of name and roles. The name property is the user name as a string, and roles is an array of role strings.

Let’s say you wanted to limit write operations to the role bar. To accomplish that you’d use JavaScript’s indexOf() method on the userCtx.roles array to see if the required role exists:

function(new_doc, old_doc, userCtx) {
  if (userCtx.roles.indexOf('bar') == -1) {
    throw({forbidden: "Not Authorized"});
  }
}

Obviously on top of all of this you have access to all the properties of the document being posted as well as the old document if it’s a revision, and you can use all that information to do whatever additional validation you need on the document data itself before allowing the document to be written to the database.
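Since the validation function has to live inside a design document as a string, it can help to see the whole document assembled. Here’s a rough Python sketch that builds one for the role-based check; the build_validation_design_doc helper is hypothetical, and the resulting JSON is what you would PUT to the database at the design document’s ID.

```python
import json

# JavaScript source of the validation function; it is stored in the
# design document as a plain string. %s is replaced with the role name.
VALIDATE_TEMPLATE = """function(new_doc, old_doc, userCtx) {
  if (userCtx.roles.indexOf(%s) == -1) {
    throw({forbidden: "Not Authorized"});
  }
}"""


def build_validation_design_doc(required_role,
                                doc_id="_design/blockAnonymousWrites"):
    """Return a design document dict that rejects writes lacking required_role."""
    # json.dumps() quotes and escapes the role so it is a valid JS string literal.
    js_source = VALIDATE_TEMPLATE % json.dumps(required_role)
    return {"_id": doc_id, "validate_doc_update": js_source}


design_doc = build_validation_design_doc("bar")
print(json.dumps(design_doc, indent=2))
```

Keeping the JavaScript in a template means the same helper works for any role name without hand-editing the function each time.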

Creating Users

As far as creating users is concerned you can either do this in Futon or (as with everything in CouchDB) via the HTTP API. If you create users via Futon, be aware that if you are logged in as admin and click the “Setup more admins” link you’re creating a server admin. That means they have permission to do literally everything on that CouchDB server.

If you want to create a non-admin user make sure you’re logged out and click on the “Signup” link, and you can create a user that way. Note that this doesn’t work on BigCouch if you’re hitting Futon on port 5984 since the _users database lives on port 5986 in BigCouch, and that backend port is by default only accessible via localhost; more on that below. And big thanks to Robert Newson on the CouchDB mailing list for pointing that out since I was tearing my hair out a bit after my recent migration to BigCouch.

If you want to create users via the HTTP API, in CouchDB 1.2 or higher you simply do a PUT to the _users database via curl or another HTTP tool, or make an HTTP call via your favorite scripting language. I’ll show all the examples in curl since it’s language agnostic and universally available (not to mention because I find curl so damn handy).

curl -X PUT http://mycouch:5984/_users/org.couchdb.user:bob -d '{"name":"bob", "password":"bobspassword", "roles":[], "type":"user"}' -H "Content-Type: application/json"  

That will create a user document with an ID of org.couchdb.user:bob and a user name of bob, and bob is not a server admin. In CouchDB 1.2 it will see the password field in the document and automatically create a password salt and hash the password for you.

On versions of CouchDB prior to 1.2, or with servers based on versions of CouchDB prior to 1.2 such as BigCouch 0.4.0 (which is based on CouchDB 1.1.1), the auto-salt-hash bit does not happen. This means you need to salt and hash the password information and store the hashed password and the salt in the user document.

As a reminder in case you weren’t paying attention earlier: On BigCouch the _users database is on port 5986. This had me banging my head against my desk for the better part of an afternoon. It’s probably documented somewhere but you know geeks and reading manuals, so I’m sharing that important tidbit in the hopes it helps someone else.

To create a user on CouchDB < 1.2 or BigCouch 0.4.0 (which again is based on CouchDB 1.1.1) you first need to:

  • Create a salt
  • Hash the concatenation of the password and the salt using SHA1
  • Include the salt used as the salt property of your user document, and the hashed password as the password_sha property of your user document

There are numerous ways to do all of this and you can see some examples in various languages and technologies on the CouchDB wiki, but since openssl is standard and a quick and easy way to do things I’ll recap that method here.

First you need to generate a salt:
SALT=`openssl rand 16 | openssl md5`

Next echo that out just to make sure it got set properly:
echo $SALT

Next you concatenate whatever password you want + the salt, and then hash the password using SHA1:
echo -n "thepasswordhere$SALT" | openssl sha1

One caveat: if the output of echo $SALT begins with (stdin)=, like so:
(stdin)= 4e8096c4d0047e8d535df4b356b8d102

Make sure NOT to include the (stdin)= part in what you’re going to put into CouchDB. Ignore (stdin)= and the space that follows and use only the hex value.

After generating a salt and hashing the password the end result that you put in CouchDB looks something like this (you’d obviously replace thehashedpassword and thesalt with the appropriate values):
curl -X PUT http://mycouch:5984/_users/org.couchdb.user:bob -d '{"name":"bob", "password_sha":"thehashedpassword", "salt":"thesalt", "roles":[], "type":"user"}' -H "Content-Type: application/json"

Of course if you know when you’re creating the user that you want to grant them a specific role, you’d put that in the roles array. These roles will be contained in userCtx.roles in validation functions and you can act on that accordingly (see the above discussion about validation functions for more details).
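If you’d rather not shell out to openssl, the same salt-and-hash steps can be sketched in Python. The helper names below are mine, and the salt here is simply 16 random bytes rendered as hex rather than an MD5 of random bytes as in the openssl example; as far as I can tell any random hex string does the job, but treat that as an assumption.

```python
import hashlib
import os


def make_salt():
    """Return 16 random bytes as a 32-character hex string."""
    return os.urandom(16).hex()


def hash_password(password, salt):
    """SHA1-hash the password concatenated with the salt, returned as hex."""
    return hashlib.sha1((password + salt).encode("utf-8")).hexdigest()


def build_user_doc(name, password, roles=None):
    """Build the body of a pre-1.2-style user document (password_sha + salt)."""
    salt = make_salt()
    return {
        "name": name,
        "password_sha": hash_password(password, salt),
        "salt": salt,
        "roles": list(roles or []),
        "type": "user",
    }


user_doc = build_user_doc("bob", "bobspassword", roles=["bar"])
print(sorted(user_doc.keys()))
```

The dictionary matches the JSON body used in the curl command above, so you can serialize it and PUT it to org.couchdb.user:bob in the _users database.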

And again note that if you’re on BigCouch use port 5986 for the _users database!

Summary

To sum all this up, here’s a handy-dandy chart.

  • If you want to: Allow anonymous access to all functionality including creating and deleting databases
    You need to: Do nothing! Leave admin party turned on. (At your own risk, of course.)
  • If you want to: Disable anonymous server admin functionality (create/delete databases, etc.) but continue to allow anonymous read/write access (not including design documents) on all databases
    You need to: Create at least one server admin user by clicking the “Fix this!” link next to the admin party warning on the lower right in Futon.
  • If you want to: Allow a user who is not a server admin to have admin rights on a specific database
    You need to: Create a non-server-admin user and assign them (by name or role) to be a database admin user on the specific database. This can be done via the “Security” icon at the top of Futon when you’re in a specific database, or via the HTTP API.
  • If you want to: Block anonymous reads on a specific database
    You need to: Create a non-server-admin user in CouchDB and assign them (by name or role) to be a database reader on the specific database. This can be done via the “Security” icon at the top of Futon when you’re in a specific database, or via the HTTP API.
  • If you want to: Block anonymous writes on a specific database
    You need to: Create a non-server-admin user in CouchDB and create a design document in the database that includes a validation function, specifically in a validate_doc_update property in the design document. The value of this property is a function (that you write) to check for a specific user name or role in the userCtx argument that is passed to the function, and you would throw an error in the function if the user or role is not one you want to write to the database.
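For the “via the HTTP API” route to assigning database admins and readers, my understanding is that each database has a _security object with an admins section and a readers section, each holding names and roles arrays (newer CouchDB releases call the second section members, which is why the key is a parameter below). Treat the field names as assumptions to check against your version; the helper itself is purely illustrative.

```python
import json


def build_security_object(admin_names=(), admin_roles=(),
                          reader_names=(), reader_roles=(),
                          reader_key="readers"):
    """Build the JSON body for a database's _security object.

    reader_key is "readers" on the CouchDB versions discussed here;
    pass "members" if your server uses the newer name.
    """
    return {
        "admins": {"names": list(admin_names), "roles": list(admin_roles)},
        reader_key: {"names": list(reader_names), "roles": list(reader_roles)},
    }


security = build_security_object(admin_names=["bob"], reader_roles=["bar"])
print(json.dumps(security))
```

You would then PUT that JSON to the _security resource of the database in question while authenticated as a server or database admin.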

And that’s more or less all I know about CouchDB security. I’ll end with some links if you want to explore further.

Any questions, corrections, or suggestions for clarification are very welcome. Hope some of you found this helpful!

Security/Validation Function Links

 

 

Revisiting Retrieving Documents Between Two Dates From CouchDB

In a previous post I outlined how I was retrieving documents from CouchDB with a start date property less than the current date, and an end date property greater than the current date. To summarize, in my CouchDB view I created some date/time strings in JavaScript and only emitted documents in the view that met the date criteria.

My previous post got referenced in the CouchBase newsletter, and I’m really glad it did because while I came up with what I thought was a clever solution it was also wrong. (D’OH!)

The issue I didn’t consider that some kind commenters on the previous post pointed out is that my approach creates side effects because I’m emitting documents in the view based on information that isn’t in the document itself. Specifically since I’m using the current system date/time when the view is created, the documents included in the view will be ones for which the criteria is valid when the view is created.

What this means is that although views get updated with current data as data within documents changes, the entire view isn’t regenerated each time, so the criteria used to determine whether or not documents are included in the view are evaluated at a fixed point in time. To put it another way, the system date/time that was current when the view was first created essentially becomes hard-coded once the view is created, which isn’t at all what I needed. This causes issues if the start and end date properties in the documents change after they’ve been added to the view, because the view only checked whether the date criteria were met at the time the document was added to the view.

There are some great suggestions in the comments on my previous post for including data in the document itself that would allow only valid documents to be pulled right from Couch, and you’ll certainly want to check those out if you’re dealing with a ton of data. The solution I’m using will not be ideal for massive datasets but since that isn’t the situation I’m in with this data, I wanted to share the solution I came up with in case this works for other people.

To describe my documents again, I have documents that need to be displayed on a web page if their start date/time property is less than the current date/time and if their end date/time property is greater than the current date/time.

Since the valid ranges go in opposite directions for those fields, I didn’t see a way to do something like have an array key that included both the start and end dates that would allow me to get only the documents I want back from Couch. But what I can do is use a single document property as a key in Couch and get close to what I want, and then I can pare the documents down further in the application code.

In my case the end date is the stronger limiting criterion, since over time there will be a large number of documents with both start and end dates in the past, but documents with end dates >= the current date will be much fewer in number (only a handful in the case of this specific data).

The first step to fix my issue was to rewrite my view to eliminate the date/time check in JavaScript since that’s the cause of the unwanted side effect, and emit documents using the end date/time property as the key. I have some other criteria as well (checking type and a couple of other fields to pull valid documents for this particular display), but the basic view is now very simple:


function(doc) {
  emit(doc.dtEnd, doc);
}

With the end date/time as the key, on the application side I can simply use the current date/time as my start key when I call this view, and that gives me all documents with a valid end date/time (>= current date/time).

At this point I may still have documents that shouldn’t be displayed based on the start date/time, however, since when people enter data into this application they can schedule things for future display (i.e. both start and end date/time are in the future). But, again since I’m not dealing with a huge amount of data once I limit by the end date/time, it’s simply a matter of looping over the documents I get back from Couch and checking for a valid start date/time (<= current date/time) and only displaying those documents.
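My application code isn’t Python, but the two application-side steps (build the start key from the current date/time, then drop documents whose start date is still in the future) can be sketched like this. The function names and sample rows are invented for the example; the rows are shaped like what a view returns when the map function emits (doc.dtEnd, doc).

```python
from datetime import datetime

# Same layout as the date strings stored in the documents.
DATE_FORMAT = "%Y/%m/%d %H:%M:%S"


def current_key(now=None):
    """Format a datetime (default: right now) as a view key string."""
    return (now or datetime.now()).strftime(DATE_FORMAT)


def currently_visible(rows, now_key):
    """Keep only documents whose start date/time has already passed."""
    return [row["value"] for row in rows if row["value"]["dtStart"] <= now_key]


# Hypothetical rows already limited by dtEnd >= now via the start key.
rows = [
    {"key": "2011/09/30 17:00:00",
     "value": {"_id": "a", "dtStart": "2011/09/01 08:00:00",
               "dtEnd": "2011/09/30 17:00:00"}},
    {"key": "2011/12/31 17:00:00",
     "value": {"_id": "b", "dtStart": "2011/10/01 00:00:00",
               "dtEnd": "2011/12/31 17:00:00"}},
]

now_key = current_key(datetime(2011, 9, 15, 12, 0, 0))
visible = currently_visible(rows, now_key)
print([doc["_id"] for doc in visible])
```

Because the comparison happens at request time rather than inside the view, the “current” date is always actually current.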

The issue my original view code created makes total sense now, so thanks to the commenters on my previous post who pointed out the fatal flaw in my approach. Nothing like doing something wrong as a means of learning.

Retrieving Documents Between Two Dates From CouchDB

I’m working on converting yet another application from using SQL Server to using CouchDB, and this morning I’m working with some announcement documents that are displayed based on their start and end date. There are numerous ways to approach this problem but I thought I’d share what I came up with in case this solution helps others, and also to see if there’s maybe another approach I didn’t consider.

First, since there is no date datatype in JSON, we’ve standardized (for better or worse) on storing dates as a string with the format “YYYY/MM/DD HH:MM:SS”, e.g. “2011/08/27 09:22:36”, so date and time separated by a space, always with leading zeros for single digits, and always using a 24-hour clock. This allows date/time strings to sort properly when they’re used as keys, it’s easy to split the string using the space if you need either just the date or just the time, and since this application is for my day job the time will always be in Eastern US time so we decided not to care about the timezone offset.
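As a quick sanity check on the claim that this format sorts properly, here’s a small Python sketch. The to_couch_date name is made up; the format string is just my rendering of “YYYY/MM/DD HH:MM:SS” with zero padding and a 24-hour clock.

```python
from datetime import datetime

COUCH_DATE_FORMAT = "%Y/%m/%d %H:%M:%S"


def to_couch_date(moment):
    """Render a datetime as a zero-padded, 24-hour 'YYYY/MM/DD HH:MM:SS' string."""
    return moment.strftime(COUCH_DATE_FORMAT)


moments = [
    datetime(2011, 8, 27, 9, 22, 36),
    datetime(2011, 8, 3, 14, 0, 0),
    datetime(2010, 12, 31, 23, 59, 59),
]
as_strings = [to_couch_date(m) for m in moments]

# Sorting the strings gives the same order as sorting the datetimes.
strings_sorted = sorted(as_strings)
chronological = [to_couch_date(m) for m in sorted(moments)]
print(strings_sorted == chronological)
```

The leading zeros are what make this work; without them “2011/8/3” would sort after “2011/8/27”.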

In the data I imported from SQL Server there is a dtStart and a dtEnd field so I just converted the SQL Server dates to our preferred CouchDB date format as I imported the data into CouchDB. So far so good.

The next step was to pull these documents from CouchDB based on their dtStart and dtEnd fields, and this is probably obvious but just so it’s clear, I need to pull all documents of this type where dtStart <= now, and dtEnd >= now.

As I started creating my view in CouchDB for this, my first thought was to pull all the documents using an array including dtStart and dtEnd as the key. That way when I call the view I could, in theory, use a start and end key to get me the documents in the range of dates that I want.

That approach seems reasonable at first, but when you start trying to put it into practice things get weird rather quickly. This is because what you wind up needing is documents in which the first element of the key array is less than the current date, while the second element of the key array is greater than the current date. Maybe this is just “Saturday morning brain” on my part, but I didn’t see a way to include both the start and end date in the key and get where I needed to go.

My next thought was to use only the end date as the key. This gets me a bit closer to what I need since I can at least use a start key to only get documents with an end date >= now, but I’m still faced with having to check the start date at the application level to see if the document is supposed to be displayed.

I’m sure there’s some clever way to handle this situation with keys, and part of my reason for posting this is to see how others would approach this, but I messed around with keys for a while and didn’t seem to be getting anywhere so I decided to take a different approach.

One of the great things about CouchDB is the fact that you have the full power of JavaScript available in your views. Although JSON doesn’t know what a date is, JavaScript certainly does, so I decided that since I needed to pull things based on a specific date range across two fields in my documents the best place to handle that was in the view code itself.

Here’s what I came up with for my map function:


function(doc) {
  // Build the current date/time as a 'YYYY/MM/DD HH:MM:SS' string.
  var d = new Date();
  var curYear = d.getFullYear();
  var curMonth = (d.getMonth() + 1).toString();
  var curDate = d.getDate().toString();
  var curHours = d.getHours().toString();
  var curMinutes = d.getMinutes().toString();
  var curSeconds = d.getSeconds().toString();

  if (curMonth.length == 1) {
    curMonth = '0' + curMonth;
  }

  if (curDate.length == 1) {
    curDate = '0' + curDate;
  }

  if (curHours.length == 1) {
    curHours = '0' + curHours;
  }

  if (curMinutes.length == 1) {
    curMinutes = '0' + curMinutes;
  }

  if (curSeconds.length == 1) {
    curSeconds = '0' + curSeconds;
  }

  var dateString = curYear + '/' + curMonth + '/' + curDate + ' ' +
      curHours + ':' + curMinutes + ':' + curSeconds;

  // Only emit announcements whose date range includes the current date/time.
  if (doc.type == 'announcement' &&
      doc.dtStart <= dateString &&
      doc.dtEnd >= dateString) {
    emit(doc.dtEnd, doc);
  }
}

Now of course you could argue this would all be simpler if I stored the dtStart and dtEnd fields in my documents as milliseconds, because then I could just get the millisecond value of the current date and do a quick numeric comparison instead of all the string formatting and concatenation, and from that perspective you’d be absolutely right. One of the many things I love about CouchDB, however, is the ability to jump into Futon and more directly and easily interact with my data, so keeping the dates human readable is kind of nice. Now I could store both a string and the millisecond value I suppose, but since this did the trick I decided to leave well enough alone.

I’m very curious to hear how others might solve this problem. “You’re doing it wrong” information would be quite welcome. 😉

String Matching in CouchDB Views

We’re in the process of porting an application that has been running on SQL Server over to the fabulous and amazing CouchDB. We were originally under the impression that everyone accessing data from this application in their own code was doing so through our web service, which would have made our job pretty simple since we could swap the guts of the web service methods out and return the same data types to the caller, but upon further investigation we discovered that people had written their own custom queries directly against the database.

This alone isn’t a big deal but in some cases people were running queries that included LIKE clauses, and since we opted not to install CouchDB-Lucene given both time constraints as well as the fact that the LIKE queries against SQL Server were pretty limited in scope and number, I thought I’d share what we came up with to do string matching in views in CouchDB.

This is by no means to suggest you should not use CouchDB-Lucene if you want true full-text searching against data in CouchDB, but in our case this was an acceptable compromise.

Matching Fields That Start With a String in Couch

SQL Equivalent: "WHERE field LIKE 'foo%'"

Let’s assume I have a database called test and in that database I have documents that have fields of firstName and lastName. I want to write a view that will let me do wildcard matches against first names that begin with a string.

This turns out to be pretty simple given how keys work in CouchDB map functions. Since a view emits a key and a value and we can use start and end keys in our calls to CouchDB, we simply provide the string against which we want to match as our start key and some end key that will ensure we don’t get back more than what we’re wanting.

For example, let’s say I want to match all documents in my database that start with ‘Mat’ so I can retrieve all people with a first name of Matt, or Matthew, or Mathew, or Mat, or Mathias … you get the idea.

First I write a view that in its map function emits firstName as the key:

function (doc) {
  if (doc.firstName && doc.lastName) {
    emit(doc.firstName, doc);
  }
}

Assume that my design document is ‘people’ and that’s the map function for a view called ‘byFirstName.’ To call that view and get back only people with a first name starting with ‘Mat’ I use the following URL:

http://couch/test/_design/people/_view/byFirstName?startkey="Mat"&endkey="MatZ"

In case that wraps poorly in the blog post display, here’s just the start and end keys:

startkey="Mat"
endkey="MatZ"

That tells CouchDB to start its output for that view with anything that starts with Mat and end once it hits anything that starts with MatZ.
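Here’s a hedged Python sketch of building those query parameters. The prefix_range_params helper is hypothetical. One thing worth knowing: as I understand CouchDB’s collation, an end key of "MatZ" can cut off names that sort after it (a hypothetical “Matzo”, say), so the CouchDB documentation commonly suggests appending a very high Unicode character such as \ufff0 instead; the helper defaults to that and lets you pass "Z" to reproduce the URL above.

```python
import json
from urllib.parse import urlencode


def prefix_range_params(prefix, sentinel="\ufff0"):
    """Return startkey/endkey params matching keys that begin with prefix.

    Keys must be JSON-encoded, hence json.dumps() to add the quotes.
    """
    return {
        "startkey": json.dumps(prefix),
        "endkey": json.dumps(prefix + sentinel),
    }


# Reproduce the keys used in the URL above.
params = prefix_range_params("Mat", sentinel="Z")
query = urlencode(params)
print(query)
```

urlencode() percent-encodes the double quotes, which is what you want when the keys go into a real URL.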

Matching Specific Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%KnownString%'"

We had some use cases where users had canned queries (i.e. users can’t enter random search terms) that were looking for a specific term contained anywhere within a specific field. I say specific term here and in the example I use “KnownString” because if you know the string ahead of time this is a simple problem to solve, whereas ad hoc terms are more problematic, but I’ll address that below.

Remember that within CouchDB views you have full access to JavaScript, so solving this use case is simply a matter of using a regex to match against the known term.

Let’s say I want to pull all documents that have a bio field containing the term ‘CouchDB’:

function(doc) {
  if (doc.bio && doc.bio.toUpperCase().match(/\bCOUCHDB\b/)) {
    emit(doc._id, doc);
  }
}

Again, since I know the term ahead of time I can do a regex match against it quite easily in my view.

Matching Ad Hoc Strings Contained in Fields

SQL Equivalent: "WHERE field LIKE '%adHocSearchTerm%'"

Where things get tricky in CouchDB without using something like CouchDB-Lucene is matching ad hoc strings. “Tricky” is actually putting it mildly, because the real story is you can’t do this in CouchDB. So in use cases where people had code that had a search box into which users could type anything, we had to come up with another solution.

What I’ve found as I’ve been using CouchDB more and more is that it can shift things that you used to do in the database layer up into the application layer, and vice-versa. So in this case it was simply a matter of coming up with a view that pulled back a subset of documents into the application code, and then doing the matching there.

One caveat here is that since our database contains thousands of documents, it wasn’t really feasible to pull back all the documents in the database and then perform matching in the application layer. Since these documents all have a date associated with them, what we wound up doing is using date range as start and end keys as a way of reducing the number of documents we have to match against in the application. This wasn’t a huge burden on users and certainly will improve performance.

We wound up limiting documents returned by year (i.e. the users have to choose a year in which to search), which is enough of a range to not make things too annoying for users, but is also a small enough set of documents not to kill performance on the application side.

To call the view that uses date as its key, the URL params look like this to pull back all documents for 2011 in descending date order:

?startkey="2012/01/01"&endkey="2011/01/01"&descending=true

Remember that when you order descending you essentially flip the start and end keys around, which is why 2012/01/01 is used as the start key.
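
To make the key flipping concrete, here is a small Python sketch that builds those URL params for any year. It assumes the view's keys are date strings shaped like "YYYY/MM/DD" as in the example above; the year_query function is my own name for illustration, not part of any library.

```python
import json
from urllib.parse import urlencode

def year_query(year, descending=True):
    """Build view query parameters covering one calendar year.

    Assumes the view's keys are date strings shaped like "YYYY/MM/DD".
    """
    first_day = "%d/01/01" % year
    next_year = "%d/01/01" % (year + 1)
    # Descending order walks the index backwards, so the keys swap places.
    if descending:
        start, end = next_year, first_day
    else:
        start, end = first_day, next_year
    # Keys must be JSON-encoded, which is why the dates end up in double quotes.
    params = {"startkey": json.dumps(start), "endkey": json.dumps(end)}
    if descending:
        params["descending"] = "true"
    return urlencode(params)

query_2011 = year_query(2011)
```

The result is the URL-encoded form of the params shown above, so it can be appended to the view URL after a question mark.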

Once I have the documents back, I then deserialize the JSON into something usable by CFML and then loop over the documents to do my further refinement by search term.

Leaving out the subset controlled by date I described above, assuming I wanted to find all people with a bio field that contained the search term entered by a user on a form, the code winds up looking something like this:

<cfhttp url="http://server/test/_design/people/_view/hasBio"
        method="get"
        result="peopleJSON" />

<cfset peopleReturned =
        DeserializeJSON(peopleJSON.FileContent).rows />

<cfset matchingPeople = ArrayNew(1) />

<cfloop array="#peopleReturned#" index="person">
  <cfif FindNoCase(form.searchTerm, person.value.bio) neq 0>
    <cfset ArrayAppend(matchingPeople, person) />
  </cfif>
</cfloop>

What we wind up with is a matchingPeople array that contains only the people whose bio field includes the search term.
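
Since I also use Python with CouchDB, here is a rough Python sketch of the same application-side filtering. It assumes the rows have already been fetched and deserialized, and that each row's value is the full document as in the hasBio view; the function name and the sample rows are made up for illustration.

```python
def filter_rows_by_term(rows, search_term):
    """Keep only the view rows whose bio contains search_term, ignoring case."""
    needle = search_term.lower()
    return [row for row in rows
            if needle in row["value"].get("bio", "").lower()]

# Hypothetical rows, shaped like the "rows" array CouchDB returns for a view.
rows = [
    {"id": "a", "key": "a", "value": {"name": "Ann", "bio": "Erlang and CouchDB fan"}},
    {"id": "b", "key": "b", "value": {"name": "Bob", "bio": "Relational purist"}},
]

matching_people = filter_rows_by_term(rows, "couchdb")
```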

The big caveat here again is that if you have a huge number of documents you can get into trouble on the application side, so make sure to limit what you get back from CouchDB, since you’ll wind up looping over all of those documents to do your search term matching.

Hope that helps others do some quick and dirty LIKE type queries in CouchDB. If there’s a better way to do any of these I’m all ears!

cf.Objective() NoSQL BOF

Heads up that on Friday night of cf.Objective() I'll be facilitating a BOF on using NoSQL databases with CFML, so if you're interested in things like CouchDB (my favorite thing on the planet as of late), MongoDB, or any of the numerous others please come to the BOF!

All skill levels are welcome so come to learn, come to share what you've done, or come to mock crazy people like myself who think the relational model is the biggest hoax ever perpetrated on the technology world and that we should have been using document-based datastores all along. Yes, that statement is meant to incite you to come to the BOF if you think I'm wrong, but I do believe it to a certain extent. 😉

When I say I'll be facilitating a BOF I mean just that–BOFs are meant to be highly participatory, free-form discussion forums, so while I'm happy to show off what I know about CouchDB, I'd personally love to learn more about some of the other NoSQL databases from people using those, and would love to have some heated discussions about NoSQL in general.

See you Friday night at 8 pm!

Accessing and Restarting Desktop CouchDB on Ubuntu/Mint

Recent versions of Ubuntu (and Ubuntu-based distros like LinuxMint) ship with Desktop CouchDB to interact with Ubuntu One and store things like replicated bookmarks in Firefox, contacts in Evolution, and some other data.

If you want to access Futon (CouchDB's web-based admin tool) for this instance of CouchDB you need to do a bit of hunting, but I found this page on freedesktop.org that was very helpful, and I thought I'd document here as well in case I forget this information in the future (which I'm sure I will!).

Accessing Futon

Open a terminal, navigate to ~/.local/share/desktop-couch, and open couchdb.html in a browser (e.g. firefox couchdb.html), or point your browser at file:///home/<your-username>/.local/share/desktop-couch/couchdb.html directly. This takes you to a page that will redirect you to Futon after a few seconds, at which point you can see which port CouchDB is running on and what the admin user name is.

If CouchDB Desktop Isn't Running

In my case CouchDB Desktop wasn't running for some reason so I had to follow these steps to get it going again:

  1. Open a terminal and do killall beam.smp and then killall beam (do this as your user, not as root or using sudo). I got 'no process found' errors in both cases but this will make sure all CouchDB Desktop processes have been killed.
  2. Again in a terminal, do rm ~/.config/desktop-couch/desktop-couch.ini
  3. Still in your trusty terminal, do dbus-send --session --dest=org.desktopcouch.CouchDB --print-reply --type=method_call / org.desktopcouch.CouchDB.getPort
    This will restart CouchDB and tell you what port it's running on.
  4. Open the couchdb.html file referenced above and you should be redirected to Futon
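
If you would rather script step 3, here is a rough Python sketch that runs the same dbus-send command and pulls the port out of the reply. It assumes the printed reply contains a line such as "int32 12345", which is how dbus-send normally prints an integer return value; the function names are mine and I have only sketched this, so treat it as a starting point.

```python
import re
import subprocess

# The same command as step 3, split into arguments.
DBUS_COMMAND = [
    "dbus-send", "--session", "--dest=org.desktopcouch.CouchDB",
    "--print-reply", "--type=method_call",
    "/", "org.desktopcouch.CouchDB.getPort",
]

def parse_port(reply_text):
    """Pull the port number out of dbus-send's printed reply.

    Assumes the reply contains a line such as "   int32 12345".
    """
    match = re.search(r"\bint32\s+(\d+)", reply_text)
    return int(match.group(1)) if match else None

def desktop_couch_port():
    """Run the dbus-send command and return the reported port, or None."""
    result = subprocess.run(DBUS_COMMAND, capture_output=True, text=True)
    return parse_port(result.stdout)
```

Calling desktop_couch_port() as your own user (not root) should return the port number, or None if the reply could not be parsed.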

Grails + CouchDB #s2gx

Scott Davis – thirstyhead.com

NoSQL Databases in General

  • given the number of big companies using them, clearly they're ready to use today
  • time to re-examine our unnatural obsession for relational databases
  • rdbms has been around for 50 years now–well understood, great tooling, lots of information
  • rdbmses are silos
    • still good at what they do, but aren't necessarily well-suited to all data
  • as developers we're being forced to use sql to express something that's crucial to the success of our applications
    • not our native language, kind of foreign when it comes down to it
  • we use orm to insulate ourselves from sql
    • express yourself in the native language of your choice instead of in sql

Is ORM State of the Art?

  • really just a bridge
  • why aren't there pure java or groovy datastores?
  • persistence is pretty uninteresting to developers
  • orm is a reasonable bridge, but a rather leaky abstraction as well
  • ted neward: orm is the vietnam of computer science
    • "[ORM] represents a quagmire which starts well, gets more complicated as time passes, and before long entraps its users in a commitment that has no clear demarcation point, no clear win conditions, and no clear exit strategy."

What Drew Me to CouchDB

  • what if i didn't have to bridge technologies anymore?
  • what if i could save my objects in their native format?
    • couchdb is actually a json datastore, but grails makes it trivial to transfer pogo <-> json
  • just need a thin translation layer

NoSQL Solutions

  • Google BigTable
  • mongoDB
  • CouchDB
  • Cassandra
    • "this is the future, but no one believes us"
  • each of these is a bit different and has its own strengths and weaknesses
  • NoSQL = "not only SQL"
  • don't think of nosql solutions as just another database; truly different way to think about persistence
  • if you think of it as just another database, it'll be the worst database you've ever used
  • need to get out of the mindset of "spreadsheet" type format for data
  • start thinking more about the right tool for the job

CouchDB History

  • starting point was Lotus Notes
    • largely ahead of its time
    • document database
    • not brand-new stuff–the ideas and foundation have been around for a very long time
  • Apache project

RDBMS vs. CouchDB

  • rdbms
    • row/column oriented
    • language: sql
    • insert, select, update, delete
  • CouchDB
    • if your data has a more vertical orientation as opposed to horizontal, starts to look more like attachments
    • email is a good example: to, from, body, attachment
    • language: javascript (map/reduce functions)
    • put, get, post, delete (REST)
    • "Django may be built for the Web, but CouchDB is built of the Web." — Jacob Kaplan-Moss, Django Developer
    • can build entire apps in CouchDB
  • Couch = acronym for "cluster of unreliable commodity hardware"
  • clustering is much more difficult to do with an rdbms–couch was built from the ground up to be massively distributed, clusters out of the box
  • O'Reilly book available — free online

Using CouchDB With Grails

  • grails has native json support out of the box

import grails.converters.*

class AlbumController {
  def scaffold = true

  def listAsJson = {
    render Album.list() as JSON
  }

  def listAsXml = {
    render Album.list() as XML
  }
}

CouchDB 101

  • json up and down
  • restful interface
  • no drivers since it's just http
  • written in erlang
    • incredibly fast
    • designed for scalability and parallel processing

Installing CouchDB

  • sudo apt-get install couchdb
  • windows installer available

Kicking the Tires

  • ping
    • curl http://localhost:5984
      {"couchdb":"Welcome","version":"1.0.1"}
    • can also hit this in a browser, but of course can't do a POST from a URL in a browser
  • get databases
  • create a database
  • delete a database
  • uses standard HTTP response codes, e.g. a 201 response code for a database create
  • web UI available – "Futon"
  • create a document
  • create a document from a file
  • URIs for documents are essentially your primary key–unique way of representing the document
  • don't have to create schemas — just start throwing documents at the database
  • documents get etags so they're very cache friendly
  • documents also get revisions–keeps track of multiple versions of the document
    • have to provide version number when updating
    • versioning numbers are revision number (integer), then -, then md5 hash of the document itself
    • can explicitly compact the database to get rid of old versions to reduce size of database
  • couch prefers uuids for the ids, but you can use anything you want
  • get UUID(s) from couch
  • to update a document, you'll get the latest version of the document, then do the update, then pass your changes back to couchdb which includes the revision number
  • one of the major things couchdb gives you since it's document based is that the data is accurate at that point in time
    • if the data changes in the future, in an rdbms the old document would get the new data
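
The commands shown for these steps didn't make it into my notes, so here is a rough Python sketch of the HTTP calls involved. It only builds the requests (pass one to urllib.request.urlopen to actually send it) and assumes a CouchDB instance at http://localhost:5984; the "albums" database, the document ID, and the helper names are hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:5984"  # assumed local CouchDB instance

def couch_request(method, path, body=None):
    """Build (but do not send) an HTTP request for CouchDB."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)

# "albums" and "some-doc-id" are hypothetical names used for illustration.
ping = couch_request("GET", "/")
list_dbs = couch_request("GET", "/_all_dbs")
create_db = couch_request("PUT", "/albums")        # a 201 response means created
delete_db = couch_request("DELETE", "/albums")
get_uuids = couch_request("GET", "/_uuids?count=3")
create_doc = couch_request("PUT", "/albums/some-doc-id", {"title": "Kind of Blue"})

def update_request(db, current_doc, changes):
    """Build an update: merge changes into the latest document, keeping its _rev."""
    updated = dict(current_doc, **changes)
    return couch_request("PUT", "/%s/%s" % (db, updated["_id"]), updated)

def split_rev(rev):
    """Split a revision such as "2-abc123" into (2, "abc123")."""
    number, _, digest = rev.partition("-")
    return int(number), digest
```

The update helper mirrors the get-latest, change, send-back cycle described above: the _rev from the fetched document travels with the changes, and CouchDB rejects the update if that revision is no longer the current one.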

CouchDB With Grails

  • domain class–id and _rev as properties
  • can add couchdb stuff to Config.groovy to do stuff like create-drop for couchdb databases
  • add stuff to BootStrap.groovy
  • showing CouchDBService that has convenience methods around a lot of the URL calls to couch

Map/Reduce

  • in sql you say select firstname, lastname from foo (this is map) where state = 'NE' (this is reduce)
  • map and reduce are stored in 2 separate javascript functions