String Matching in CouchDB Views

We’re in the process of porting an application that has been running on SQL Server over to the fabulous and amazing CouchDB. We were originally under the impression that everyone accessing data from this application in their own code was doing so through our web service, which would have made our job pretty simple since we could swap the guts of the web service methods out and return the same data types to the caller, but upon further investigation we discovered that people had written their own custom queries directly against the database.

This alone isn’t a big deal but in some cases people were running queries that included LIKE clauses, and since we opted not to install CouchDB-Lucene given both time constraints as well as the fact that the LIKE queries against SQL Server were pretty limited in scope and number, I thought I’d share what we came up with to do string matching in views in CouchDB.

This is by no means to suggest you should not use CouchDB-Lucene if you want true full-text searching against data in CouchDB, but in our case this was an acceptable compromise.

Matching Fields That Start With a String in Couch

SQL Equivalent: “WHERE field LIKE ‘foo%'”

Let’s assume I have a database called test and in that database I have documents that have fields of firstName and lastName. I want to write a view that will let me do wildcard matches against first names that begin with a string.

This turns out to be pretty simple given how keys work in CouchDB map functions. Since a view emits a key and a value and we can use start and end keys in our calls to CouchDB, we simply provide the string against which we want to match as our start key and some end key that will ensure we don’t get back more than what we’re wanting.

For example, let’s say I want to match all documents in my database that start with ‘Mat’ so I can retrieve all people with a first name of Matt, or Matthew, or Mathew, or Mat, or Mathias … you get the idea.

First I write a view that in its map function emits firstName as the key:

function (doc) {
  if (doc.firstName && doc.lastName) {
    emit(doc.firstName, doc);
  }
}

Assume that my design document is ‘people’ and that’s the map function for a view called ‘byFirstName.’ To call that view and get back only people with a first name staring with ‘Mat’ I use the following URL:

http://couch/test/_design/people/_view/byFirstName?startkey="Mat"&endkey="MatZ"

In case that wraps poorly in the blog post display, here’s just the start and end keys:

startkey="Mat"
endkey="MatZ"

That tells CouchDB to start its output for that view with anything that starts with Mat and end once it hits anything that starts with MatZ.

Matching Specific Strings Contained in Fields

SQL Equivalent: “WHERE field LIKE ‘%KnownString%'”

We had some use cases where users had canned queries (i.e. users can’t enter random search terms) that were looking for a specific term contained anywhere within a specific field. I say specific term here and in the example I use “KnownString” because if you know the string ahead of time this is a simple problem to solve, whereas ad hoc terms are more problematic, but I’ll address that below.

Remember that within CouchDB views you have full access to JavaScript, so solving this use case is simply a matter of using a regex to match against the known term.

Let’s say I want to pull all documents that have a bio field containing the term ‘CouchDB’:

function(doc) {
  if (doc.bio && doc.bio.toUpperCase().match(/bCOUCHDBb/)) {
    emit(doc._id, doc);
  }
}

Again, since I know the term ahead of time I can do a regex match against it quite easily in my view.

Matching Ad Hoc Strings Contained in Fields

SQL Equivalent: “WHERE field LIKE ‘%adHocSearchTerm%'”

Where things get tricky in CouchDB without using something like CouchDB-Lucene is matching ad hoc strings. “Tricky” is actually putting it mildly, because the real story is you can’t do this in CouchDB. So in use cases where people had code that had a search box into which users could type anything, we had to come up with another solution.

What I’ve found as I’ve been using CouchDB more and more is that it can shift things that you used to do in the database layer up into the application layer, and vice-versa. So in this case it was simply a matter of coming up with a view that pulled back a subset of documents into the application code, and then doing the matching there.

One caveat here is that since our database contains thousands of documents, it wasn’t really feasible to pull back all the documents in the database and then perform matching in the application layer. Since these documents all have a date associated with them, what we wound up doing is using date range as start and end keys as a way of reducing the number of documents we have to match against in the application. This wasn’t a huge burden on users and certainly will improve performance.

We wound up limiting documents returned by year (i.e. the users have to choose a year in which to search), which is enough of a range to not make things too annoying for users, but is also a small enough set of documents not to kill performance on the application side.

To call the view that uses date as its key, the URL params look like this to pull back all documents for 2011 in descending date order:

?startkey="2012/01/01"&endkey="2011/01/01"&descending=true

Remember that when you order descending you essentially flip the start and end keys around, hence why 2012/01/01 is used as the start key.

Once I have the documents back, I then deserialize the JSON into something usable by CFML and then loop over the documents to do my further refinement by search term.

Leaving out the subset controlled by date I described above, assuming I wanted to find all people with a bio field that contained the search term entered by a user on a form, the code winds up looking something like this:

<cfhttp url="http://server/test/_design/people/_view/hasBio"
        method="get"
        result="peopleJSON" />

<cfset peopleReturned =
        DeserializeJSON(peopleJSON.FileContent).rows />

<cfset matchingPeople = ArrayNew(1) />

<cfloop array="#peopleReturned#" index="person">
  <cfif FindNoCase(form.searchTerm, person.value.bio) neq 0>
    <cfset ArrayAppend(matchingPeople, person) />
  </cfif>
</cfloop>

What we wind up with there is the matchingPeople array will contain only the people who had the search term included in their bio field.

The big caveat here again is that if you have a huge number of documents you can get into trouble on the application side, so make sure and limit what you get back from CouchDB since you’ll wind up looping over all of those documents to do your search term matching.

Hope that helps others do some quick and dirty LIKE type queries in CouchDB. If there’s a better way to do any of these I’m all ears!

cf.Objective() NoSQL BOF

Heads up that on Friday night of cf.Objective() I'll be facilitating a BOF on using NoSQL databases with CFML, so if you're interested in things like CouchDB (my favorite thing on the planet as of late), MongoDB, or any of the numerous others please come to the BOF!

All skill levels are welcome so come to learn, come to share what you've done, or come to mock crazy people like myself who think the relational model is the biggest hoax ever perpetrated on the technology world and that we should have been using document-based datastores all along. Yes, that statement is meant to incite you to come to the BOF if you think I'm wrong, but I do believe it to a certain extent. 😉

When I say I'll be facilitating a BOF I mean just that–BOFs are meant to be highly participatory, free-form discussion forums, so while I'm happy to show off what I know about CouchDB, I'd personally love to learn more about some of the other NoSQL databases from people using those, and would love to have some heated discussions about NoSQL in general.

See you Friday night at 8 pm!

Grails + CouchDB #s2gx

Scott Davis – thirstyhead.com

NoSQL Databases in General

  • given the number of big companies using them, clearly they're ready to use today
  • time to re-examine our unnatural obsession for relational databases
  • rdbms has been around for 50 years now–well understood, great tooling, lots of information
  • rdbmses are silos
    • still good at what they do, but aren't necessarily well-suited to all data
  • as developers we're being forced to use sql to express something that's crucial to the success of your application
    • not our native language, kind of foreign when it comes down to it
  • we use orm to insulate ourselves from sql
    • express yourself in the native language of your choice instead of in sql

Is ORM State of the Art?

  • really just a bridge
  • why aren't there pure java or groovy datastores?
  • persistence is pretty uninteresting to developers
  • orm is a reasonable bridge, but a rather leaky abstraction as well
  • ted neward: orm is the vietnam of computer science
    • "[ORM] represents a quagmire which starts well, gets more complicated as time passes, and before long entraps its users in a commitment that has no clear demarcation point, no clear win conditions, and no clear exit strategy."

What Drew Me to CouchDB

  • what if i didn't have to bridge technologies anymore?
  • what if i could save my objects in their native format?
    • couchdb is actually a json datastore, but grails makes it trivial to transfer pogo <-> json
  • just need a thin translation layer

NoSQL Solutions

  • Google BigTable
  • mongoDB
  • CouchDB
  • Cassandra
    • "this is the future, but no one believes us"
  • each one of these are a bit different and each has their strengths and weaknesses
  • NoSQL = "not only SQL"
  • don't think of nosql solutions as just another database; truly different way to think about persistence
  • if you think of it as just another database, it'll be the worst database you've ever used
  • need to get out of the mindset of "spreadsheet" type format for data
  • start thinking more about the right tool for the job

CouchDB History

  • starting point was Lotus Notes
    • largely ahead of its time
    • document database
    • not brand-new stuff–ideas and foundation has been around for a very long time
  • Apache project

RDBMS vs. CouchDB

  • rdbms
    • row/column oriented
    • language: sql
    • insert, select, update, delete
  • CouchDB
    • if your data has a more vertical orientation as opposed to horizontal, starts to look more like attachments
    • email is a good example: to, from, body, attachment
    • language: javascript (map/reduce functions)
    • put, get, post, delete (REST)
    • "Django may be built for the Web, but CouchDB is built of the Web." — Jacob Kaplan-Moss, Django Developer
    • can build entire apps in CouchDB
  • Couch = acronym for "cluster of unreliable commodity hardware"
  • clustering is much more difficult to do clustering–couch was built from the ground up to be massively distributed, clusters out of the box
  • O'Reilly book available — free online

Using CouchDB With Grails

  • grails has native json support out of the box

import grails.converters.* class AlbumController { def scaffold = true def listAsJson = { render Album.list() as JSON } def listAsXml = { render Album.list() as XML } } CouchDB 101

  • json up and down
  • restful interface
  • no drivers since it's just http
  • written in erlang
    • incredibly fast
    • designed for scalability and parallel processing

Installing CouchDB

  • sudo apt-get install couchdb
  • windows installer available

Kicking the Tires

  • ping
    • curl http://localhost:5984
      {"couchdb":"Welcome","version":"1.0.1"}
    • can also hit this in a browser, but of course can't do a POST from a URL in a browser
  • get databases
  • create a database
  • delete a database
  • uses standard HTTP response codes, e.g. a 201 response code for a database create
  • web UI available – "Futon"
  • create a document
  • create a document from a file
  • URIs for documents are essentially your primary key–unique way of representing the document
  • don't have to create schemas — just start throwing documents at the database
  • documents get etags so they're very cache friendly
  • documents also get revisions–keeps tracks of multiple versions of the document
    • have to provide version number when updating
    • versioning numbers are revision number (integer), then -, then md5 hash of the document itself
    • can explicitly compress the database to get rid of old versions to reduce size of database
  • couch prefers uuids for the ids, but you can use anything you want
  • get UUID(s) from couch
  • to update a document, you'll get the latest version of the document, then do the update, then pass your changes back to couchdb which includes the revision number
  • one of the major things couchdb gives you since it's document based is that the data is accurate at that point in time
    • if the data changes in the future, in an rdbms the old document would get the new data

CouchDB With Grails

  • domain class–id and _rev as properties
  • can add couchdb stuff to Config.groovy to do stuff like create-drop for couchdb databases
  • add stuff to BootStrap.groovy
  • showing CouchDBService that has convenience methods around a lot of the URL calls to couch

Map/Reduce

  • in sql you say select firstname, lastname from foo (this is map) where state = 'NE' (this is reduce)
  • map and reduce are stored in 2 separate javascript functions

CouchDB NoSQL Database Ready for Production Use – NYTimes.com

Two major enhancements to CouchDB make it 1.0-worthy, said Chris Anderson, the chief financial officer and a founder of Couchio. One is the fact that performance of the software has been greatly improved. The other is its ability to work on Microsoft Windows machines. A lot of work was also put into stabilization of the software. 

Performance-wise, the new version has demonstrated a 300 percent increase in speed in reads and writes, as judged by internal benchmarking tests done by Couchio. The performance improvements were gained by optimizing the code, Anderson said. 

This is also the first release of CouchDB that can fully run on Windows computers, either the servers or desktops, Anderson said. Previous versions could run on Linux, and there is a version being developed for the Google Android smartphone operating system.

I’m really stoked that I may finally get to use CouchDB on a “real” application soon, so this is great timing. Congrats to the Couchio guys and the fantastic community around CouchDB.

CouchDB basics for PHP developers

However, every once in a while, you work on a project
where you probably think to yourself, “Why am I doing all this work?” The
project you’re working on contains very simple bits of data or data that’s
difficult to predict — you might get different data fields on
different days or even from transaction to transaction. If you were to
create a schema to predict what’s coming down the pike at you, you’d end
up with tables that have lots of empty fields or lots of mapping tables.

This is an excellent intro to CouchDB even if you aren’t a PHP developer.