Incremental mapreduce

So Google has their MapReduce, and the people behind CouchDB are throwing around their ideas. I spent some time thinking about incremental mapreduce around July, and it's time I type out that page full of scribbles.

First of all: I think the ideas thrown out by Damien above aren't really mapreduce, As Google Intended. The real power of mapreduce is in its inherent combination of parallelism and chainability, output of one mapreduce is input to another, each processing step can run massively in parallel with each other, etc. The proposed design is like a one-iteration retarded cousin of mapreduce.

With that bashing now done (sorry), here's what I was thinking:

The way I imagined building an incremental mapreduce mechanism, without storing the intermediate data and just recomputing chunks that are out-of-date (which would be lame), is to add one extra concept into the system: call it "demap". It will basically create "negative entries" for the old data. This is basically what Damien did by providing both the old and new data map calls, all the time, just said differently, and I think my way might make the average call a lot simpler. And I don't see any reason why my version wouldn't be parallelizable, chainable, and generally yummy.

Read the article

Tags:
  • mapreduce
  • couchdb
  • parallel
  • cluster
  • programming
  • database
  • search
2007-11-24T03:48+02:00