Pythonic Filesystem APIs

aka Pythonic FS APIs part 2, the electric boogaloo

Story so far

My God, it's full of files

EuroPython 2008

So what's a filesystem

  • well, it has files
  • folders (or some such)
  • open, read/write, close
  • file handle, seek position
  • unlink, rename, mmap?

Current Python API

Spread all over the place, built-ins and miscellaneous stdlib

  • file(), open(), file objects
  • os.listdir(), os.walk()
  • os.path.exists(), os.chdir()
  • os.unlink(), os.rename()
  • stat, mmap
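
To make the scattering concrete, here is a small sketch of one round trip (create, list, rename, stat, delete) using only standard-library calls. The file names and the round_trip helper are made up for illustration.

```python
import os
import shutil
import tempfile

def round_trip():
    """Create, list, rename, stat and delete one file with the stdlib."""
    top = tempfile.mkdtemp()
    try:
        old = os.path.join(top, 'greet.txt')      # os.path: path arithmetic
        new = os.path.join(top, 'hello.txt')
        with open(old, 'wb') as f:                # built-in: open a file
            f.write(b'hello, world\n')
        before = os.listdir(top)                  # os: list a directory
        os.rename(old, new)                       # os: rename
        size = os.stat(new).st_size               # os: stat
        os.unlink(new)                            # os: delete
        return before, size, os.path.exists(new)  # os.path: existence check
    finally:
        shutil.rmtree(top)
```

One built-in and two modules for a single file's life cycle, and every call takes a plain string.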

Why replace?

  • Fugly
  • Non-native filesystems
  • Mockability
  • Fault injection
  • Security (untrusted input)
  • Nicer API
  • New features
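
The security point is easy to demonstrate with string-based paths. The sketch below uses posixpath so it behaves the same on any platform. SAFE_AREA and naive_join are illustrative names, not part of any library.

```python
import posixpath

SAFE_AREA = '/my/safe/area'

def naive_join(user_input):
    """Join untrusted input onto the safe area the way naive code often does."""
    return posixpath.normpath(posixpath.join(SAFE_AREA, user_input))

# An absolute path in the input silently discards the safe prefix.
escaped_abs = naive_join('/etc/passwd')
# Enough ".." segments climb out of the safe area.
escaped_rel = naive_join('../../../etc/passwd')
# Well-behaved input stays inside.
contained = naive_join('report.txt')
```

Both hostile inputs end up at /etc/passwd, while the harmless one stays under /my/safe/area.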

We sprinted

Thanks to everyone there!

We organized ourselves (a bit)

We wrote code

  • basic POSIX filesystem access, including many corner cases
  • in-memory virtual filesystem
  • piles of tests: 800 + 1300 lines

We wrote more code

  • multiplexing
  • file-level copy on write

(not vouching for these personally, yet)

Then I wrote more

  • gitfs: 1700 + 2100 LoC
  • about a year of production use

Examples

p = self.fs.join("path/to/myfile")
with p.open() as f:
    while True:
        data = f.read(8192)
        if not data:
            break
        yield data

Examples

p = filesystem.path("/my/safe/area")
p = p.child(user_input)
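
The slide does not show what child() does with hostile input. One plausible behaviour, assumed here rather than taken from the library, is to accept exactly one path segment and refuse everything else. SketchPath and UnsafeSegmentError are hypothetical names for this sketch only, and the checks are not exhaustive.

```python
import posixpath

class UnsafeSegmentError(Exception):
    """Hypothetical error type used only by this sketch."""

class SketchPath(object):
    """Toy path object; not the real library's class."""

    def __init__(self, pathname):
        self._pathname = pathname

    def __str__(self):
        return self._pathname

    def child(self, segment):
        # Accept exactly one ordinary path segment; refuse separators,
        # empty names and the special "." and ".." entries.
        if '/' in segment or segment in ('', '.', '..'):
            raise UnsafeSegmentError('unsafe path segment: %r' % (segment,))
        return SketchPath(posixpath.join(self._pathname, segment))
```

With a check like this, the caller cannot leave the safe area by accident: descending is the only operation child() offers.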

Examples

for p in path:
    print p.name(), p.size()

Examples

for (cur, dirs, files) in top.walk():
    dirs[:] = [
        d for d in dirs
        if d.name() not in IGNORE
        ]
    for p in files:
        print p, p.size()
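
For comparison, the same pruning idiom with os.walk from the standard library. The contents of IGNORE are an assumption (the slide does not define it), and walk_sizes is an illustrative name.

```python
import os
import shutil
import tempfile

# Assumed contents; the slide leaves IGNORE undefined.
IGNORE = frozenset(['.git', 'build'])

def walk_sizes(top):
    """Yield (path, size) for files under top, skipping IGNORE directories."""
    for cur, dirs, files in os.walk(top):
        # Slice assignment mutates the list os.walk will descend into, so
        # ignored directories are never visited.
        dirs[:] = [d for d in dirs if d not in IGNORE]
        for name in files:
            full = os.path.join(cur, name)
            yield full, os.path.getsize(full)
```

The difference is that os.walk hands back bare strings that must be joined by hand, while the path objects above carry name() and size() themselves.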

Serialize atomically

import json
import os

def serialize(path, data):
    tmp_name = '%s.%d.tmp' % (
        path.name(),
        os.getpid(),
        )
    tmp = path.parent().child(tmp_name)
    with tmp.open('w') as f:
        json.dump(obj=data, fp=f)
    tmp.rename(path)
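
The same write-to-temp-then-rename technique can be sketched with the standard library alone. serialize_stdlib is an illustrative name. The sketch relies on POSIX rename semantics (the rename replaces the target in one step) and does not call fsync, so it covers atomic visibility, not durability.

```python
import json
import os
import tempfile

def serialize_stdlib(filename, data):
    """Write data as JSON to filename via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(filename))
    # Create the temporary file next to the target so the rename stays on
    # one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(data, f)
        os.rename(tmp, filename)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        os.unlink(tmp)
        raise
```

Readers of filename see either the old content or the complete new content, never a half-written file.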

Unit tests

import fudge
import nose
import re
from fudge.inspector import arg

from serialize_json import serialize

@nose.with_setup(fudge.clear_expectations)
@fudge.with_fakes
def test_serialize():
    path = fudge.Fake('path')
    path.remember_order()
    path.expects('name').with_args().returns('quux.thud')

    path_parent = fudge.Fake('path_parent')
    path_parent.remember_order()
    path.expects('parent').with_args().returns(path_parent)

    path_tmp = fudge.Fake('path_tmp')
    path_tmp.remember_order()
    TMP_RE = re.compile(r'^quux\.thud\.\d+\.tmp$')
    path_parent.expects('child').with_args(
        arg.passes_test(TMP_RE.match),
        ).returns(path_tmp)

    file_tmp = fudge.Fake('file')
    file_tmp.remember_order()
    path_tmp.expects('open').with_args('w').returns(file_tmp)

    file_ctx = fudge.Fake('file_ctx')
    file_ctx.remember_order()
    file_tmp.expects('__enter__').with_args().returns(file_ctx)

    file_ctx.expects('write').with_args('{')
    file_ctx.next_call().with_args('"foo"')
    file_ctx.next_call().with_args(': ')
    file_ctx.next_call().with_args('"bar"')
    file_ctx.next_call().with_args('}')

    file_tmp.expects('__exit__').with_args(None, None, None)

    path_tmp.expects('rename').with_args(path)

    serialize(path, {'foo':'bar'})

Production use

#!/usr/bin/python
import filesystem
from serialize_json import serialize

serialize(
    filesystem.path('output.json'),
    {'answer': 42},
    )

Look mom, no monkeypatching!

and no temp dirs either

Clumsy

So use convenience libs!

def test_wishful_thinking():
    p = fudgefs.FakeFS()
    tmp = p.file(regex=r'^foo\.bar\.\d+\.tmp$')
    tmp.writes('{"foo": "bar"}')
    p.rename(tmp, 'quux.thud')
    serialize(p, {'foo':'bar'})
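
A convenience library is one route; a small hand-written in-memory path object is another. The sketch below runs the serialize() function from the earlier slide against a dict-backed fake. MemPath and MemFile are hypothetical helpers written for this example and implement only the five methods serialize() calls.

```python
import json
import os

def serialize(path, data):
    # Same logic as the serialize() shown earlier in the talk.
    tmp_name = '%s.%d.tmp' % (path.name(), os.getpid())
    tmp = path.parent().child(tmp_name)
    with tmp.open('w') as f:
        json.dump(obj=data, fp=f)
    tmp.rename(path)

class MemFile(object):
    """Write-only file object that stores its content when the block ends."""
    def __init__(self, store, key):
        self._store, self._key, self._chunks = store, key, []
    def write(self, text):
        self._chunks.append(text)
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        self._store[self._key] = ''.join(self._chunks)
        return False

class MemPath(object):
    """Hypothetical in-memory path: files live in a shared dict."""
    def __init__(self, store, segments=()):
        self._store, self._segments = store, tuple(segments)
    def name(self):
        return self._segments[-1]
    def parent(self):
        return MemPath(self._store, self._segments[:-1])
    def child(self, segment):
        return MemPath(self._store, self._segments + (segment,))
    def open(self, mode='r'):
        if mode != 'w':
            raise NotImplementedError('sketch only supports writing')
        return MemFile(self._store, self._segments)
    def rename(self, new):
        # Move the content to the new key, replacing whatever was there.
        self._store[new._segments] = self._store.pop(self._segments)

store = {}
serialize(MemPath(store).child('quux.thud'), {'foo': 'bar'})
```

Afterwards the dict holds a single entry for quux.thud and no leftover temporary file, which a test can assert on directly.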

Transactions

repo = gitfs.repo.Repository(path)
with repo.transaction() as root:
    with root.child('greet').open('w') as f:
        f.write('hello, world\n')
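
The slide does not spell out the transaction semantics. A common reading, assumed here, is all-or-nothing: changes become visible when the with block exits cleanly and are dropped if it raises. ToyStore is a hypothetical dict-backed stand-in, not gitfs.

```python
import contextlib

class ToyStore(object):
    """Hypothetical store with all-or-nothing updates; not gitfs."""

    def __init__(self):
        self.files = {}

    @contextlib.contextmanager
    def transaction(self):
        # Work on a private copy; publish it only if the block succeeds.
        working = dict(self.files)
        yield working
        self.files = working

store = ToyStore()
with store.transaction() as root:
    root['greet'] = 'hello, world\n'

try:
    with store.transaction() as root:
        root['greet'] = 'oops'
        raise RuntimeError('abort')
except RuntimeError:
    pass  # the failed transaction left store.files untouched
```

An exception inside the block propagates through the yield, so the assignment that publishes the working copy never runs.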

Brief overview of gitfs

gitfs.repo
  gitfs.readonly gitfs.indexfs
gitfs.commands
git plumbing

HDFS (vapor)

  • Hadoop distributed filesystem
  • (C to) JNI, needs the jars, yuck
  • pyhdfs = raw ctypes
  • hadoopfs = wrap to pythonic fs

HDFS example

fs = hadoopfs.HadoopFS('myserver')
with fs as root:
    big = root.child('big')
    with big.open('w') as f:
        for i in xrange(10**6):
            f.write('x'*1000+'\n')

Features

  • base
  • writable
  • system
  • snapshots
  • transactions
  • posix_owner
  • posix_stat
  • symlinks
  • hardlinks

Call to action

  • action()
  • brains
  • docs, sphinx (recipe!)
  • features (ABC? z.i?)
  • exceptions
  • PEP? Py3k?
  • your own fs!

Thank You

Questions? Opinions? Rants?

Find me today or tomorrow to talk more.

Slides etc up on eagain.net