Pythonic Filesystem APIs

aka Pythonic FS APIs part 2, the electric boogaloo

Story so far

My God, it's full of files

EuroPython 2008

Brief recap

Links in the abstract and in outline view.

So what's a filesystem?

  • well, it has files
  • folders (or some such)
  • open, read/write, close
  • file handle, seek position
  • unlink, rename, mmap?

Current Python API

Spread all over the place, built-ins and miscellaneous stdlib

  • file(), open(), file objects
  • os.listdir(), os.walk()
  • os.path.exists(), os.chdir()
  • os.unlink(), os.rename()
  • stat, mmap
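
To make "all over the place" concrete, here is a small sketch that touches a built-in plus three stdlib modules for one trivial job. `rotate` is a made-up helper for illustration, not something from the talk; every call in it is plain stdlib.

```python
import os
import tempfile

def rotate(dirname, name):
    """Rename name to name.old, dropping any previous .old copy."""
    cur = os.path.join(dirname, name)      # os.path for path arithmetic
    old = cur + '.old'
    if os.path.exists(old):                # os.path again, for queries
        os.unlink(old)                     # os for mutation
    if os.path.exists(cur):
        os.rename(cur, old)
    return sorted(os.listdir(dirname))     # os for listing

# Even a quick check needs a real directory on disk.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, 'log'), 'w') as f:   # built-in open()
    f.write('hello\n')
listing = rotate(tmpdir, 'log')
```

Note that nothing here can be pointed at a fake or non-native filesystem without patching `os` itself.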

Why replace?

  • Fugly
  • Non-native filesystems
  • Mockability
  • Fault injection
  • Security (untrusted input)
  • Nicer API
  • New features

We sprinted

Thanks to everyone there!

We organized ourselves (a bit)

We wrote code

  • basic POSIX filesystem access, including many corner cases
  • in-memory virtual filesystem
  • piles of tests: 800 + 1300 lines

We wrote more code

  • multiplexing
  • file-level copy on write

(not vouching for these personally, yet)

Then I wrote more

  • gitfs: 1700 + 2100 LoC
  • about a year of production use

Examples

p = self.fs.join("path/to/myfile")
with p.open() as f:
	while True:
		data = f.read(8192)
		if not data:
			break
		yield data

Examples

p = filesystem.path("/my/safe/area")
p = p.child(user_input)
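
Why `child()` helps with untrusted input: it can insist on exactly one plain path segment. A minimal sketch of that idea follows; `SafePath` and `InsecurePathError` are hypothetical names for illustration, and the real library's checks may differ.

```python
class InsecurePathError(Exception):
    """Raised when a segment could escape its parent (hypothetical name)."""

class SafePath(object):
    """Hypothetical stand-in for a path object, not the library's class."""

    def __init__(self, pathname):
        self._pathname = pathname

    def __str__(self):
        return self._pathname

    def child(self, segment):
        # Accept exactly one plain name: no separators, no '.' or '..'.
        if segment in ('', '.', '..') or '/' in segment or '\0' in segment:
            raise InsecurePathError('unsafe path segment: %r' % (segment,))
        return SafePath(self._pathname.rstrip('/') + '/' + segment)

p = SafePath('/my/safe/area')
ok = p.child('report.txt')
try:
    p.child('../../etc/passwd')
    escaped = True
except InsecurePathError:
    escaped = False
```

The caller never joins strings by hand, so there is no place to forget the check.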

Examples

for p in path:
	print p.name(), p.size()

Examples

for (cur, dirs, files) in top.walk():
	dirs[:] = [
		d for d in dirs
		if d.name() not in IGNORE
		]
	for p in files:
		print p, p.size()
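
For comparison, the same pruning idiom with stdlib `os.walk`, which works on plain strings. This is only a sketch: `sizes` and the contents of `IGNORE` are made up for illustration.

```python
import os
import tempfile

IGNORE = set(['.git', '.svn'])   # illustrative values

def sizes(top):
    """Yield (path, size) for files under top, skipping IGNORE directories."""
    for cur, dirs, files in os.walk(top):
        # Slice assignment edits the list os.walk handed us, so the
        # removed directories are never descended into.
        dirs[:] = [d for d in dirs if d not in IGNORE]
        for name in files:
            path = os.path.join(cur, name)
            yield path, os.path.getsize(path)

# Build a tiny tree: one wanted file, one file inside an ignored directory.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, '.git'))
for rel in ('a.txt', os.path.join('.git', 'config')):
    with open(os.path.join(top, rel), 'w') as f:
        f.write('12345')
found = sorted(sizes(top))
```

The shape is the same; the difference is that names must be re-joined to `cur` by hand before anything can be done with them.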

Serialize atomically

import json
import os

def serialize(path, data):
    tmp_name = '%s.%d.tmp' % (
        path.name(),
        os.getpid(),
        )
    tmp = path.parent().child(tmp_name)
    with tmp.open('w') as f:
        json.dump(obj=data, fp=f)
    tmp.rename(path)

Unit tests

import fudge
import nose
import re
from fudge.inspector import arg

from serialize_json import serialize

@nose.with_setup(fudge.clear_expectations)
@fudge.with_fakes
def test_serialize():
    path = fudge.Fake('path')
    path.remember_order()
    path.expects('name').with_args().returns('quux.thud')

    path_parent = fudge.Fake('path_parent')
    path_parent.remember_order()
    path.expects('parent').with_args().returns(path_parent)

Unit tests

    path_tmp = fudge.Fake('path_tmp')
    path_tmp.remember_order()
    TMP_RE = re.compile(r'^quux\.thud\.\d+\.tmp$')
    path_parent.expects('child').with_args(
        arg.passes_test(TMP_RE.match),
        ).returns(path_tmp)

    file_tmp = fudge.Fake('file')
    file_tmp.remember_order()
    path_tmp.expects('open').with_args('w').returns(file_tmp)

    file_ctx = fudge.Fake('file')
    file_ctx.remember_order()
    file_tmp.expects('__enter__').with_args().returns(file_ctx)

Unit tests

    file_ctx.expects('write').with_args('{')
    file_ctx.next_call().with_args('"foo"')
    file_ctx.next_call().with_args(': ')
    file_ctx.next_call().with_args('"bar"')
    file_ctx.next_call().with_args('}')

    file_tmp.expects('__exit__').with_args(None, None, None)

    path_tmp.expects('rename').with_args(path)

    serialize(path, {'foo':'bar'})

Production use

#!/usr/bin/python
import filesystem
from serialize_json import serialize

serialize(
    filesystem.path('output.json'),
    {'answer': 42},
    )

Look mom, no monkeypatching!

and no temp dirs either

Clumsy

So use convenience libs!

def test_wishful_thinking():
    p = fudgefs.FakeFS()
    tmp = p.file(regex=r'^foo\.bar\.\d+\.tmp$')
    tmp.writes('{"foo": "bar"}')
    p.rename(tmp, 'quux.thud')
    serialize(p, {'foo':'bar'})
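
Another option is a hand-rolled in-memory fake. The sketch below repeats `serialize` from the earlier slide so that it runs on its own; `MemPath` and `MemFile` are hypothetical and implement only the five methods `serialize` touches. This is not the sprint's in-memory filesystem.

```python
import json
import os

def serialize(path, data):
    # Same function as on the "Serialize atomically" slide.
    tmp_name = '%s.%d.tmp' % (path.name(), os.getpid())
    tmp = path.parent().child(tmp_name)
    with tmp.open('w') as f:
        json.dump(obj=data, fp=f)
    tmp.rename(path)

class MemFile(object):
    """Write-only file that stores its content in a dict on clean exit."""
    def __init__(self, files, key):
        self._files, self._key, self._chunks = files, key, []
    def write(self, data):
        self._chunks.append(data)
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._files[self._key] = ''.join(self._chunks)
        return False

class MemPath(object):
    """Hypothetical in-memory path; all files live in one shared dict."""
    def __init__(self, files, pathname):
        self._files, self._pathname = files, pathname
    def name(self):
        return self._pathname.rsplit('/', 1)[-1]
    def parent(self):
        if '/' not in self._pathname:
            return MemPath(self._files, '')
        return MemPath(self._files, self._pathname.rsplit('/', 1)[0])
    def child(self, segment):
        if not self._pathname:
            return MemPath(self._files, segment)
        return MemPath(self._files, self._pathname + '/' + segment)
    def open(self, mode='r'):
        if mode != 'w':
            raise NotImplementedError('this sketch only supports writing')
        return MemFile(self._files, self._pathname)
    def rename(self, new_path):
        self._files[new_path._pathname] = self._files.pop(self._pathname)

files = {}
serialize(MemPath(files, 'out/quux.thud'), {'foo': 'bar'})
```

Afterwards `files` holds only the final name, so the test can assert on the end state instead of on every `write` call.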

Transactions

repo = gitfs.repo.Repository(path)
with repo.transaction() as root:
	with root.child('greet').open('w') as f:
		f.write('hello, world\n')
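
The contract behind that `with` block, as a toy sketch: changes are staged privately and published only if the block exits cleanly. `DictRepo` is hypothetical and uses a dict in place of a git repository. It says nothing about how gitfs implements transactions.

```python
import contextlib

class DictRepo(object):
    """Hypothetical toy repository: a dict of filename -> content."""

    def __init__(self):
        self.committed = {}

    @contextlib.contextmanager
    def transaction(self):
        # Work on a private copy; publish it only if the block succeeds.
        staged = dict(self.committed)
        yield staged
        # Not reached when the with-block raises, so nothing is published.
        self.committed = staged

repo = DictRepo()
with repo.transaction() as root:
    root['greet'] = 'hello, world\n'

# A failing transaction leaves the committed state untouched.
try:
    with repo.transaction() as root:
        root['greet'] = 'oops'
        raise RuntimeError('abort')
except RuntimeError:
    pass
```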

Brief overview of gitfs

+------------------------------------+
|            gitfs.repo              |
+---+----------------+---------------+
|   | gitfs.readonly | gitfs.indexfs |
+---+----------------+---------------+
|          gitfs.commands            |
+------------------------------------+
|           git plumbing             |
+------------------------------------+

HDFS (vapor)

  • Hadoop distributed filesystem
  • (C to) JNI, needs the jars, yuck
  • pyhdfs = raw ctypes
  • hadoopfs = wrap to pythonic fs

HDFS example

fs = hadoopfs.HadoopFS('myserver')
with fs as root:
	big = root.child('big')
	with big.open('w') as f:
		for i in xrange(10**6):
			f.write('x'*1000+'\n')

Features

  • base
  • writable
  • system
  • snapshots
  • transactions
  • posix_owner
  • posix_stat
  • symlinks
  • hardlinks

Call to action

  • action()
  • brains
  • docs, sphinx (recipe!)
  • features (ABC? z.i?)
  • exceptions
  • PEP? Py3k?
  • your own fs!

Thank You

Questions? Opinions? Rants?

Find me today or tomorrow to talk more.

Slides etc up on eagain.net