Pythonic Filesystem APIs
aka Pythonic FS APIs part 2, the electric boogaloo
Story so far
My God, it's full of files
EuroPython 2008
Brief recap
Links in the abstract and in outline view.
- http://eagain.net/blog/2008/07/07/ep2008-pythonic-fs.html
- http://eagain.net/blog/2008/07/19/europython2008.html
- http://eagain.net/blog/2008/08/11/ep-talk-videos.html
- http://www.europython2008.eu/FilesystemSprint
So what's a filesystem
- well, it has files
- folders (or some such)
- open, read/write, close
- file handle, seek position
- unlink, rename, mmap?
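Those primitives all exist in today's stdlib already, just scattered; a minimal sketch exercising open/read/seek/close plus rename and unlink:

```python
import os
import tempfile

# Exercise the basic primitives with the current stdlib:
# open -> file handle, write, seek position, read, close,
# rename, unlink.
d = tempfile.mkdtemp()
path = os.path.join(d, "hello.txt")

f = open(path, "w")          # open for writing -> file handle
f.write("hello, world\n")
f.close()

f = open(path)               # reopen; seek position starts at 0
f.seek(7)                    # move the seek position
tail = f.read()              # read from there to EOF
f.close()

os.rename(path, os.path.join(d, "greeting.txt"))  # rename
os.unlink(os.path.join(d, "greeting.txt"))        # unlink
```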
Current Python API
Spread all over the place, built-ins and miscellaneous stdlib
- file(), open(), file objects
- os.listdir(), os.walk()
- os.path.exists(), os.chdir()
- os.unlink(), os.rename()
- stat, mmap
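To see the scatter, here is one mundane job that already touches the built-in open(), os, os.path, and stat in a handful of lines (stdlib only):

```python
import os
import os.path
import tempfile

# One small task, several namespaces: built-in open(),
# os.listdir(), os.path.exists(), os.stat().
d = tempfile.mkdtemp()
with open(os.path.join(d, "a.txt"), "w") as f:
    f.write("spam\n")

names = sorted(os.listdir(d))                       # os
exists = os.path.exists(os.path.join(d, "a.txt"))   # os.path
size = os.stat(os.path.join(d, "a.txt")).st_size    # stat via os
```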
Why replace?
- Fugly
- Non-native filesystems
- Mockability
- Fault injection
- Security (untrusted input)
- Nicer API
- New features
We sprinted
Thanks to everyone there!
We organized ourselves (a bit)
- https://github.com/tv42/fs (removed links that have bitrotted)
We wrote code
- basic POSIX filesystem access, including many corner cases
- in-memory virtual filesystem
- piles of tests: 800 + 1300 lines
We wrote more code
- multiplexing
- file-level copy on write
(not vouching for these personally, yet)
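"Multiplexing" here means routing subtrees of the namespace to different backends; a toy sketch of just the routing idea (class and method names are mine, not the sprint code):

```python
class MuxFS:
    """Toy multiplexer: dispatch each path to the backend
    mounted at its longest matching prefix. Illustrative only;
    the real sprint code is at the repo linked earlier."""

    def __init__(self, mounts):
        self.mounts = mounts  # {prefix: backend}

    def resolve(self, path):
        # Longest-prefix match wins, like mount points.
        best = max(
            (p for p in self.mounts if path.startswith(p)),
            key=len,
        )
        return self.mounts[best], path[len(best):]

mux = MuxFS({
    "/": {"etc/motd": "welcome\n"},
    "/mem/": {"scratch": "tmp data"},
})
```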
Then I wrote more
gitfs: 1700 + 2100 LoC
- about a year of production use
Examples
p = self.fs.join("path/to/myfile")
with p.open() as f:
    while True:
        data = f.read(8192)
        if not data:
            break
        yield data
Examples
p = filesystem.path("/my/safe/area")
p = p.child(user_input)
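The point of child() is that untrusted input can't escape the safe area. One way to get the same guarantee with only the stdlib; the helper name safe_child is mine:

```python
import os.path

def safe_child(base, name):
    """Join name onto base, refusing separators and traversal.

    Hypothetical helper sketching what a child() that rejects
    untrusted input might do; not part of any real API.
    """
    if os.sep in name or (os.altsep and os.altsep in name):
        raise ValueError("separator in child name: %r" % (name,))
    if name in (os.curdir, os.pardir):
        raise ValueError("relative path component: %r" % (name,))
    return os.path.join(base, name)
```

So safe_child('/my/safe/area', '../etc/passwd') raises instead of silently walking out of the sandbox.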
Examples
for p in path:
    print p.name(), p.size()
Examples
for (cur, dirs, files) in top.walk():
    dirs[:] = [
        d for d in dirs
        if d.name() not in IGNORE
    ]
    for p in files:
        print p, p.size()
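The same pruning idiom works with today's os.walk: mutate the dirnames list in place and the walk never descends into the pruned directories.

```python
import os
import tempfile

IGNORE = {".git", ".svn"}

# Build a tiny tree: top/.git/, top/src/main.py, top/README
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, ".git"))
os.mkdir(os.path.join(top, "src"))
open(os.path.join(top, "README"), "w").close()
open(os.path.join(top, "src", "main.py"), "w").close()

seen = []
for cur, dirs, files in os.walk(top):
    # In-place mutation is the documented way to prune os.walk.
    dirs[:] = [d for d in dirs if d not in IGNORE]
    for name in files:
        seen.append(name)
```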
Serialize atomically
import json
import os
def serialize(path, data):
    tmp_name = '%s.%d.tmp' % (
        path.name(),
        os.getpid(),
    )
    tmp = path.parent().child(tmp_name)
    with tmp.open('w') as f:
        json.dump(obj=data, fp=f)
    tmp.rename(path)
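The same write-to-temp-then-rename pattern with nothing but the stdlib, for comparison (function name is mine); rename() is atomic on POSIX, so readers see either the old file or the complete new one, never a partial write:

```python
import json
import os

def serialize_stdlib(path, data):
    """Write JSON atomically: dump to a temp name in the same
    directory, then rename over the target."""
    tmp = "%s.%d.tmp" % (path, os.getpid())
    with open(tmp, "w") as f:
        json.dump(data, f)
    os.rename(tmp, path)
```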
Unit tests
import fudge
import nose
import re
from fudge.inspector import arg
from serialize_json import serialize
@nose.with_setup(fudge.clear_expectations)
@fudge.with_fakes
def test_serialize():
    path = fudge.Fake('path')
    path.remember_order()
    path.expects('name').with_args().returns('quux.thud')
    path_parent = fudge.Fake('path_parent')
    path_parent.remember_order()
    path.expects('parent').with_args().returns(path_parent)
Unit tests
    path_tmp = fudge.Fake('path_tmp')
    path_tmp.remember_order()
    TMP_RE = re.compile(r'^quux\.thud\.\d+\.tmp$')
    path_parent.expects('child').with_args(
        arg.passes_test(TMP_RE.match),
    ).returns(path_tmp)
    file_tmp = fudge.Fake('file')
    file_tmp.remember_order()
    path_tmp.expects('open').with_args('w').returns(file_tmp)
    file_ctx = fudge.Fake('file')
    file_ctx.remember_order()
    file_tmp.expects('__enter__').with_args().returns(file_ctx)
Unit tests
    file_ctx.expects('write').with_args('{')
    file_ctx.next_call().with_args('"foo"')
    file_ctx.next_call().with_args(': ')
    file_ctx.next_call().with_args('"bar"')
    file_ctx.next_call().with_args('}')
    file_tmp.expects('__exit__').with_args(None, None, None)
    path_tmp.expects('rename').with_args(path)
    serialize(path, {'foo':'bar'})
Production use
#!/usr/bin/python
import filesystem
from serialize_json import serialize
serialize(
    filesystem.path('output.json'),
    {'answer': 42},
)
Look mom, no monkeypatching!
and no temp dirs either
Clumsy
So use convenience libs!
def test_wishful_thinking():
    p = fudgefs.FakeFS()
    tmp = p.file(regex=r'^quux\.thud\.\d+\.tmp$')
    tmp.writes('{"foo": "bar"}')
    p.rename(tmp, 'quux.thud')
    serialize(p, {'foo':'bar'})
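fudgefs above is wishful thinking. A comparable fake can be improvised with unittest.mock instead of fudge (a swapped-in technique, not what the talk used); the serialize() body is repeated inline so the sketch is self-contained:

```python
import json
import os
from unittest import mock

def serialize(path, data):
    # Same logic as the serialize() slide, repeated here
    # so this sketch stands alone.
    tmp_name = '%s.%d.tmp' % (path.name(), os.getpid())
    tmp = path.parent().child(tmp_name)
    with tmp.open('w') as f:
        json.dump(obj=data, fp=f)
    tmp.rename(path)

# MagicMock auto-creates children, including __enter__/__exit__,
# so the whole path/file object graph comes for free.
path = mock.MagicMock()
path.name.return_value = 'quux.thud'
tmp = path.parent.return_value.child.return_value
f = tmp.open.return_value.__enter__.return_value

serialize(path, {'foo': 'bar'})

# json.dump writes in chunks; reassemble what hit the fake file.
written = ''.join(c.args[0] for c in f.write.call_args_list)
```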
Transactions
repo = gitfs.repo.Repository(path)
with repo.transaction() as root:
    with root.child('greet').open('w') as f:
        f.write('hello, world\n')
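gitfs commits the transaction when the with-block exits cleanly and discards it on error. A generic sketch of that commit-on-success shape with contextlib (staging into a list stands in for the real tree; names are mine):

```python
import contextlib

@contextlib.contextmanager
def transaction(committed):
    """Commit-on-success sketch: buffer writes, publish them
    only if the with-block exits without an exception."""
    staged = []
    yield staged
    # Only reached on clean exit; an exception thrown into the
    # generator at the yield skips this "commit".
    committed.extend(staged)

log = []
with transaction(log) as root:
    root.append('hello, world\n')
```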
Brief overview of gitfs
+------------------------------------+
| gitfs.repo |
+---+----------------+---------------+
| | gitfs.readonly | gitfs.indexfs |
+---+----------------+---------------+
| gitfs.commands |
+------------------------------------+
| git plumbing |
+------------------------------------+
HDFS (vapor)
- Hadoop distributed filesystem
- (C to) JNI, needs the jars, yuck
pyhdfs = raw ctypes
hadoopfs = wrap to pythonic fs
HDFS example
fs = hadoopfs.HadoopFS('myserver')
with fs as root:
    big = root.child('big')
    with big.open('w') as f:
        for i in xrange(1000000):
            f.write('x' * 1000 + '\n')
Features
- base
- writable
- system
- snapshots
- transactions
- posix_owner
- posix_stat
- symlinks
- hardlinks
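One way to express those feature subsets is abc, as the "ABC?" item in the call to action suggests; a sketch with made-up class names:

```python
import abc

class BaseFS(abc.ABC):
    """Read-only core: every filesystem can at least open files."""
    @abc.abstractmethod
    def open(self, path):
        raise NotImplementedError

class WritableFS(BaseFS):
    """The 'writable' feature layered on top of the base."""
    @abc.abstractmethod
    def unlink(self, path):
        raise NotImplementedError

class MemFS(WritableFS):
    """Tiny in-memory filesystem implementing both features."""
    def __init__(self):
        self.files = {}
    def open(self, path):
        return self.files[path]
    def unlink(self, path):
        del self.files[path]

fs = MemFS()
fs.files['greet'] = 'hello'
```

Callers can then test isinstance(fs, WritableFS) instead of guessing which operations a backend supports.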
Call to action
- action()
- brains
- docs, sphinx (recipe!)
- features (ABC? z.i?)
- exceptions
- PEP? Py3k?
- your own fs!
Thank You
Questions? Opinions? Rants?
Find me today or tomorrow to talk more.
Slides etc up on eagain.net