Librarian Access
================

The librarian is a file storage service for Launchpad. Conceptually
similar to other file storage APIs like S3, it is used to store binary or
large content - bug attachments, package builds, images and so on.

Content in the librarian can be exposed at different URLs. To expose some
content use a LibraryFileAlias. Private content is supported as well: tokens
permit a client to access it for a limited time, and a token is emitted each
time a client attempts to dereference a private LibraryFileAlias.

Deployment notes
================

(These may seem a bit out of place - they are, but they need to be written
down somewhere, and the deployment choices inform the implementation choices).

The basics are simple: the librarian talks to clients. However, restricted
file access makes things a little more complex. As the librarian itself
doesn't do SSL processing, and we want restricted files to be kept
confidential, the librarian needs a hint from the SSL front end that SSL was
in fact used. The semi-standard header Front-End-Https can be used for this,
as long as we filter it out of incoming requests from clients.

setUp
-----

    >>> from canonical.database.sqlbase import session_store
    >>> from lp.services.librarian.model import TimeLimitedToken

High Level
----------

    >>> from StringIO import StringIO
    >>> from lp.services.librarian.interfaces import (
    ...     ILibraryFileAliasSet)
    >>> data = 'This is some data'

We can create LibraryFileAliases using the ILibraryFileAliasSet utility.
This name is a mouthful, but is consistent with the rest of our naming.

    >>> lfas = getUtility(ILibraryFileAliasSet)
    >>> from lp.services.librarian.interfaces import NEVER_EXPIRES
    >>> alias = lfas.create(
    ...     'text.txt', len(data), StringIO(data), 'text/plain', NEVER_EXPIRES
    ...     )
    >>> alias.mimetype
    u'text/plain'

    >>> alias.filename
    u'text.txt'

We may wish to set an expiry timestamp on the file. The NEVER_EXPIRES
constant means the file will never be removed from the Librarian, and
because of this should probably never be used.

    >>> alias.expires == NEVER_EXPIRES
    True

    >>> alias = lfas.create(
    ...     'text.txt', len(data), StringIO(data), 'text/plain')

The default expiry of None means the file will expire a few days after
it is no longer referenced in the database.

    >>> alias.expires is None
    True

The creation timestamp of the LibraryFileAlias is available in the
date_created attribute.

    >>> alias.date_created
    datetime.datetime(...)

We can retrieve the LibraryFileAlias we just created using its ID or sha1.

    >>> org_alias_id = alias.id
    >>> alias = lfas[org_alias_id]
    >>> alias.id == org_alias_id
    True

    >>> org_alias_id in [a.id for a in lfas.findBySHA1(alias.content.sha1)]
    True

We can get its URL too.

    >>> from lp.services.config import config
    >>> import re
    >>> re.search(
    ...     r'^%s\d+/text.txt$' % config.librarian.download_url,
    ...     alias.http_url
    ...     ) is not None
    True

The Librarian also serves the same file through https; we use this for
branding and similar inline-presented objects which would otherwise trigger
security warnings on https pages.

    >>> re.search(r'^https://.+/\d+/text.txt$', alias.https_url) is not None
    True

And we even have a convenient method which returns either the http URL
or the https one, depending on a config value.

    >>> config.vhosts.use_https
    False

    >>> re.search(
    ...     r'^%s\d+/text.txt$' % config.librarian.download_url,
    ...     alias.getURL()
    ...     ) is not None
    True

    >>> from textwrap import dedent
    >>> test_data = dedent("""
    ...     [librarian]
    ...     use_https: true
    ...     """)
    >>> config.push('test', test_data)
    >>> re.search(
    ...     r'^https://.+/\d+/text.txt$', alias.https_url
    ...     ) is not None
    True

However, we can force the use of HTTP by setting the 'HTTP_X_SCHEME'
header in the request to 'http', even when 'use_https' is True.

    >>> from zope.component import getMultiAdapter
    >>> from lp.services.webapp.servers import LaunchpadTestRequest
    >>> from urlparse import urlparse

    >>> request = LaunchpadTestRequest(
    ...     environ={'REQUEST_METHOD': 'GET', 'HTTP_X_SCHEME': 'http'})
    >>> view = getMultiAdapter((alias, request), name='+index')
    >>> view.initialize()
    >>> print urlparse(request.response.getHeader('Location'))[0]
    http

When the incoming scheme is 'https' then the redirect scheme is unaffected.

    >>> request = LaunchpadTestRequest(
    ...     environ={'REQUEST_METHOD': 'GET', 'HTTP_X_SCHEME': 'https'})
    >>> view = getMultiAdapter((alias, request), name='+index')
    >>> view.initialize()
    >>> print urlparse(request.response.getHeader('Location'))[0]
    https

Reset 'use_https' to its original state.

    >>> test_config_data = config.pop('test')

However, we can't access its contents until we have committed.

    >>> alias.open()
    Traceback (most recent call last):
    [...]
    LookupError: ...

Once we commit the transaction, LibraryFileAliases can be accessed like
files.

    >>> import transaction
    >>> transaction.commit()

    >>> alias.open()
    >>> alias.read()
    'This is some data'

    >>> alias.close()

We can also read it in chunks.

    >>> alias.open()
    >>> alias.read(2)
    'Th'

    >>> alias.read(6)
    'is is '

    >>> alias.read()
    'some data'

    >>> alias.close()

If you don't want to read the file in chunks you can neglect to call
open() and close().

    >>> alias.read()
    'This is some data'

Each alias also has an expiry date associated with it, the default of
None meaning the file will expire a few days after nothing references it
any more:

    >>> alias.expires is None
    True

Closing an alias repeatedly and/or without opening it beforehand is
tolerated and will not result in exceptions being raised.

    >>> alias.close()
    >>> alias.close()

Low Level
---------

We can also use the ILibrarianClient Utility directly to store and access
files in the Librarian.

    >>> from lp.services.librarian.interfaces.client import ILibrarianClient
    >>> client = getUtility(ILibrarianClient)
    >>> aid = client.addFile(
    ...     'text.txt', len(data), StringIO(data), 'text/plain', NEVER_EXPIRES
    ...     )
    >>> transaction.commit()
    >>> f = client.getFileByAlias(aid)
    >>> f.read()
    'This is some data'

    >>> url = client.getURLForAlias(aid)
    >>> re.search(
    ...     r'^%s\d+/text.txt$' % config.librarian.download_url, url
    ...     ) is not None
    True

When secure=True, the returned URL has the id as part of the domain name and
the protocol is https:

    >>> expected = r'^https://i%d\..+:\d+/%d/text.txt$' % (aid, aid)
    >>> found = client.getURLForAlias(aid, secure=True)
    >>> re.search(expected, found) is not None
    True

Librarian reads are logged in the request timeline.

    >>> from lazr.restful.utils import get_current_browser_request
    >>> from lp.services.timeline.requesttimeline import get_request_timeline
    >>> request = get_current_browser_request()
    >>> timeline = get_request_timeline(request)
    >>> f = client.getFileByAlias(aid)
    >>> action = timeline.actions[-1]
    >>> action.category
    'librarian-connection'

    >>> action.detail.endswith('/text.txt')
    True

    >>> _unused = f.read()
    >>> action = timeline.actions[-1]
    >>> action.category
    'librarian-read'

    >>> action.detail.endswith('/text.txt')
    True

At this level we can also reverse the transactional semantics by using
remoteAddFile instead of the addFile method. In this case, the database
rows are added by the Librarian, which means that the file is downloadable
immediately and will exist even if the client transaction rolls back.
However, the records in the database will not be visible to the client
until it begins a new transaction.

    >>> url = client.remoteAddFile(
    ...     'text.txt', len(data), StringIO(data), 'text/plain')
    >>> print url
    http://.../text.txt

    >>> from urllib2 import urlopen
    >>> urlopen(url).read()
    'This is some data'

If we abort the transaction, the file is still there:

    >>> transaction.abort()
    >>> urlopen(url).read()
    'This is some data'

You can also set the expiry date on the file this way:

    >>> from datetime import date, datetime
    >>> from pytz import utc
    >>> url = client.remoteAddFile(
    ...     'text.txt', len(data), StringIO(data), 'text/plain',
    ...     expires=datetime(2005,9,1,12,0,0, tzinfo=utc))
    >>> transaction.abort()

To check the expiry is set, we need to extract the alias id from the URL.
remoteAddFile deliberately returns the URL instead of the alias id because,
except for test cases, the URL is the only useful thing (the client can't
see the database records yet).

    >>> import re
    >>> match = re.search(r'/(\d+)/', url)
    >>> alias_id = int(match.group(1))
    >>> alias = lfas[alias_id]
    >>> print alias.expires.isoformat()
    2005-09-01T12:00:00+00:00

Restricted Librarian
--------------------

Some files should not be generally available publicly. If you know the URL,
any file can be retrieved directly from the librarian. For this reason, there
is a restricted librarian to which access is restricted (at the
system level). This means that only Launchpad has direct access to the host.
You use the IRestrictedLibrarianClient to access this librarian.

    >>> from zope.interface.verify import verifyObject
    >>> from lp.services.librarian.interfaces.client import (
    ...     IRestrictedLibrarianClient)
    >>> restricted_client = getUtility(IRestrictedLibrarianClient)
    >>> verifyObject(IRestrictedLibrarianClient, restricted_client)
    True

File aliases uploaded through the restricted librarian have the restricted
attribute set.

    >>> private_content = 'This is private data.'
    >>> private_file_id = restricted_client.addFile(
    ...     'private.txt', len(private_content), StringIO(private_content),
    ...     'text/plain')
    >>> file_alias = getUtility(ILibraryFileAliasSet)[private_file_id]
    >>> file_alias.restricted
    True

    >>> transaction.commit()
    >>> file_alias.open()
    >>> print file_alias.read()
    This is private data.

    >>> file_alias.close()

Restricted files are accessible with HTTP on a private domain.

    >>> print file_alias.http_url
    http://.../private.txt

    >>> file_alias.http_url.startswith(
    ...     config.librarian.restricted_download_url)
    True

They can also be accessed externally using a time-limited token appended to
their private_url. Possession of a token is sufficient to grant access to a
file, regardless of who is logged in. getURL can be asked to provide such a
token.

    >>> token_url = file_alias.getURL(include_token=True)
    >>> print token_url
    https://i...restricted.../private.txt?token=...

    >>> token_url.startswith('https://i%d.restricted.' % file_alias.id)
    True

    >>> private_path = TimeLimitedToken.url_to_token_path(
    ...     file_alias.private_url)
    >>> token_url.endswith(session_store().find(
    ...     TimeLimitedToken, path=private_path).any().token)
    True

LibraryFileAliasView doesn't work on restricted files. This is a temporary
measure until we're sure no restricted files leak into the traversal
hierarchy.

    >>> view = getMultiAdapter((file_alias, request), name='+index')
    >>> view.initialize()
    Traceback (most recent call last):
    ...
    AssertionError

If you try to retrieve this file through the standard ILibrarianClient,
you'll get a DownloadFailed error.

    >>> client.getFileByAlias(private_file_id)
    Traceback (most recent call last):
    ...
    DownloadFailed: Alias ... cannot be downloaded from this client.

    >>> client.getURLForAlias(private_file_id)
    Traceback (most recent call last):
    ...
    DownloadFailed: Alias ... cannot be downloaded from this client.

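
The separation just shown, where each client only serves aliases from its
own restriction space, can be modelled with a toy in-memory sketch. This is
an illustration only and not Launchpad code: ToyLibrarianClient,
get_file_by_alias and the alias table are hypothetical names, and only the
error message is borrowed from the output above.

```python
class DownloadFailed(Exception):
    """Raised when an alias lives outside the client's restriction space."""


class ToyLibrarianClient:
    """Hypothetical stand-in for a librarian client (assumption, not lp code)."""

    def __init__(self, restricted, aliases):
        self.restricted = restricted  # the restriction space this client serves
        self.aliases = aliases        # alias id -> (restricted flag, content)

    def get_file_by_alias(self, alias_id):
        is_restricted, content = self.aliases[alias_id]
        # Refuse any alias whose flag differs from the client's own space.
        if is_restricted != self.restricted:
            raise DownloadFailed(
                "Alias %d cannot be downloaded from this client." % alias_id)
        return content


toy_aliases = {
    1: (False, b"This is public data."),
    2: (True, b"This is private data."),
}
toy_public = ToyLibrarianClient(restricted=False, aliases=toy_aliases)
toy_restricted = ToyLibrarianClient(restricted=True, aliases=toy_aliases)

print(toy_restricted.get_file_by_alias(2))
try:
    toy_public.get_file_by_alias(2)
except DownloadFailed as error:
    print(error)
```
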

Retrieving the same file through the restricted librarian client works:

    >>> _unused = restricted_client.getFileByAlias(private_file_id)

    >>> file_url = restricted_client.getURLForAlias(private_file_id)
    >>> print file_url
    http://.../private.txt

Trying to access that file directly on the normal librarian (here, by
swapping the restricted download URL for the public one) will fail:

    >>> sneaky_url = file_url.replace(
    ...     config.librarian.restricted_download_url,
    ...     config.librarian.download_url)
    >>> urlopen(sneaky_url).read()
    Traceback (most recent call last):
    ...
    HTTPError: HTTP Error 404: Not Found

But downloading it from the restricted host will work.

    >>> print urlopen(file_url).read()
    This is private data.

Trying to retrieve a non-restricted file from the restricted librarian also
fails:

    >>> public_content = 'This is public data.'
    >>> public_file_id = getUtility(ILibrarianClient).addFile(
    ...     'public.txt', len(public_content), StringIO(public_content),
    ...     'text/plain')
    >>> file_alias = getUtility(ILibraryFileAliasSet)[public_file_id]
    >>> file_alias.restricted
    False

    >>> transaction.commit()

    >>> restricted_client.getURLForAlias(public_file_id)
    Traceback (most recent call last):
    ...
    DownloadFailed: ...

    >>> restricted_client.getFileByAlias(public_file_id)
    Traceback (most recent call last):
    ...
    DownloadFailed: ...

The remoteAddFile() method on the restricted client also creates a
restricted file:

    >>> url = restricted_client.remoteAddFile(
    ...     'another-private.txt', len(private_content),
    ...     StringIO(private_content), 'text/plain')
    >>> print url
    http://.../another-private.txt

    >>> url.startswith(config.librarian.restricted_download_url)
    True

The file can then immediately be retrieved:

    >>> print urlopen(url).read()
    This is private data.

Another way to create a restricted file is by using the restricted
parameter to ILibraryFileAliasSet:

    >>> restricted_file = getUtility(ILibraryFileAliasSet).create(
    ...     'yet-another-private.txt', len(private_content),
    ...     StringIO(private_content), 'text/plain', restricted=True)
    >>> restricted_file.restricted
    True

Even if one has the SHA1 of the file, searching the librarian for it will
only return the file if it is in the same restriction space.

So searching for the private content on the public librarian will fail:

    >>> transaction.commit()
    >>> search_query = "search?digest=%s" % restricted_file.content.sha1
    >>> print urlopen(config.librarian.download_url + search_query).read()
    0

But on the restricted server, this will work:

    >>> result = urlopen(
    ...     config.librarian.restricted_download_url + search_query).read()
    >>> result = result.splitlines()
    >>> print result[0]
    3

    >>> sorted(file_path.split('/')[1] for file_path in result[1:])
    ['another-private.txt', 'private.txt', 'yet-another-private.txt']

Odds and Sods
-------------

An UploadFailed will be raised if you try to create a file with no content.

    >>> client.addFile('test.txt', 0, StringIO('hello'), 'text/plain')
    Traceback (most recent call last):
    [...]
    UploadFailed: Invalid length: 0

If you really want a zero length file you can do it:

    >>> aid = client.addFile(
    ...     'test.txt', 0, StringIO(''), 'text/plain', allow_zero_length=True)
    >>> transaction.commit()
    >>> f = client.getFileByAlias(aid)
    >>> f.read()
    ''

An AssertionError will be raised if the number of bytes that could be
read from the file doesn't match the declared size.

    >>> client.addFile('test.txt', 42, StringIO(''), 'text/plain')
    Traceback (most recent call last):
    [...]
    AssertionError: size is 42, but 0 were read from the file

Filenames with spaces in them work.

    >>> aid = client.addFile(
    ...     'hot dog', len(data), StringIO(data), 'text/plain')
    >>> transaction.commit()
    >>> f = client.getFileByAlias(aid)
    >>> f.read()
    'This is some data'

    >>> url = client.getURLForAlias(aid)
    >>> re.search(r'/\d+/hot%20dog$', url) is not None
    True

Unicode file names work. Note that the filename in the resulting URL
is encoded as UTF-8.

    >>> aid = client.addFile(
    ...     u'Yow\N{INTERROBANG}', len(data), StringIO(data), 'text/plain')
    >>> transaction.commit()
    >>> f = client.getFileByAlias(aid)
    >>> f.read()
    'This is some data'

    >>> url = client.getURLForAlias(aid)
    >>> re.search(r'/\d+/Yow%E2%80%BD$', url) is not None
    True

Files will get garbage collected on production systems as per
LibrarianGarbageCollection. If you request the URL of a deleted file,
you will be given None.

    >>> alias = lfas[36]
    >>> alias.deleted
    True

    >>> alias.http_url is None
    True

    >>> alias.https_url is None
    True

    >>> alias.getURL() is None
    True

    >>> client.getURLForAlias(alias.id) is None
    True

Default View
------------

A librarian file has a default view that should redirect to the download
URL.

    >>> from zope.component import getMultiAdapter
    >>> from lp.services.webapp.servers import LaunchpadTestRequest
    >>> req = LaunchpadTestRequest()
    >>> alias = lfas.create(
    ...     'text2.txt', len(data), StringIO(data), 'text/plain',
    ...     NEVER_EXPIRES)
    >>> transaction.commit()
    >>> lfa_view = getMultiAdapter((alias, req), name='+index')
    >>> lfa_view.initialize()
    >>> req.response.getHeader("Location") == alias.getURL()
    True

File views setup
----------------

We need some files to test different ways of accessing them.

    >>> filename = 'public.txt'
    >>> content = 'PUBLIC'
    >>> public_file = getUtility(ILibraryFileAliasSet).create(
    ...     filename, len(content), StringIO(content), 'text/plain',
    ...     NEVER_EXPIRES, restricted=False)

    >>> filename = 'restricted.txt'
    >>> content = 'RESTRICTED'
    >>> restricted_file = getUtility(ILibraryFileAliasSet).create(
    ...     filename, len(content), StringIO(content), 'text/plain',
    ...     NEVER_EXPIRES, restricted=True)

    # Create a new LibraryFileAlias not referencing any LibraryFileContent
    # record. Such records are considered as being deleted.
    >>> from lp.services.librarian.model import LibraryFileAlias
    >>> from lp.services.webapp.interfaces import (
    ...     IStoreSelector, MAIN_STORE, MASTER_FLAVOR)

    >>> store = getUtility(IStoreSelector).get(MAIN_STORE, MASTER_FLAVOR)
    >>> deleted_file = LibraryFileAlias(
    ...     content=None, filename='deleted.txt', mimetype='text/plain')
    >>> ignore = store.add(deleted_file)

Commit the just-created files.

    >>> from canonical.database.sqlbase import commit
    >>> commit()

    >>> deleted_file = getUtility(ILibraryFileAliasSet)[deleted_file.id]
    >>> print deleted_file.deleted
    True

Clear out existing tokens.

    >>> _ = session_store().find(TimeLimitedToken).remove()

LibraryFileAliasMD5View
-----------------------

The MD5 summary for a file can be downloaded. The text file contains the
hash and file name.

    >>> view = create_view(public_file, '+md5')
    >>> print view.render()
    cd0c6092d6a6874f379fe4827ed1db8b public.txt

    >>> print view.request.response.getHeader('Content-type')
    text/plain

Download counts
---------------

The download counts for librarian files are stored in the
LibraryFileDownloadCount table, broken down by day and country, but
there's also a 'hits' attribute on ILibraryFileAlias, which holds the
total number of times that file has been downloaded.

The count starts at 0, and cannot be changed directly.

    >>> public_file.hits
    0

    >>> public_file.hits = 10
    Traceback (most recent call last):
    ...
    ForbiddenAttribute: ...

To change that, we have to use the updateDownloadCount() method, which
takes care of creating/updating the necessary LibraryFileDownloadCount
entries.

    >>> from lp.services.worlddata.interfaces.country import ICountrySet
    >>> country_set = getUtility(ICountrySet)
    >>> november_1st_2006 = date(2006, 11, 1)
    >>> brazil = country_set['BR']
    >>> public_file.updateDownloadCount(november_1st_2006, brazil, count=1)
    >>> public_file.hits
    1

This was the first hit for that file from Brazil on 1 November 2006, so a
new LibraryFileDownloadCount was created.

    >>> from lp.services.librarian.model import (
    ...     LibraryFileDownloadCount)
    >>> from storm.locals import Store
    >>> store = Store.of(public_file)
    >>> brazil_entry = store.find(
    ...     LibraryFileDownloadCount, libraryfilealias=public_file,
    ...     country=brazil, day=november_1st_2006).one()
    >>> brazil_entry.count
    1

Below we simulate a hit from Japan on that same day, which will also
create a new LibraryFileDownloadCount.

    >>> japan = country_set['JP']
    >>> public_file.updateDownloadCount(november_1st_2006, japan, count=3)
    >>> public_file.hits
    4

    >>> japan_entry = store.find(
    ...     LibraryFileDownloadCount, libraryfilealias=public_file,
    ...     country=japan, day=november_1st_2006).one()
    >>> japan_entry.count
    3

If there's another hit from Brazil on the same day, the existing entry
will be updated.

    >>> public_file.updateDownloadCount(november_1st_2006, brazil, count=2)
    >>> public_file.hits
    6

    >>> brazil_entry.count
    3

If the hit happens on a different day, a separate entry is created.

    >>> november_2nd_2006 = date(2006, 11, 2)
    >>> public_file.updateDownloadCount(november_2nd_2006, brazil, count=10)
    >>> public_file.hits
    16

    >>> brazil_entry2 = store.find(
    ...     LibraryFileDownloadCount, libraryfilealias=public_file,
    ...     country=brazil, day=november_2nd_2006).one()
    >>> brazil_entry2.count
    10

    >>> last_downloaded_date = november_2nd_2006

Time to last download
---------------------

The .last_downloaded property gives us the time delta from today to the day
that file was last downloaded, or None if it's never been downloaded.

    >>> today = datetime.now(utc).date()
    >>> public_file.last_downloaded == today - last_downloaded_date
    True

    >>> content = 'something'
    >>> brand_new_file = getUtility(ILibraryFileAliasSet).create(
    ...     'new.txt', len(content), StringIO(content), 'text/plain',
    ...     NEVER_EXPIRES, restricted=False)
    >>> print brand_new_file.last_downloaded
    None

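
The bookkeeping in the last two sections (one count per day and country, a
running total in 'hits', and a time delta back to the latest download day)
can be sketched without a database. This is a minimal in-memory model under
stated assumptions, not the LibraryFileDownloadCount implementation:
ToyDownloadCounter and its method names are hypothetical, and countries are
plain strings here rather than Country objects.

```python
from collections import defaultdict
from datetime import date


class ToyDownloadCounter:
    """Hypothetical model of per-day, per-country counts (not lp code)."""

    def __init__(self):
        self.counts = defaultdict(int)  # (day, country) -> count
        self.hits = 0                   # total over all entries

    def update_download_count(self, day, country, count):
        # An existing (day, country) entry is incremented; a new pair
        # implicitly creates a fresh entry starting from zero.
        self.counts[(day, country)] += count
        self.hits += count

    def last_downloaded(self, today):
        """Return today minus the latest download day, or None if unused."""
        if not self.counts:
            return None
        return today - max(day for day, _country in self.counts)


counter = ToyDownloadCounter()
nov_1 = date(2006, 11, 1)
nov_2 = date(2006, 11, 2)
counter.update_download_count(nov_1, "BR", 1)
counter.update_download_count(nov_1, "JP", 3)
counter.update_download_count(nov_1, "BR", 2)
counter.update_download_count(nov_2, "BR", 10)

print(counter.hits)                                      # 16
print(counter.counts[(nov_1, "BR")])                     # 3
print(counter.last_downloaded(date(2006, 11, 5)).days)   # 3
```

The sequence of calls mirrors the Brazil and Japan hits recorded above, so
the totals can be compared with the doctest output by eye.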