Remote File Repositories

Nov. 10, 2003

Mark J. Norton



Some applications based on OKI will have the ability to examine one or more data sources for content material. These data sources fall into three main categories (though there may be others): repositories, web sites, and file systems. Any collection of content is likely to be large, making the task of finding specific material more difficult. For this reason, it is desirable to support a search function on data sources.

Most repositories, such as Fedora, have the ability to search for assets across their collections. Typically, searching is conducted on metadata associated with the assets. Different kinds of repositories are likely to format search queries in different ways, which must be taken into account.

Web servers based on HTTP do not have an intrinsic search capability. This lack was quickly addressed by the creation of search engines such as Yahoo!, Excite, AltaVista, Google, and many others. In recent years, some of these engines have been augmented with programmatic interfaces that allow web applications to use them directly. Google is the most advanced of these, offering a web services implementation. As such, Google can be used to search for content on the World Wide Web.

Finally, there is the matter of searching file systems. While most operating systems have a file search function (grep, Sherlock, etc.), these are not platform independent and are difficult to integrate into a Java application.

The File System as a Repository - Indexed Searching

If a portion of a remote filing system could be designated as relatively static (changing infrequently), then it could be treated like a digital repository. Files included in this repository would be registered and could (potentially) have metadata associated with them. These file/metadata pairs are then treated like assets in the repository. Registering a file would create an entry in a (TBD) database system, which then supports the use of SQL for search queries.
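The registration and search steps described above can be sketched in Java. This is only an illustration under stated assumptions: the class names FileAssetRegistry and FileAsset, the field names, and the in-memory list are all hypothetical stand-ins for the (TBD) database and are not part of any OKI OSID. The substring match is similar in spirit to a SQL LIKE '%term%' filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for the (TBD) database-backed registry.
final class FileAssetRegistry {

    // One registered asset: a file path paired with its metadata.
    static final class FileAsset {
        final String path;
        final Map<String, String> metadata;

        FileAsset(String path, Map<String, String> metadata) {
            this.path = path;
            // Copy the map so later changes by the caller do not alter the entry.
            this.metadata = new LinkedHashMap<>(metadata);
        }
    }

    private final List<FileAsset> assets = new ArrayList<>();

    // Registers a file/metadata pair and returns the stored asset.
    FileAsset register(String path, Map<String, String> metadata) {
        FileAsset asset = new FileAsset(path, metadata);
        assets.add(asset);
        return asset;
    }

    // Returns every asset whose metadata field contains the term,
    // ignoring case. Assets that lack the field never match.
    List<FileAsset> search(String field, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<FileAsset> hits = new ArrayList<>();
        for (FileAsset asset : assets) {
            String value = asset.metadata.get(field);
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(asset);
            }
        }
        return hits;
    }
}
```

A caller would register each file once, then call search("name", "syllabus") to find assets by file name. A real implementation would presumably replace the list with a database table and the loop with a SQL query.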

This approach needn’t be more complicated than scanning all files in a directory (including sub-directories) and indexing them. Metadata could be created automatically from the file name, owner, creation date, etc. While this is fairly limited, it would allow basic search capability.
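A minimal sketch of such a scan, using only the standard java.nio.file API, might look like the following. The class name FileIndexer and the metadata keys ("name", "path", "size", "created", "owner") are assumptions chosen for illustration. Note that creation time and owner are not reported reliably on every platform, so the sketch falls back to "unknown" for the owner when it cannot be read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper; not part of any OKI OSID.
final class FileIndexer {

    // Walks root recursively and returns one metadata map per regular file.
    static List<Map<String, String>> index(Path root) throws IOException {
        List<Map<String, String>> entries = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (!Files.isRegularFile(p)) {
                    continue; // skip directories and special files
                }
                BasicFileAttributes attrs =
                        Files.readAttributes(p, BasicFileAttributes.class);
                Map<String, String> meta = new LinkedHashMap<>();
                meta.put("name", p.getFileName().toString());
                meta.put("path", root.relativize(p).toString());
                meta.put("size", Long.toString(attrs.size()));
                meta.put("created", attrs.creationTime().toString());
                meta.put("owner", ownerOf(p));
                entries.add(meta);
            }
        }
        return entries;
    }

    // Returns the owner name, or "unknown" if the file system cannot report it.
    static String ownerOf(Path p) {
        try {
            return Files.getOwner(p).getName();
        } catch (IOException | UnsupportedOperationException e) {
            return "unknown";
        }
    }
}
```

Calling FileIndexer.index(Paths.get("/some/static/directory")) would return the metadata for every file beneath that directory; each map could then be registered as a file asset. The directory shown is a placeholder.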

Such a file-based repository would require the development of a database schema to represent file assets, a metadata representation, and the design of a search specification (XQuery might be a good choice). This approach has the advantage of being operating system independent, but it is also sensitive to changes in the repository corpus, since the index must be refreshed whenever files are added, changed, or removed. At a minimum, the following OSIDs would be required: dr, filing, shared, sql, and dbc.