This is a work-in-progress document about the next release. I'll try to explain my vision for the future and to take in every suggestion anyone wants to offer.
What Tagsistant should be able to do:
- be a personal semantic tool or a self contained semantic engine
- tag any object: file, directory, symlink, fifo, device, socket, ...
- provide a mechanism for extended querying, including merging more than one set of tags
- reason on tag relations, e.g. tag1 includes tag2, or tag3 is an alias of (is equivalent to) tag4
- autotag files using a stack of provided plugins able to parse the information the files naturally carry (tags from HTML and XML, Exif from JPEG, audio tags from MP3 and Ogg Vorbis, ...)
- be self contained: no external tools should be required to tag objects, query the db, manage tags and relations, ... This allows remote usage through NFS and Samba, and makes it easy to embed Tagsistant as a semantic engine on remote servers, like web servers, or inside existing applications, because it requires no bindings or APIs, just plain old libc functions like open(), readdir() and so on.
- be easy and intuitive for newcomers, right from the first use
- interact naturally with the shell (see the "=" escape problem when composing a query)
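To make the reasoning goal above more concrete, here is a minimal Python sketch of how a tag could be expanded through its relations before a search. It is only an illustration: the function name expand, the triple layout and the relation names "includes" and "is_equivalent" are assumptions made for this example, not Tagsistant's actual code or schema.

```python
def expand(tag, relations):
    """Return the set of tags that a search for `tag` would also match.

    relations is a list of (tag_a, relation, tag_b) triples, where relation
    is 'includes' or 'is_equivalent' (names assumed for this sketch).
    """
    seen, todo = set(), [tag]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        for a, rel, b in relations:
            if a == current:
                # Both relation kinds are followed from left to right.
                todo.append(b)
            elif rel == "is_equivalent" and b == current:
                # Aliases are symmetric, inclusion is not.
                todo.append(a)
    return seen


relations = [
    ("tag1", "includes", "tag2"),
    ("tag3", "is_equivalent", "tag4"),
]
print(sorted(expand("tag1", relations)))  # ['tag1', 'tag2']
print(sorted(expand("tag4", relations)))  # ['tag3', 'tag4']
print(sorted(expand("tag2", relations)))  # ['tag2']
```

Under these assumptions a search for tag1 also matches objects tagged tag2, but not the other way round, while tag3 and tag4 are interchangeable.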
Experience from the past
In the following examples I'll assume that ~/T02 is a Tagsistant 0.2 mountpoint while ~/T04 is a Tagsistant 0.4 mountpoint. Each has two tags called tag1 and tag2 already defined. Two files called ~/report.txt (containing "First report") and ~/meeting/report.txt (containing "Second report") will be used as test objects.
The filename duplication problem
We want to tag two separate files sharing the same name, let's say ~/report.txt and ~/meeting/report.txt.
$ cp ~/report.txt ~/T02/tag1/
$ cp ~/meeting/report.txt ~/T02/tag2/
$ cat ~/T02/tag1/report.txt
One of the greatest limitations of T02 was that it was impossible to have more than one file with the same name. Since the first cp created a file named report.txt and no two files can share a name inside a T02 filesystem, the second cp overwrites the content of the first file. This is something a Tagsistant power user can understand, but it is a complete violation of POSIX semantics and will undoubtedly be perceived as an error by the occasional user. In addition, even a Tagsistant power user can get annoyed by this behaviour, easily lose track of their files or, even worse, lose file content.
T04 tried to solve this problem by prepending a unique number to each object. So our report.txt files became 1___report.txt and 2___report.txt. This is something of a compromise between usability and features, but it fails to be compatible with established software, because an object ends up with a different name from the one it was created with:
$ cp ~/report.txt ~/T04/tags/tag1/@
$ cp ~/meeting/report.txt ~/T04/tags/tag2/@
$ ls ~/T04/tags/tag2/@
$ cat ~/T04/tags/tag1/@/1___report.txt
Pro: each file contains its own separate content.
Con: the user must locate the file after it has been created. That's impossible for software such as a file manager, which correctly expects to find a file at the path where it was created.
So, to mitigate this problem, a transparent alias layer has been introduced. The layer records the name under which a file was created and maps it to the real name. So:
$ cat ~/T04/tags/tag1/@/report.txt
still gets to the right content. But this layer is totally invisible to the user and is here just to help existing software. Surely a suboptimal solution.
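The alias layer can be pictured as a lookup table from the name a program used at creation time to the numbered name actually stored. The Python sketch below is only a toy model under assumed names (AliasLayer, create, resolve are made up for this example); it is not how T04 implements the layer, and the "___" separator is simply copied from the cat example above.

```python
class AliasLayer:
    """Toy model of an alias layer (hypothetical, for illustration only)."""

    def __init__(self):
        self.next_id = 1
        self.aliases = {}  # (query, name used at creation) -> stored name

    def create(self, query, name):
        # Store the object under a unique, numbered name.
        stored = f"{self.next_id}___{name}"
        self.next_id += 1
        # Remember the name the caller used inside this query.
        self.aliases[(query, name)] = stored
        return stored

    def resolve(self, query, name):
        # Fall back to the given name when no alias was recorded.
        return self.aliases.get((query, name), name)


layer = AliasLayer()
layer.create("tags/tag1/@", "report.txt")
layer.create("tags/tag2/@", "report.txt")
print(layer.resolve("tags/tag1/@", "report.txt"))  # 1___report.txt
print(layer.resolve("tags/tag2/@", "report.txt"))  # 2___report.txt
```

The sketch also shows why the approach is fragile: the mapping only helps along the exact path used at creation time, and the user never sees it.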
There's more than just plain files in the Universe
We want to tag a directory or a symlink to an external file.
$ mkdir ~/T02/real_dir
$ ls ~/T02/real_dir/
T02 does NOT support custom directories inside its filesystem because each directory is supposed to be a tag.
$ mkdir ~/T04/tags/tag1/@/real_dir
$ ls ~/T04/tags/tag1/@/
T04 is somewhat of a step forward because it allows the creation of custom directories, but with one limitation. Since Tagsistant can't distinguish between a tag-dir and a real-dir, the "=" sign has been introduced to mark the end of the query. I'm pretty sure the "@" marker can't be avoided. Even if we did a heavy analysis of each directory of a query, checking whether it is a tag until a non-tag directory is found (and with thousands of tags this can be very expensive), we would still run into another problem: a tag-directory and a real-directory couldn't share the same name.
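A short Python sketch may help show why an explicit end-of-query marker is cheap and unambiguous. The function split_query is hypothetical and only mimics the idea using the "@" marker from the examples: everything before the marker is a tag, everything after it is a real path, and no tag lookup is needed to tell them apart.

```python
def split_query(path, marker="@"):
    """Split a path into (tags, real_path) at the first marker component."""
    parts = [p for p in path.split("/") if p]
    if marker not in parts:
        # Query not terminated yet: everything seen so far is a tag.
        return parts, None
    end = parts.index(marker)
    return parts[:end], "/".join(parts[end + 1:])


print(split_query("tag1/tag2/@/real_dir/notes.txt"))
# (['tag1', 'tag2'], 'real_dir/notes.txt')
print(split_query("tag1/@/tag1"))
# (['tag1'], 'tag1')
print(split_query("tag1/tag2"))
# (['tag1', 'tag2'], None)
```

The second call is the case that a marker-less design could not handle: a real directory named tag1 living inside the results of the tag1 query.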
How Tagsistant will do it
The directory scheme
Tagsistant 0.6 will have a slightly expanded scheme of directories:
- archive/ contains the tagged objects under their real names, prefixed with the object ID only when that's necessary to disambiguate two or more objects
- tags/ contains the available tags in a flat list, no subdir, no more tagging here
- tagging/ is where objects can be tagged with known tools (cp, mv, ln, file managers) using the simple syntax tagging/tag1/tag2/tag3/@
- search/ is where files can be looked up, using the complex syntax search/tag1/tag2/+/tag1/tag3/+/tag6/@/
- relations/ is where tag relations can be managed in a three-level hierarchy like relations/tag1/relation_type/tag2/
- stats/ contains just some statistics on filesystem usage, tagged objects, number of tags and so on (planned in 0.4 but not implemented so far)
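As an illustration of the archive/ naming rule, the Python sketch below prefixes the object ID only when two or more objects share a name. The function archive_names, the (id, name) tuples and the "___" separator are assumptions made for this example; the real naming code may well differ.

```python
from collections import Counter


def archive_names(objects, sep="___"):
    """objects is a list of (object_id, name); prefix the ID only on clashes."""
    counts = Counter(name for _, name in objects)
    return [
        f"{oid}{sep}{name}" if counts[name] > 1 else name
        for oid, name in objects
    ]


objects = [(1, "report.txt"), (2, "report.txt"), (3, "back_in_black.mp3")]
print(archive_names(objects))
# ['1___report.txt', '2___report.txt', 'back_in_black.mp3']
```

Unique names stay untouched, so most objects keep exactly the name they were created with.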
In this cookbook I'll assume that a Tagsistant 0.6 filesystem has been mounted in a directory called ~/snap.
Create a tag called rock
$ mkdir ~/snap/tags/rock
If the tag exists, EEXIST is returned.
Tag a file as rock
$ cp ~/Music/back_in_black.mp3 ~/snap/tagging/rock/@
$ ln -s ~/Music/back_in_black.mp3 ~/snap/tagging/rock/@
If an object named back_in_black.mp3 already exists in the query results, EEXIST is returned, otherwise it is created.
Retag the same file as AC-DC
$ mkdir ~/snap/tags/AC-DC/
$ cp ~/Music/back_in_black.mp3 ~/snap/tagging/AC-DC/@
This creates a second copy of back_in_black.mp3 with the AC-DC tag on it. Tagsistant will deduplicate the file later, comparing the SHA1 signature of the files. If both have the same signature, the second copy will be deleted and its tag set will be applied to the first copy.
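The deduplication step could look roughly like the following Python sketch. It assumes objects are held in memory as dictionaries with name, data and tags fields, which is purely an illustration device; only the idea of comparing SHA1 signatures and merging the tag sets comes from the description above.

```python
import hashlib


def sha1_of(data):
    """Hex SHA1 digest of a bytes object."""
    return hashlib.sha1(data).hexdigest()


def deduplicate(objects):
    """Keep the first object for each SHA1 digest and merge in the tags
    of every later object with the same content."""
    first_by_digest = {}
    for obj in objects:
        digest = sha1_of(obj["data"])
        keeper = first_by_digest.get(digest)
        if keeper is None:
            # First time this content is seen: keep a copy of the object.
            first_by_digest[digest] = {
                "name": obj["name"],
                "data": obj["data"],
                "tags": set(obj["tags"]),
            }
        else:
            # Same content already stored: drop the copy, keep its tags.
            keeper["tags"] |= set(obj["tags"])
    return list(first_by_digest.values())


objects = [
    {"name": "back_in_black.mp3", "data": b"fake mp3 bytes", "tags": {"rock"}},
    {"name": "back_in_black.mp3", "data": b"fake mp3 bytes", "tags": {"AC-DC"}},
]
survivors = deduplicate(objects)
print(len(survivors), sorted(survivors[0]["tags"]))  # 1 ['AC-DC', 'rock']
```

Two files with the same name but different content would produce different digests and both would survive.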
Look for all rock tagged files also tagged as AC-DC or Metallica
$ ls ~/snap/search/rock/AC-DC/+/rock/Metallica/@
Note that the rock tag appears twice. As a concise logical expression this would be rock AND (AC-DC OR Metallica), but that would require adding parentheses to the Tagsistant query language, which would make it a little harder to use in the shell. It is something we can work on.
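The query above reads as a union of intersections: tags between two "+" separators must all be present, and a file matches if it satisfies at least one such group. The Python sketch below evaluates a query that way; parse_query, search and the sample file names other than back_in_black.mp3 are made up for this example and say nothing about how Tagsistant resolves queries internally.

```python
def parse_query(query, marker="@", union="+"):
    """Turn 'rock/AC-DC/+/rock/Metallica/@' into a list of tag groups."""
    groups, current = [], []
    for part in query.strip("/").split("/"):
        if part == marker:
            break  # end of query
        if part == union:
            groups.append(current)
            current = []
        else:
            current.append(part)
    groups.append(current)
    return [g for g in groups if g]


def search(query, tagged):
    """tagged maps a file name to its set of tags."""
    groups = parse_query(query)
    return sorted(
        name for name, tags in tagged.items()
        if any(set(group) <= tags for group in groups)
    )


tagged = {
    "back_in_black.mp3": {"rock", "AC-DC"},
    "one.mp3": {"rock", "Metallica"},
    "so_what.mp3": {"jazz"},
}
print(parse_query("rock/AC-DC/+/rock/Metallica/@"))
# [['rock', 'AC-DC'], ['rock', 'Metallica']]
print(search("rock/AC-DC/+/rock/Metallica/@", tagged))
# ['back_in_black.mp3', 'one.mp3']
```

Keeping the language to plain path components and "+" avoids characters such as parentheses, which the shell would require quoting.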
Look for all the tags assigned to a file
$ cat ~/snap/tagged/filename.txt
This is something still to be refined. If the file is an MP3, opening it like a text file to see its content sounds a little strange.