Tagsistant 0.7 is the new release of the Tagsistant semantic filesystem, expanding on the 0.6 design. It is very different from the 0.2 release and needs a quick explanation before you can use it proficiently.
While 0.2 allowed the tagging of files only, 0.7 manages any kind of object, including devices, directories and so on. Some syntactic sugar has been added to permit this, like the @/ directory at the end of queries (more on this later). Finally, a new deduplication feature reduces disk usage while preserving tags, and an internal caching layer speeds up the execution of queries that have already been resolved.
First of all, you need to mount Tagsistant somewhere. We'll use the ~/myfiles directory, but you can change it:
$ tagsistant ~/myfiles
Tagsistant (tagfs) v.0.7 Build: 20130323.000045 FUSE_USE_VERSION: 26
(c) 2006-2013 Tx0
For license informations, see ./tagsistant -h
Using default repository /home/tx0/.tagsistant
Using default plugin dir: /usr/local/lib/
By default Tagsistant places its internal repository in ~/.tagsistant. If you use just one Tagsistant filesystem, you can ignore this information. But if you plan to use more than one Tagsistant filesystem at the same time, remember to provide a separate repository for each mountpoint, using the --repository argument, as in:
$ tagsistant --repository=~/.photo ~/myphoto
$ tagsistant --repository=~/.music ~/mymusic
Another thing Tagsistant does by default is use SQLite. If you feel comfortable with SQLite or just don't know what that means, feel free to skip the rest of this section. If you would rather use MySQL, change the command line as follows:
$ tagsistant --db=mysql:host:database:user:password ~/myfiles
Of course you must create the database and the user inside MySQL before mounting Tagsistant.
You can omit the tokens after mysql, accepting the default values, but if you specify a token you must specify the tokens on its left too. So, if you just write --db=mysql, you are using the default values of localhost, tagsistant, tagsistant and tagsistant for host, database, user and password respectively.
This scheme gives you the flexibility to maintain just one standard tagsistant user inside MySQL with password tagsistant, but allowed to access many Tagsistant databases. To connect to the photos or music database, you just change the database name, as in: --db=mysql:localhost:photos or --db=mysql:localhost:music.
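The left-to-right rule for the --db tokens can be sketched as a small shell function. The name tagsistant_db_arg is made up for this example and is not part of Tagsistant; it only assembles the string described above and stops at the first empty token, since nothing to the right of a missing token can be given.

```shell
# Hypothetical helper (not shipped with Tagsistant): build a --db argument.
# Arguments, all optional and in this order: host database user password.
tagsistant_db_arg() {
    arg="--db=mysql"
    for token in "$@"; do
        # Stop at the first empty token: tokens on its right cannot be used.
        [ -n "$token" ] || break
        arg="$arg:$token"
    done
    printf '%s\n' "$arg"
}
```

For example, tagsistant_db_arg localhost photos prints --db=mysql:localhost:photos, which you can then pass to tagsistant together with the mountpoint.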
Now Tagsistant is managing the ~/myfiles directory. If you list its contents you'll find something similar to:
$ ls -l ~/myfiles
drwxr-xr-x 2 tx0 tx0 400K Mar 23 16:40 archive
drwxr-xr-x 2 tx0 tx0 400K Mar 23 16:40 relations
drwxr-xr-x 2 tx0 tx0 400K Mar 23 16:40 stats
drwxr-xr-x 2 tx0 tx0 400K Mar 23 16:40 tags
The four directories archive/, relations/, stats/ and tags/ are the interface you'll use to interact with Tagsistant. It's very important that you understand the meaning Tagsistant gives to each of them, since it differs from what you are used to with traditional filesystems and can be unexpected.
To make things easier, we'll start from the archive/ directory. This is where Tagsistant stores your files. It's there just to give you a quick way to access your files, but you should never need it under normal circumstances. It's also the only directory which behaves nearly like a usual directory, such as your home directory, with the only exception of being read-only. You can't modify its content, since Tagsistant expects to be the only one allowed to do so.
The stats/ directory reports some internal statistics and configuration. More on this later.
The tags/ directory is where Tagsistant allows you to create tags and to tag files, directories and other objects. Compared to the archive/ directory, this one is the most different, so we'll start by creating tags, which is the most intuitive operation. Remember that in Tagsistant a tag is just a directory created under the tags/ directory. Knowing this, all we have to do is use mkdir:
$ mkdir ~/myfiles/tags/video/
$ mkdir ~/myfiles/tags/scifi/
$ mkdir ~/myfiles/tags/startrek
What we have done here can be translated into English as: create a tag called video, a tag called scifi, and a tag called startrek. Remember that only directories under tags/ are considered tags and that tags can be created under the tags/ directory only.
Now we leave the tags/ directory to take a quick tour of another one, strictly related: the relations/ directory. This one is used to manage relations between tags. A relation always involves two tags and can be of two types:
For example, you can tell Tagsistant that scifi/ includes startrek/. To do it, you just need to:
$ mkdir ~/myfiles/relations/scifi/includes/startrek/
That's it. The first tag must already be present in the tags/ directory, but Tagsistant can create the second tag for you if it doesn't exist, as in:
$ mkdir ~/myfiles/relations/scifi/includes/starwars/
Now list the tags and you'll find the starwars/ tag:
$ ls ~/myfiles/tags/
scifi startrek starwars video
Let's wrap up what we have seen so far. We have created some tags (video, scifi, startrek and lastly starwars) and we have established two relations: scifi includes startrek and scifi includes starwars. Now we are ready to tag our files.
Tagging files happens when we copy files inside a tag directory. To test this, we use the movie "First Contact" and place it in the startrek tag-directory. The command is basically:
$ cp first_contact.avi ~/myfiles/tags/startrek/@/
You must have noticed the special @/ directory at the end. This is how you inform Tagsistant where the list of tags ends. Now let's check that our file is where we put it:
$ ls ~/myfiles/tags/startrek/@/
As you can see, we must always put the @/ element at the end of the tag list. To make the role of @/ a little clearer, we can use more than one tag with the next movie:
$ cp the_wrath_of_khan.avi ~/myfiles/tags/startrek/video/@
In English: tag the movie "The Wrath of Khan" as both startrek and video.
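The shape of a tagging path (mountpoint, then tags/, then one directory per tag, then the closing @/) can be sketched as a shell function. The name tag_query is an assumption made for this example, not a Tagsistant command; it only builds the path string.

```shell
# Hypothetical helper: build a Tagsistant path from a mountpoint and a tag list.
# Usage: tag_query MOUNTPOINT TAG [TAG...]
tag_query() {
    path="$1/tags"
    shift
    for tag in "$@"; do
        path="$path/$tag"
    done
    # The closing @/ tells Tagsistant where the list of tags ends.
    printf '%s/@/\n' "$path"
}
```

With it, the copy above could be written as cp the_wrath_of_khan.avi "$(tag_query ~/myfiles startrek video)".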
But wait! We taught Tagsistant that scifi/ includes startrek/, so let's check if our files are in the scifi/ tag too:
$ ls ~/myfiles/tags/scifi/@/
Yes! The files are in both directories: startrek and scifi. That's because Tagsistant has an internal reasoner which uses the relations you provide to include files not directly tagged. Now let's tag something else:
$ cp the_empire_strikes_back.avi ~/myfiles/tags/starwars/@/
$ ls ~/myfiles/tags/scifi/@/
first_contact.avi the_wrath_of_khan.avi the_empire_strikes_back.avi
No files are directly tagged as scifi/ but since it includes both startrek/ and starwars/ now it features three files. The real benefit of using relations is reducing the length of your queries while looking for your files. Another way to get the same set of results would be searching all the files tagged startrek/ and all the files tagged as starwars/, with this query:
$ ls ~/myfiles/tags/startrek/+/starwars/@/
first_contact.avi the_wrath_of_khan.avi the_empire_strikes_back.avi
As you can see, the query scifi/@/ is much shorter than the query startrek/+/starwars/@/.
What about the +/ directory in the middle of the path? It means: get the results of the first part of the query (startrek/) and merge them with the results of the second part of the query (starwars/). This is totally different from writing two tags without a +/ in between. In that case you are looking for files that carry both tags, as in:
$ ls ~/myfiles/tags/startrek/starwars/@/
which returns no results (hopefully).
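The difference between the two forms is the difference between a union and an intersection. The following is only a model in plain shell, using text lists instead of a mounted filesystem; the function names tag_union and tag_both are invented for the illustration, and each list is assumed to be non-empty and free of repeated names.

```shell
# Model only: each variable lists the files carrying one tag, one per line.
startrek='first_contact.avi
the_wrath_of_khan.avi'
starwars='the_empire_strikes_back.avi'

# Like startrek/+/starwars/@/ : files carrying either tag (union).
tag_union() {
    printf '%s\n%s\n' "$1" "$2" | sort -u
}

# Like startrek/starwars/@/ : files carrying both tags (intersection).
# A name is printed only if it appears in both lists.
tag_both() {
    printf '%s\n%s\n' "$1" "$2" | sort | uniq -d
}
```

Calling tag_union "$startrek" "$starwars" lists all three movies, while tag_both "$startrek" "$starwars" prints nothing.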
Let's wrap up the main concepts seen so far. Each directory under tags/ is a tag. If you copy a file into a tag path under the tags/ directory, it gets tagged. You can establish relations between tags using the relations/ directory. If tag A includes tag B, all the files tagged as B will show up in tags/A/@/.
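To get a feeling for what the reasoner does with includes relations, here is a toy model in plain shell. It is not Tagsistant's implementation: the relations variable and the expand_tag function are invented for this sketch, and it assumes the relations contain no cycles.

```shell
# Model only: each line "A B" means "tag A includes tag B".
relations='scifi startrek
scifi starwars'

# Print a tag followed by every tag it includes, directly or indirectly.
# Assumes there are no cycles in the relations.
expand_tag() {
    printf '%s\n' "$1"
    printf '%s\n' "$relations" | while read -r parent child; do
        if [ "$parent" = "$1" ]; then
            expand_tag "$child"
        fi
    done
}
```

Here expand_tag scifi prints scifi, startrek and starwars, one per line: these are the tags whose files you would expect to find under tags/scifi/@/.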
So far, so good. Now imagine that you tagged your mp3 library by band name and then organized the band tags by genre. In the end you included all the genre tags in music. Something like:
$ cp the_number_of_the_beast.mp3 ~/myfiles/tags/iron_maiden/@
[... other files too ...]
$ mkdir ~/myfiles/relations/heavy_metal/includes/iron_maiden/
[... other bands too ...]
$ mkdir ~/myfiles/relations/music/includes/heavy_metal/
[... other genres too ...]
$ ls ~/myfiles/tags/music/@
[... your whole library here ...]
Amazing! All your files in one place, without having to tag them as music one by one. You open your favorite music player, click on the_number_of_the_beast.mp3 and... the smooth timbre of a piano spreads in the room. What the hell happened to distorted guitars? Oh, sure, now you remember: that version is a tribute cover by a classical piano player. Better move it to ~/myfiles/tags/classical/piano/@:
$ mv ~/myfiles/tags/music/@/the_number_of_the_beast.mp3 ~/myfiles/tags/classical/piano/@
You give the move command and... the file is still there?!?!
Of course it is, because the reasoner knows that the music tag includes both the classical and piano tags too, so your file still features in the result of tags/music/@. But how can you know that the move (retagging) really happened? With the reasoner involved, you can't be sure.
The answer is: avoid the reasoner! If you end a query with the special @@/ marker, the reasoner doesn't get involved, so only files with an explicit match are returned. For example, if you list tags/music/@@/, no files are listed, since music is by itself totally empty. The very same happens if you list tags/heavy_metal/@@/. But if you list tags/iron_maiden/@@/, all your Iron Maiden songs are there.
The @@/ marker is usually applied to single tag queries, like tags/iron_maiden/@@/ to ease the retagging process.
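A retagging check based on @@/ could look like the sketch below. The function name directly_tagged is made up for this example, and it assumes that a path ending in @@/ can be tested with [ -e ] like any other path, which seems reasonable since it can be listed with ls, but is not stated explicitly here.

```shell
# Hypothetical helper: succeed only if FILE carries TAG explicitly,
# that is, without any help from the reasoner.
# Usage: directly_tagged MOUNTPOINT TAG FILE
directly_tagged() {
    [ -e "$1/tags/$2/@@/$3" ]
}
```

After the mv above, directly_tagged ~/myfiles iron_maiden the_number_of_the_beast.mp3 should fail, while the same check against the piano tag should succeed.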
Now let's take a look inside the archive/ directory:
$ ls ~/myfiles/archive/
The numbers you see at the beginning of each file are called tagsistant inodes and are a unique identifier Tagsistant applies to each file it manages. This is nothing you should care about since inodes are visible just inside the archive/ directory, but with one important exception. If two different files, copied inside two or more different tags, have the same name (such as avatar.jpg), Tagsistant will apply the inode in front of both of them when it must list both as the result of one single search, like in:
$ cp ~/myblog/avatar.jpg ~/myfiles/tags/blog/@
$ cp ~/movies/covers/avatar.jpg ~/myfiles/tags/pictures/@
$ ls ~/myfiles/tags/blog/+/pictures/@
[... some results ...]
[... more results ...]
This is the only way Tagsistant lets you distinguish two different files with the same name when both are selected by a query.
Now you may ask: what happens if I put the same file twice in two separate folders? Will Tagsistant create two copies of the same file? The answer is: as soon as Tagsistant notices that two files with the same contents have been created, it deletes the second one and applies its tag set to the first one. So if you do this:
$ mkdir ~/myfiles/tags/movies/
$ mkdir ~/myfiles/tags/startrek/
$ cp first_contact.avi ~/myfiles/tags/movies/@/
$ cp first_contact.avi ~/myfiles/tags/startrek/@/
right after the end of the second copy, Tagsistant will compare the contents of the two first_contact.avi files, conclude that the second is a duplicate of the first, delete the second and tag the first as startrek/ too. So you can now:
$ ls ~/myfiles/tags/startrek/movies/@/
Deduplication can currently be a bit rough and confusing for the user. If two identical files named A.jpg and B.jpg get copied into two directories called tag1/ and tag2/, Tagsistant will delete B.jpg and tag A.jpg as tag2/ too. So the content of B.jpg (being identical to A.jpg) really is available under tag2/ too, but as A.jpg! The file tag2/B.jpg seems to have disappeared. This is something Tagsistant will address in a future release.
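Until then, you can check in advance under which name a file would reappear. The sketch below is a stand-in, not Tagsistant's own deduplication code: it uses the standard cmp utility to compare contents byte by byte, and the function name find_duplicate is invented for the example.

```shell
# Hypothetical helper: print the name of the first regular file in DIR
# whose content is identical to FILE; fail if there is none.
# Usage: find_duplicate DIR FILE
find_duplicate() {
    for candidate in "$1"/*; do
        [ -f "$candidate" ] || continue
        if cmp -s "$candidate" "$2"; then
            basename "$candidate"
            return 0
        fi
    done
    return 1
}
```

Running find_duplicate ~/myfiles/tags/tag1/@ B.jpg before copying B.jpg would print A.jpg in the situation described above.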
Tagsistant features a stack of autotagging plugins based on libextractor. Thanks to libextractor's ability to extract metadata from a long list of file formats, Tagsistant is able to complement the user's tagging with some automatically provided information. Autotagging plugins are located in the src/plugins/ directory. Each plugin basically declares the mime-type it supports and sets a regular expression acting as a filter: if a key extracted by libextractor does not match it, that value is discarded and no tag is created. For example, a basic regular expression for the JPEG format could be "^(size|orientation)$" (which is actually the default one). Users can declare their preferred regular expressions in the repository.ini file, as in:
Versions 0.5, 0.6 and 1.x of libextractor are supported. The list of plugins available so far includes application/xml, image/gif, text/html, image/jpeg, image/png, application/ogg and audio/mpeg. Information about plugin writing is provided here.
Untagging a file or another object is as simple as deleting it. Don't worry: an object is actually deleted from Tagsistant only when it's removed from the last tag. As an example consider this situation:
$ cp /some/file.txt ~/myfiles/tags/docs/texts/@/
$ rm ~/myfiles/tags/docs/@/file.txt
Here file.txt has been untagged from docs but it's still recorded in the database and tagged as texts.
To delete a tag just remove it from the tags/ directory:
$ rmdir ~/myfiles/tags/useless_tag/
Please be careful: never remove anything recursively from the tags/ directory! Since tags can be combined in any permutation, a recursive deletion will explore all the available tags and delete everything inside. That basically means that your entire filesystem will be emptied! This is also the reason why deleting a tag from a file manager could prove to be impossible. Future releases of Tagsistant will introduce a new directory for tag management, where tags will appear without their content to ease tag deletion.
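One way to protect yourself from an accidental recursive removal is to wrap tag deletion in a function that only ever calls rmdir. This is a sketch under assumptions: remove_tag is a made-up name, and the check on the tag name is just a precaution added for the example.

```shell
# Hypothetical helper: delete a single tag with rmdir, never recursively.
# Usage: remove_tag MOUNTPOINT TAG
remove_tag() {
    case $2 in
        ''|*/*|.|..)
            # Refuse empty names, paths with slashes and the dot entries.
            echo "refusing suspicious tag name: '$2'" >&2
            return 1
            ;;
    esac
    rmdir "$1/tags/$2"
}
```

The example above then becomes remove_tag ~/myfiles useless_tag.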
The stats/ directory contains some special read-only files useful to get an idea of how Tagsistant is working. Let's see its content:
$ ls ~/myfiles/stats/
configuration connections objects relations tags
The configuration file contains the whole configuration Tagsistant is using, both the options chosen at compile time and those chosen at runtime:
$ cat ~/myfiles/stats/configuration
--> Command line options:
repository path: /home/tx0/.tagsistant
database options: mysql
run in foreground: 0
single threaded: 0
mount read-only: 0
[ ] boot
[ ] cache
[ ] file tree (readdir)
[ ] FUSE operations (open, read, write, symlink, ...)
[ ] low level
[ ] plugin
[ ] query parsing
[ ] reasoning
[ ] SQL queries
[ ] deduplication
--> Compile flags:
The objects, tags and relations files contain the total number of entities in the database:
$ cat ~/myfiles/stats/objects
# of objects: 1744
$ cat ~/myfiles/stats/tags
# of tags: 46
$ cat ~/myfiles/stats/relations
# of relations: 39
Finally, the connections file contains the total number of active database connections:
$ cat ~/myfiles/stats/connections
# of MySQL open connections: 1
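Since every counter file holds a single "# of ...: N" line, the number is easy to pull out in a script. The following is a minimal sketch that relies only on the format shown in the outputs above; the function name stat_value is an assumption of this example.

```shell
# Hypothetical helper: print the number stored in one of the stats/ files.
# Usage: stat_value MOUNTPOINT objects|tags|relations|connections
stat_value() {
    # Keep only the digits that follow the final ": " on a "# of ..." line.
    sed -n 's/^# of .*: *\([0-9][0-9]*\)$/\1/p' "$1/stats/$2"
}
```

For instance, stat_value ~/myfiles objects would print 1744 for the repository shown above.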
To unmount Tagsistant you can use the same command used for any other FUSE-based filesystem:
$ fusermount -u ~/myfiles
This command will kill the Tagsistant process and clear the mtab entry for you (if you don't understand, don't be scared, it's just stuff for the geeks).
While compiling Tagsistant you can choose to enable some experimental features by editing tagsistant.h in the src/ directory. The flags are those reported by the stats/configuration file in the compile flags section.
The only four flags you are supposed to tweak are:
Their purpose is to enable some caching layers that dramatically reduce the number of SQL queries performed. The second and the third are stable and don't consume too much memory, so they can safely be enabled in production. The first (the querytree cache) is quite experimental and can cause Tagsistant to exhaust memory while loading huge amounts of data.
The suggested configuration is:
Change it at your own risk.
In May 2013 I ran some tests on particular situations, like a chain of tag relations where t1 is included by t2, which is included by t3, and so on up to tN, with N being 40. The objects managed were 8352! The experiment showed a quick response time of 10.739s when Tagsistant was asked to:
$ ls ~/myfiles/tags/t1/t2/t3/t4/t5/t6/t7/t8/t9/t10/t11/t12/t13/t14/t15/t16/t17/t18/t19/t20/t21/t22/t23/t24/t25/t26/t27/t28/t29/t30/t31/t32/t33/t34/t35/t36/t37/t38/t39/@
After the first run, issuing the same query again got an answer in just 3.598s. This query is of course very suboptimal, since it gives the same results as ~/myfiles/tags/t39/@, but my goal was to test how Tagsistant behaves with a lot of tags in the same query. The query returned 937 files in total, which generated as many getattr (stat) calls to get the result data (size, owner, permissions).
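If you want to repeat a test like this, typing the path by hand is tedious. The sketch below builds it for you; chain_query is a name invented for this example, and it assumes your tags are called t1, t2 and so on, as in my test.

```shell
# Hypothetical helper: build the query tags/t1/t2/.../tN/@ under MOUNTPOINT.
# Usage: chain_query MOUNTPOINT N
chain_query() {
    path="$1/tags"
    i=1
    while [ "$i" -le "$2" ]; do
        path="$path/t$i"
        i=$((i + 1))
    done
    printf '%s/@\n' "$path"
}
```

The query above can then be timed with something like time ls "$(chain_query ~/myfiles 39)".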
I've successfully tested Tagsistant on repositories containing 100G of data.
I hope this quick introduction to Tagsistant 0.7 will be enough to let you experiment with the software and find it useful. If you have any comments, I would be very happy to hear from you.