What's next?

Tx0

June 06, 2017

What will feature in the next release of Tagsistant? This page is here to share some ideas with you.

1. Deduplication improvements

Deduplication saves a great deal of disk space, but the current Tagsistant implementation has one main flaw. When two or more files are created with the same content but different filenames, deduplication deletes all the files except the first and moves all their tags to the first file. This can be confusing for the user. This is the scenario:

$ cp fileA.txt ~/myfiles/tags/t1/@
$ cp fileB.txt ~/myfiles/tags/t2/@
[ ... deduplication happens, fileB.txt gets deleted
  and fileA.txt gets tagged as t2 too ... ]
$ ls ~/myfiles/tags/t2/@
fileA.txt

The user would expect to find fileB.txt, but fileA.txt is returned instead! To solve this issue, an abstraction layer must be inserted between the inode which identifies a file and the names which point to that inode from inside different directories.
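The idea can be sketched as a small model in which names are links to an inode rather than properties of it. This is only an illustration under stated assumptions: the Repository class, its fields and the SHA-256 checksum are hypothetical, and deduplication happens at store time here, while Tagsistant deduplicates after the file is written.

```python
import hashlib


class Repository:
    """Toy model: one inode per distinct content, many (tag, name) links."""

    def __init__(self):
        self.inode_by_checksum = {}  # content checksum -> inode
        self.links = {}              # (tag, name) -> inode
        self._next_inode = 1

    def store(self, tag, name, content):
        checksum = hashlib.sha256(content).hexdigest()
        inode = self.inode_by_checksum.get(checksum)
        if inode is None:
            # First time this content is seen: allocate a new inode.
            inode = self._next_inode
            self._next_inode += 1
            self.inode_by_checksum[checksum] = inode
        # The name is kept as a link, even when the content is a duplicate.
        self.links[(tag, name)] = inode
        return inode

    def ls(self, tag):
        return sorted(name for (t, name) in self.links if t == tag)


repo = Repository()
first = repo.store("t1", "fileA.txt", b"same content")
second = repo.store("t2", "fileB.txt", b"same content")
print(first == second, repo.ls("t2"))
```

With this layout the content is stored once, yet listing t2 still returns fileB.txt, which is what the user expects.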

2. Machine (or faceted) tags (since revision r446)

Tagsistant’s tag namespace is currently flat. If you want to tag something like “year 2013”, your only option is to create a tag called year_2013 or simply 2013. This quickly leads to an overcrowded namespace: the tags/ directory gets hogged by thousands of tags which can hardly be managed or understood.

Machine tags, or faceted tags, are a more complex kind of tag formed by a namespace, a predicate and a value. In plain writing, as done on many online social tagging platforms, the “year 2013” concept would translate to “time:year=2013”, which stands for a tag in the time namespace referring to the year 2013.

A possible Tagsistant implementation would use three adjacent directories:

myfiles/tags/time:/year/2013/@/...
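The mapping from the plain notation to the three directories could look like the following sketch. The function name machine_tag_to_path is hypothetical, and the sketch assumes the namespace directory keeps its trailing colon, as in the path above.

```python
def machine_tag_to_path(tag):
    """Turn 'namespace:predicate=value' into 'namespace:/predicate/value'."""
    namespace, colon, rest = tag.partition(":")
    predicate, equals, value = rest.partition("=")
    if not (colon and equals and namespace and predicate and value):
        raise ValueError(f"not a machine tag: {tag!r}")
    return f"{namespace}:/{predicate}/{value}"


print(machine_tag_to_path("time:year=2013"))
```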

3. Archive/ optimization (since revision r454)

Currently Tagsistant saves the objects in one single directory called archive/ inside its repository. It’s a known fact that filesystems can become slow when scanning a directory containing a lot of files (tens of thousands or more). The most widespread countermeasure is to split the directory contents into subdirectories, using a defined schema, like using the characters of a hash signature:

Filename by MD5 signature: 95a5a06e937cbbb554fd483e62c1c156.data
File path on disk: 9/5/a/95a5a06e937cbbb554fd483e62c1c156.data

Tagsistant could use the reverse of the inode:

Inode based filename: 3429___document.xml
File path on disk: archive/9/2/4/3429___document.xml

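Using the reverse of the inode avoids distribution imbalance: assuming inodes are assigned sequentially, their leading digits change slowly while their trailing digits cycle evenly, so the reversed inode spreads objects evenly across the subdirectories.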
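A minimal sketch of the path computation, reproducing the example above. The function name archive_path, the depth parameter and the zero padding for short inodes are assumptions made for illustration, not Tagsistant's actual code.

```python
def archive_path(inode, filename, depth=3):
    """Build archive/<d1>/<d2>/<d3>/<inode>___<filename> from the reversed inode."""
    reversed_digits = str(inode)[::-1]
    # Assumption: inodes with fewer digits than `depth` are padded with zeros.
    levels = reversed_digits.ljust(depth, "0")[:depth]
    return "/".join(["archive", *levels, f"{inode}___{filename}"])


print(archive_path(3429, "document.xml"))
```

With 1000 consecutive inodes starting at 1000, the first directory level is the last digit of the inode, so each of the ten first-level directories receives 100 objects. Using the leading digit instead would place all of them under 1/.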

4. Selectively disabling the reasoner (since revision r399)

Sometimes disabling the reasoner could be very useful, especially when doing housekeeping or retagging. Let’s say you have tagged all your music collection by band names and then set relations from music to include all the band names. When you list music contents:

$ ls myfiles/tags/music/@

you’ll end up with a lot of objects which are not directly tagged as “music”. You have no way to check whether a file is tagged as music itself or only by its band name tag.

Disabling reasoning could be done with a special query ending mark, like “@@”. In:

$ ls myfiles/tags/music/@@

only files tagged as “music” are shown, excluding the files tagged by band names.
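The difference between the two ending marks can be modelled as follows. This is a simplified sketch: the resolve function, the in-memory dictionaries and the file and band names are hypothetical stand-ins for Tagsistant's SQL-backed reasoner.

```python
def resolve(tag, tagged, includes, reasoning=True):
    """Return the objects matching `tag`; follow includes only when reasoning."""
    wanted = {tag}
    if reasoning:
        stack = [tag]
        while stack:
            for child in includes.get(stack.pop(), ()):
                if child not in wanted:
                    wanted.add(child)
                    stack.append(child)
    return sorted(obj for obj, tags in tagged.items() if tags & wanted)


tagged = {
    "intro.mp3": {"music"},
    "song1.mp3": {"band_a"},
    "song2.mp3": {"band_b"},
}
includes = {"music": ["band_a", "band_b"]}

print(resolve("music", tagged, includes))                   # like music/@
print(resolve("music", tagged, includes, reasoning=False))  # like music/@@
```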

5. Memory management

Since Tagsistant 0.6 introduced a caching layer for tagsistant_querytree objects, its memory footprint can grow at runtime. Currently the only way to free long-unused tagsistant_querytree objects is unmounting and mounting again, but that’s suboptimal since it wipes out the whole cache, losing useful tagsistant_querytree objects too, and forces an interruption in filesystem availability.

A proper memory management policy would delete only long-unused objects, keeping the allocation profile as low as possible without sacrificing performance.
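One possible policy is idle-time eviction: each cached object remembers when it was last used, and a periodic sweep drops only those left untouched for too long. The sketch below is an assumption-laden illustration in Python (Tagsistant itself is written in C); the QueryTreeCache class and its methods are hypothetical.

```python
import time
from collections import OrderedDict


class QueryTreeCache:
    """Cache whose entries are ordered from least to most recently used."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (value, last_used)
        self._max_idle = max_idle_seconds
        self._clock = clock

    def put(self, key, value):
        self._entries[key] = (value, self._clock())
        self._entries.move_to_end(key)

    def get(self, key):
        if key not in self._entries:
            return None
        value, _ = self._entries[key]
        self.put(key, value)  # refresh the last-used timestamp
        return value

    def evict_idle(self):
        """Drop entries idle for longer than max_idle_seconds; return the count."""
        now = self._clock()
        evicted = 0
        while self._entries:
            key, (_, last_used) = next(iter(self._entries.items()))
            if now - last_used <= self._max_idle:
                break  # every remaining entry was used more recently
            del self._entries[key]
            evicted += 1
        return evicted
```

Because entries stay ordered by last use, the sweep stops at the first entry that is still fresh, so recently used objects are never scanned or lost.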

6. Lookup all the tags applied to an object (since revision r499)

A useful feature Tagsistant lacks is a way to list all the tags applied to a given object. This could be implemented as a new tagged/ branch as in:

$ ls ~/myfiles/tagged/filename.txt
tag1
tag2
tag3
...

But this would be very confusing to the user because directories would be presented as files. This is the scenario:

$ mkdir ~/myfiles/store/tag1/@/real_dir
$ cat ~/myfiles/tagged/real_dir
tag1

real_dir is presented as a file inside tagged/ because the user must be able to read its content (the list of tags) but since it’s a directory, this could be confusing. This would also list all the files in a flat list, which is undesirable.

A better approach could be a hidden suffix like .tags which, when appended to an object name, would refer to a flat file containing the tags applied to the object. This is the scenario:

$ cp ~/story.txt ~/myfiles/store/novel/@/
$ cp ~/story.txt ~/myfiles/store/writing/@/
$ cat ~/myfiles/store/novel/@/story.txt.tags
novel
writing
$ mkdir ~/myfiles/store/writing/@/drafts
$ cat ~/myfiles/store/ALL/@/drafts.tags
writing

.tags files would not be listed, to avoid cluttering the interface, but could be easily read, even through a file manager, by manually adding the suffix to the current path.

A drawback of this approach is that a real file whose name ends with “.tags” will be treated as a special file, which will lead to errors. The suffix should therefore be configurable with a command line option and saved inside the repository.ini file.

7. Avoid the tag “key” (since revision r497)

This is a mere bug fix. When the user creates a tag called “key”, the tag can’t be used and can’t be deleted either. This is due to a syntactic collision with SQL, since “key” is a reserved word. Tagsistant must manage this or prevent the creation of the “key” tag.

8. Better time-tags management

Currently Tagsistant allows the creation of triple tags like time:/year/eq/2014 or time:/month/eq/August. While this allows fine-grained control of time in queries, it’s also very verbose. To select a specific day the user has to write: “time:/year/eq/2014/time:/month/eq/August/time:/day/eq/27”. It would be very useful to have a compact syntax like time:/date/eq/2014-08-27/ or time:/date/eq/2014-August-27/. The compact syntax would be a macro-layer on top of the verbose syntax, allowing the user to reuse the information carried by the verbose form.
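The macro-layer could be a plain rewrite of the compact form into the verbose one before the query is parsed. The sketch below assumes that numeric months map to English month names, so that both compact spellings expand to the same query; the function name expand_date_tag is hypothetical.

```python
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def expand_date_tag(compact):
    """Rewrite time:/date/eq/YYYY-MM-DD/ into the year/month/day triple tags."""
    prefix = "time:/date/eq/"
    if not compact.startswith(prefix):
        raise ValueError(f"not a compact date tag: {compact!r}")
    parts = compact[len(prefix):].strip("/").split("-")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected year-month-day in {compact!r}")
    year, month, day = parts
    if month.isdigit():
        # Assumption: numeric months are translated to English month names.
        if not 1 <= int(month) <= 12:
            raise ValueError(f"invalid month in {compact!r}")
        month = MONTHS[int(month) - 1]
    return f"time:/year/eq/{year}/time:/month/eq/{month}/time:/day/eq/{day}"


print(expand_date_tag("time:/date/eq/2014-08-27/"))
```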

9. Named pipe or network protocol for a comfortable GUI

To ease the management of files and tags (especially the re-tagging of a lot of files) a GUI would be very useful. Even though a file manager is the natural GUI of every filesystem, a dedicated GUI able to communicate with Tagsistant through a dedicated protocol would do a better job, sending Tagsistant messages like: add tag “holiday” to file /store/photo/@@/3465.jpg.

This would require:

the creation of a named pipe or network port to allow interprocess communication
the design of a dedicated protocol
the design of a dedicated GUI and its implementation

10. Better autotagging plugins

One of the greatest benefits Tagsistant can bring to its users comes from its autotagging plugins, which ease the tagging process by automatically mining information out of files. This is true as long as a rich and versatile set of autotagging plugins is provided. Currently (2014-05-03), only the image/jpeg and image/* mime types have some real support. This must grow in future releases.

11. New relation: excludes (since revision r498)

One of the relations that can bind two tags is exclusion: this just means that a tag can exclude another one. For example: relations/blindness/excludes/image or relations/celiac_disease/excludes/wheat_flour. This kind of relation could be implemented as the “includes” relation with the “negate” flag set to one.

12. Reporting query syntax errors (since revision r507)

When a query contains an error, it is internally marked as invalid. This prevents query execution, but the user does not get any clue about what’s happening. Tagsistant should instead create a metafile containing a message describing the error. The message could be saved in a hash table using the query itself as key. This will prevent the message from being recreated and will let the read() call find the message after the readdir() call has created it and listed it as the only query result.

Example:

$ ls ~/myfiles/store/{/startrek/{/video/audio/}/@@
syntax_error
$ cat ~/myfiles/store/{/startrek/{/video/audio/}/@@/syntax_error
Nesting tag groups is not allowed
$

The error message is created when the user does the ls and is listed as the only result and stored in the hash table. When the user calls cat, the message is retrieved from the hash table and printed.
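The interplay between the two calls can be sketched with a dictionary standing in for the hash table. This is a simplified model, not Tagsistant's FUSE code: the function names, the module-level table and the naive nesting check in check_query are assumptions made for illustration.

```python
_error_messages = {}  # query string -> error message


def check_query(query):
    """Return an error message for nested tag groups, or None if none is found."""
    depth = 0
    for token in query.split("/"):
        if token == "{":
            if depth:
                return "Nesting tag groups is not allowed"
            depth += 1
        elif token == "}":
            depth = max(depth - 1, 0)
    return None


def readdir(query):
    """List 'syntax_error' as the only entry when the query is invalid."""
    if query not in _error_messages:
        message = check_query(query)
        if message is None:
            return None  # valid query: the normal listing would happen instead
        _error_messages[query] = message  # created once, reused afterwards
    return ["syntax_error"]


def read(query):
    """Return the message stored by an earlier readdir() on the same query."""
    return _error_messages.get(query)


bad_query = "{/startrek/{/video/audio/}/@@"
print(readdir(bad_query))
print(read(bad_query))
```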

13. Relation loop detection

Currently a user can create looping relationships between tags:

$ mkdir ~/myfiles/relations/tag1/includes/tag2
$ mkdir ~/myfiles/relations/tag2/includes/tag1

Tagsistant should detect this situation and prevent the second mkdir().
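A new relation closes a loop exactly when the including tag is already reachable from the included one, so the check is a graph traversal performed before the relation is stored. The sketch below is an illustration only; would_create_loop and add_include are hypothetical names and the relations are kept in a plain dictionary.

```python
def would_create_loop(includes, parent, child):
    """True if adding parent -> child would make parent reachable from itself."""
    stack, seen = [child], set()
    while stack:
        tag = stack.pop()
        if tag == parent:
            return True
        if tag in seen:
            continue
        seen.add(tag)
        stack.extend(includes.get(tag, ()))
    return False


def add_include(includes, parent, child):
    """Store the relation, refusing it when it would close a loop."""
    if would_create_loop(includes, parent, child):
        raise ValueError(f"{parent} includes {child} would create a loop")
    includes.setdefault(parent, set()).add(child)


relations = {}
add_include(relations, "tag1", "tag2")
print(would_create_loop(relations, "tag2", "tag1"))
```

The same traversal also catches longer loops, such as tag1 including tag2, tag2 including tag3 and tag3 including tag1.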

14. Trash tag

Currently, when a file loses its last tag, it’s also deleted from Tagsistant. This can happen without the user being fully aware of it. Tagsistant should implement a trash tag where files are moved (tagged, to be precise) when deleted from their last tag. The trash feature should be selectable from the command line using a switch. When the user deletes files from the trash tag, the files are actually deleted.
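The intended behaviour can be summarised in a few lines. This is a sketch under assumptions: the tag name “trash”, the untag function and the use_trash parameter (standing in for the command line switch) are all hypothetical.

```python
TRASH = "trash"  # hypothetical name of the trash tag


def untag(tagged, obj, tag, use_trash=True):
    """Remove `tag` from `obj`; move the object to the trash on its last tag."""
    tags = tagged[obj]
    tags.discard(tag)
    if tags:
        return  # the object still has other tags
    if use_trash and tag != TRASH:
        tags.add(TRASH)   # last real tag removed: keep the object in the trash
    else:
        del tagged[obj]   # removed from the trash, or trash disabled: delete


files = {"a.txt": {"t1", "t2"}}
untag(files, "a.txt", "t1")
untag(files, "a.txt", "t2")
print(files)
untag(files, "a.txt", TRASH)
print(files)
```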