CIYAM Docs - Understanding the File System

Understanding the File System
-----------------------------

Immediately below the directory where the application server operates is a "files" directory which allows the
application server to function as a file storage system accessible via the standard protocol as well as being
accessible via the peer protocol.

The way this file system operates is quite similar to how "git" operates under the covers and so will be very
familiar to those who have a thorough knowledge of that system. Each file is stored in a directory whose name
is the first two characters of the hex ASCII string of the SHA256 hash of the file content (uncompressed) and
whose filename is the remaining characters of the hash. In order to reduce the disk space footprint files are
normally compressed when they are created (although this is optional).

Because the name of the file is a secure hash of its content files cannot be edited (as a new version will of
course have a new hash). For both this reason and the fact that SHA256 hashes are not things that most people
could ever hope to remember "file tags" are used as labels for locating files.

To see the list and usage of all the file system commands issue the following command from "ciyam_client":

> ? file*

When starting "ciyam_server" the files area directory can be changed with the "-files=<directory>" option and
it should be noted that the system log files ("ciyam_script.log" and "ciyam_server.log") are created/appended
in the files area directory (although they are not considered to be part of the file storage system).

Fundamental File Types
----------------------

The file system supports two fundamental file types which are Blob and List. The first byte of each file will
identify the fundamental file type as well as holding some other bit flags that will be explained further on.
It is important to note that the SHA256 hash includes this first byte (so do not expect the SHA256 hash of an
external file to match that of a file copied into the files area using "file_put").

The Blob file type is used to store any kind of arbitrary data (without any formal meta-data so that it isn't
possible to actually know what kind of data it is that is being stored). The List file type provides a method
of both tying names to other files and constructing a directed acyclic graph of other files through the items
being themselves other List files.

The following illustrates the type, full path and the content of the fundamental files types:

[blob - files/cc/eeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef]
^Ahello

[list - files/02/9eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a]
^Bcceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef first

It should be noted that the caret symbol is being used to indicate control characters in the above content so
the actual values of the first character in each of the files above are 0x01 and 0x02 respectively. The above
files can be created using the standard application server protocol with the following commands:

> file_raw -text blob hello
> file_raw -text list "cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef first"

The "-text" option is used to prevent file compression which is useful for testing purposes (but not normally
recommended). The "file_info" command can be used to examine the information about a file, and if the file is
a List, to examine the information of referenced files as well.

> file_info -recurse -d=2 029eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a
[list] 029eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a (71 B)
first
 [blob] cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef (6 B) [utf8]
hello

The above shows the List that had been created contains a single "list item" with the name "first" which then
points to a Blob that contains the text "hello". Considering that any of the "list items" can also themselves
be lists it should be clear that this is extremely flexible, however it is not clear from such an example how
the hash for a particular List (or Blob if desired) can easily be found which is the purpose of "file tags".

File Tags
---------

The "file tags" enable linking of human readible file names to the file hashes in the file system. These tags
function in much the same way as "branch tags" in git work (under the hood). To show how these are useful the
following command tags the List file that was created before.

> file_tag 029eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a start

Now we can see the same "file info" output we saw above with the following more user friendly command:

> file_info -recurse -d=2 start
[list] 029eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a (71 B)
first
 [blob] cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef (6 B) [utf8]
hello

A simplified version that doesn't expand the file item content is as follows:

> file_info -recurse start
[list] 029eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a (71 B)
first
 [blob] cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef (6 B)
 ...

So rather than trying to recall "029eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a" it is only
required to recall "start". So if the List represents a traditional "file name" then imagine that we now want
to rename the "file" named "first" to "greetings".

Firstly we'd create a new List with the following command:

> file_raw -text list "cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef greetings"

Then secondly we would switch the "start" tag to the hash of the newly created List as follows:

> file_tag 5a5df977b247df3c493deca2977ce81e7ea4b3451ce393c6cc85cd879864bc58 start

It should also be noted that the tag could have been included in the "file_raw" command with this usage:

> file_raw -text list "cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef greetings" start

Now we repeat our List recurse command as follows:

> file_info -recurse start
[list] 5a5df977b247df3c493deca2977ce81e7ea4b3451ce393c6cc85cd879864bc58 (75 B)
greetings
 [blob] cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef (6 B) [utf8]
 ...

What can be seen here is that the Blob in both the above and previous "list recurse" output is the same which
also means that although we have two List files we actually only have a single file with the "hello" content.
So in this way Blobs can be used with different Lists but disk space is not being wasted as only one file per
"hash" ever exists.

If the appearance of "..." (rather than actual content) is not wanted and also to recurse multiple levels of
lists the value 999 should be used for the depth argument (also useful if the blob content is expected to be
large as the content will not be retrieved for them making the command much faster if there are many blobs):

> file_info -recurse -d=999 start
[list] 5a5df977b247df3c493deca2977ce81e7ea4b3451ce393c6cc85cd879864bc58 (75 B)
greetings
 [blob] cceeb7a985ecc3dabcb4c8f666cd637f16f008e3c963db6aa6f83a7b288c54ef (6 B)

Although only one file per hash exists each file can have multiple tags which can be particularly useful when
identifying things by a specific version or simply as the latest version. To illustrate we'll add two version
specific tags to the two List files we had previously created.

> file_tag 029eaf5e442ca27b4d9dccba2cd5ace42e7f98ee33c6ab69d19ea55656f1f84a start1
> file_tag 5a5df977b247df3c493deca2977ce81e7ea4b3451ce393c6cc85cd879864bc58 start2

To list all of the tags matching a wildcard pattern use the following:

> file_tags start*

and to see the file hash for a specific tag use the following:

> file_hash start

To remove file tags use the "file_tag" command with either the "-remove" or "-unlink" option. If "-unlink" is
used then the file itself will be deleted if no other tags are linked to it whereas the "-remove" option will
only ever remove the tag and will never delete the file that it links to.

It should be noted that the "file_tags" command can be given a comma separated list of tag names which allows
multiple tags to be added or deleted/unlinked using the one command. Also when creating a "list" file via the
"file_raw" command a comma separated list of tags can be used as a simpler method to construct the list items
(assuming each list item file has already been tagged and that this is also the desired name for the item).

Copying Files
-------------

A simple way to copy a file into the file system from the current directory is illustrated as follows:

> file_put mnemonics.txt xxx

This command copies the file "mnemonics.txt" into the file system along with creating a file tag "xxx" for it
(which makes it easier to locate). If not wanting to use a file tag the hash can be determined from a session
variable named "@last_file_put" as illustrated here:

> session_variable @last_file_put
09bc435dd5710bc24064fbbaf1e1b4cffa1aad704a15a3c90fd6d98d64fcc89f

In order to fetch this copy from the file system use the following:

> file_get xxx

It should be noted that the file content (originally sourced from "mnemonics.txt") will have been copied into
a file called "xxx" in the current directory (to use a different filename add another argument to the command
such as "file_get xxx copy_of_mnemonics.txt").

Because the file system imposes a maxiumum file size limit the "file_put" command (only when this command has
been issued via "ciyam_client") can prefix the file name with a byte size followed by an asterisk as follows:

> file_put 1K*test.jpg tst.jpg

This will copy "test.jpg" into the file system as a number of "chunks" that are each no more than 1kB in size
(which will be flagged to not be compressed). It will also create a "list" file that references each of these
chunks and the "tst.jpg" tag is added to this "list" (if the "tst.jpg" tag was omitted then hash of this list
can be found in the session variable "@last_file_put").

The following command will output the content of the "tst.jpg" list file:

> file_info -content tst.jpg
[list] 2d3c89f8f5301604234589e08a695e3ab0bdaa5f99ec21cf148b99d13020cb85 (307 B)
8f23a8e586d7975095e740da1d43f95cfa057816e1cabddd9e627541a0b6b59d test.jpg.000000
cd60cc9598f0831062978f58f3b2f04582c2a2b7efa7876dd03dcd058f2d8b74 test.jpg.000001
e60982ce0b124d64f4f8c8cd2ab2b8fd9bd46c1f022aa43f4afd4618bdd056e7 test.jpg.000002
54749b4da930e6db6938a0f6393f5fa1c4dba0a148e9c23da10bed11c081b6f5 test.jpg.000003
579ad8961e4dc03ebf3965de840ff2eac2b500fde9a144d4f8d67c77ae01e686 test.jpg.000004
096b48069f1b3d0dc3a2b650661e6bbf94a5afe3a1c6694745c4505fcee8e1c2 test.jpg.000005
15515fff444eb94e0d3e0074f4c772a8bed8ec9a01bc40c691122e812adedea7 test.jpg.000006

It should be noted that the original "test.jpg" filename has been kept with each list item and a chunk number
of six digits has been appended (so it might be a preferable idea to use the original file name as the tag to
avoid confusion but using it this way here shows clearly that the tag is is irrelvant for the file name which
will be used when extracting and recombining the file chunks next).

The following command (only when issued by "ciyam_client") extracts the original "test.jpg" file to "xxx.jpg"
by appending each of the chunks identified in the list that had been tagged with "tst.jpg" previously:

> file_get tst.jpg *xxx.jpg

It should be noted that to keep the file's original name (such as "test.jpg") just "file_get tst.jpg *" could
be used instead. Also if wanting to keep the original file's name but extract the file below another path the
following could instead be used (assuming the directory "tmp" exists below the current directory):

> file_get tst.jpg *tmp/

When very large files are transferred (or a large number of file chunks are processed) a "progress indicator"
(depending upon whether using a console and startup options) will appear. This indicator can be prefixed with
the value of the environment variable FILE_NAME. The following is an example of how this can be utilised:

> FILE_NAME=xxx

> file_put 1K*Meta.so $FILE_NAME
xxx (66%)

File Encryption
---------------

Any file can be encrypted with a password using the "file_crypt" command. It should be noted that once a file
has been encrypted no information other than its original content hash and size will be returned with the use
of "file_info". If the file is a "list" and the "-recurse" option is used then all of the list items are also
encrypted (and child lists will cascade resulting in the entire tree of files being encrypted). If wanting to
only encrypt the blobs (i.e. leaf nodes) then use additional optional '-blobs_only' argument.

> file_crypt tst.jpg password

> file_info -content tst.jpg
[list] 2d3c89f8f5301604234589e08a695e3ab0bdaa5f99ec21cf148b99d13020cb85 (305 B) [***]

To decrypt simply repeat the "file_crypt" command (with the matching "password"). It should be noted that any
tags associated with an encrypted file will remain (only the file content itself is affected) and because the
encryption used is a fast "stream cipher" the password would need to be quite large (such as more than twenty
mixed upper/lower case letters, numbers and special characters) to be effective against brute force cracking.

Deleting Files
--------------

For the "xxx" tag created previously the file could be deleted using the "file_tag" command as follows:

> file_tag -unlink xxx

But if any other tags are linked to the same file content then this command will not actually delete the file
therefore to force file deletion the "file_kill" command would instead be used as follows:

> file_kill xxx

It should be noted that any tags that were linked to this file will now have been removed but some care would
be advised before deciding to use "file_kill" as this could effectively leave "list" content with now invalid
entries.

For the case of the "tst.jpg" list (which contains the "test.jpg" file chunks) the list as well as each child
list item (and any further descendant branch lists if they were present) can be removed with the following:

> file_kill -recurse tst.jpg

File System Statistics
----------------------

The "file_stats" command can be used to see a summary of the file system usage:

> file_stats
[3/100000]152 B/100.0 GB:4 tag(s)

The response says that 3 out of 100,000 possible files are currently being stored in the files area and these
files are using 152 bytes of the maximum possible capacity of 100 GB being used with a total of 4 file tags.

There are two settings in "ciyam_server.sio" configuration file that are relevant to the file system and they
are as follows (uncomment and change then restart "ciyam_server" for the new settings to take effect):

# <files_area_item_max_num>100K
# <files_area_item_max_size>1M

Although it might be useful to increase the maximum number of files permitted (assuming available disk space)
it should be noted that changing the maximum file size should only be for testing purposes as this value will
need to be identical between all peer connections.

It is not recommended to do any changes to the "files" subdirectory directly but if this has occurred then it
is possible to re-synchronise the application server with the following command:

> file_resync

Note that this command's output mimics that of "file_stats" as a way to easily check if changes had occurred.
It should also be noted that files that ought to be tags (but are either not tag files or are somehow invalid
due to external changes made in the files area directory) will be listed in the application server log. These
files can be automatically deleted by using the command option "-remove_invalid_tags").

Although the files system directory is called "files" by default (and is located directly below the directory
that is the application server's current working directory) it can be changed via the special system variable
"@files_area_dir" (which will result in an automatic "file_resync" when set).

Core Files
----------

The other flags that were mentioned previously include one flag to indicate if the file is compressed as well
as a flag to indicate that a file is a "core file" and another to indicate that the file is in MIME format.

The "core file" format is a plain text format that begins with a three letter type followed by a colon. After
the colon a header typically follows which consists of comma separated attributes in the form of an attribute
id being assigned a value (e.g. a=123,b=xyz). Some of the core files will include further lines of text which
may or may not be prefixed by a detail type followed by a colon.

Core files (much like configuration files) are essential for the application server so should not be manually
tagged, created or removed (as doing so could risk corrupting application behaviour).

File Archives
-------------

As the file system has a specific limit to the number of files that it allows a file archiving implementation
has been created for storing files that are not currently being used. It should be noted that each archive is
the combination of an external path and information stored about the archive in the server's global ODS DB.

Although disparities between the external file storage and ODS information can be easily fixed direct changes
to these external archive files are not recommended.

Archive Maintenance
-------------------

The "file_archive" command is used to add/remove/repair or even destroy an archive. The path that you wish to
use for an archive is expected to already exist (i.e. it will not be created) as most often the path would be
a "mount point".

Assuming the path for the archive is "/mnt/a001", the name chosen for the archive is "001" and it is intended
to hold up to 100GB of data the create command would be as follows:

file_archive -add /mnt/a001 100GB 001

To see a list of archives that also displays their last known status use "file_archives". Assuming that there
was no issue with the path (when adding it tests that it can access the path and write to it) then the output
from the "file_archives" command would be as follows:

001 [okay      ] (0 B/100.0 GB) /mnt/a001

If the path wasn't accessible then you would instead see the following:

001 [bad access] (0 B/100.0 GB) /mnt/a001

Assuming the issue with the path was then resolved use the "-status_update" option for "file_archives" for it
to re-check that it can access and write to the path. It should be noted that the size that was specified for
the archive creation should be a fair bit less than the physical size of the storage device as the file names
used are sixty four characters each (and typically an OS uses quite a lot more bytes per file as overhead).

If an archive is removed (by using the "-remove" option with the "file_archive" command) then the files which
are already in the archive will remain (only size information will be removed from the global ODS DB). If one
wants to have the archived files removed as well then the "-destroy" option should instead be used. Even when
using "-destroy" it should be noted that the actual directory for the archive will not be removed.

If an archive was unintentionally removed or if it is suspected to be in an inconsistent state then "-repair"
can be used to reconstruct the size information within the ODS DB (this command can take some time to execute
if there are a very large number of files in the archive).

It would be recommended to use numbers (maybe use year and month created if starting at "001" as above is not
appealing) so that one can easily see which archives are older. In this way when all of the available storage
space has been nearly exhausted one could destroy the oldest archive to make room for a new one.

Archiving Files
---------------

Typically when a file is stored (that is not a "core file") it will be automatically tagged with a time stamp
prefixed by "ts.". The following is a repeat of the earlier "file_put" but without providing a tag:

file_put mnenomics.txt

Now if the command "file_tags ts.*" is issued the output would look something like this:

ts.2018123018404900000

As the tag is based upon the current time when multiple such tags are found they will be output in order from
oldest to most recently added. This is then used to determine which files to automatically archive (or simply
remove if no archive space is available). Whenever a new file is being added into the file system an existing
file will be archived (and/or removed) if the file system has already stored its maximum number of files.

Files can be manually archived using the "file_relegate" command. Assuming the file system only contained the
tag shown above then assuming an archive with available space exists the following command would archive this
file:

file_relegate -n=1

If more than one archive exists then the first archive (in order of the list of archives) that has sufficient
space will be chosen. Normally if a file is found to have already been archived then it would just be removed
but this can be overridden if the optional archive name argument was provided in the "file_relegate" then the
file will be archived in the named archive (even if found to already be stored in another existing archive).

Retrieving Archived Files
-------------------------

The command "file_retrieve" will move a file from the archives into the files area. The file's identity which
is the SHA256 "name" must be known in order to retrieve a file from the archives so it this command would not
generally be manually issued. An optional number of days to be added to the current time stamp can be used so
the file might stay in the files area longer than other time stamp tagged files. Instead of this one can pass
the command a specific tag for the file being retrieved which will effectively keep the file from being later
archived. The following command will retrieve the file archived previously giving it the permanent tag "xxx":

file_retrieve 09bc435dd5710bc24064fbbaf1e1b4cffa1aad704a15a3c90fd6d98d64fcc89f xxx

If no tag is provided then a new "ts.<time stamp>" tag will be automatically supplied for the file.

Permanently Deleting All File Copies
------------------------------------

The command "file_relegate" has a "-destroy" option which if used will delete every archived copy of the file
identified as well as remove the file from the files area (if it exists there as well). A "-blacklist" option
can be used instead of the "-destroy" one which will add the hash of the file being removed from the archives
and the files area to a "blacklist" which belongs to the global ODS DB. This will prevent the file from being
later stored via a "file_put" (or "put" for the blockchain peer protocol) into the files area and archives.