Commit c91eab41 authored by gerd

continued docs


git-svn-id: https://gps.dynxs.de/private/svn/app-plasma/trunk@231 55289a75-7b90-4627-9e07-ffb4263930b2
parent 0d97ec14
.PHONY: html

SRC = ../src

LOAD = \
	-load $(SRC)/plasmasupport/plasmasupport.idoc \
	-load $(SRC)/plasmaclient/plasmaclient.idoc \
	-load $(SRC)/pfs_datanode/pfs_datanode.idoc \
	-load $(SRC)/pfs_namenode/pfs_namenode.idoc \
	-load $(SRC)/pfs_support/pfs_support.idoc \
	-load $(SRC)/pfs_nfs3/pfs_nfs3.idoc \
	-load $(SRC)/mr_framework/mapred.idoc

DOCS = plasmafs_start.txt \
	plasmafs_deployment.txt \
	commands/cmd_plasma.txt \
	commands/cmd_plasmad.txt \
	commands/cmd_plasma_datanode_init.txt \
	commands/cmd_plasma_admin.txt \
	commands/cmd_nfs3d.txt

html: $(DOCS) plasma_intro.txt odoc/chtml.cmo
	rm -rf html
	mkdir -p html
	cp *.png style.css html/
	ocamldoc -i odoc -g chtml.cmo -d html -stars -css-style style.css \
		-t "Plasma Documentation" -intro plasma_intro.txt \
		$(LOAD) $(DOCS)

odoc/chtml.cmo:
	cd odoc && $(MAKE)

.PHONY: clean
clean:
	rm -rf html
	cd odoc; $(MAKE) clean
{1 nfs3d - daemon for the NFS bridge}
{2 Synopsis}
{[
nfs3d -conf file [-fg] [-pid file]
]}
{2 Description}
This is the daemon acting as an NFS server and forwarding requests to
the PlasmaFS cluster (namenodes and datanodes). The daemon implements
the [nfs] and [mountd] programs of NFS version 3. (Caveat: Early releases
of nfs3d are restricted to read-only mounts.) There is no support for
the [nlockmgr] protocol yet.
There is no access control yet. PlasmaFS stores the owners of files by
name, whereas NFS transmits numeric IDs. A translation facility is still
missing.
An instance of the NFS bridge can only connect to a single PlasmaFS
cluster.
The NFS bridge can only be contacted over TCP. There is no UDP support,
and none is planned; NFS runs well over TCP.
NFS clients can mount PlasmaFS volumes as follows (Linux syntax):
{[
mount -o intr,port=2801,mountport=2800,nolock <host>:/<clustername> /mnt
]}
Here, [<host>] is to be replaced by the machine running the NFS
bridge, and [<clustername>] by the name of the cluster. The port
numbers may need adjustment; we assume the same numbers as in the
configuration examples below.
For a single PlasmaFS cluster there can be multiple NFS bridges. This
makes it possible, for example, to install a separate NFS bridge on
each client machine. The NFS protocol is then run only locally on the
client machine, and does not even touch the real network.
NFS (version 3) only implements weak cache consistency: An NFS client
usually caches data as long as nothing is known about a possible
modification, and modifications can only be recognized by changed
metadata (i.e. the mtime in the inode is changed after a
write). Although NFS clients typically query metadata often, it is
possible that data modifications remain unnoticed. This is a problem
in the NFS protocol, not in the bridge. The PlasmaFS protocol has
better cache consistency semantics; in particular, it is ensured that a
change of data is also represented as an update of the
metadata. However, the different semantics may nevertheless cause
incompatibilities. For example, it is allowed for a PlasmaFS client to
change data without changing the mtime in the inode. Within the
PlasmaFS system this is not a big problem, because there are other
means to reliably detect the change. An NFS client connected via this
bridge might not see the update, though, and may continue to pretend
that its own cache version is up to date. All in all, it is expected
that these problems are mostly of a theoretical nature, and will usually
not occur in practice.
NFS version 3 can deal with large blocks in the protocol, and some
client implementations also support that. For example, the Linux
client supports block sizes up to 1M automatically, i.e. this is the
maximum transmission unit for reads and writes. Independently of the
client support, the NFS bridge translates the sizes of the data blocks
used in the NFS protocol to what the PlasmaFS protocol requires. This
means that the NFS bridge can handle the case that the client uses
data sizes smaller than the PlasmaFS block size. There is a performance
loss, though.
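For example, on Linux the transfer size can also be requested explicitly
with the standard [rsize]/[wsize] mount options (the values shown are
merely illustrative; host and cluster name are placeholders as above):
{[
mount -o intr,port=2801,mountport=2800,nolock,rsize=1048576,wsize=1048576 \
    <host>:/<clustername> /mnt
]}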
{2 Options}
- [-conf file]: Reads the configuration from this file. See below for
details.
- [-fg]: Prevents the daemon from detaching from the terminal and putting
itself into the background.
- [-pid file]: Writes this pid file once the service process is forked.
{2 Configuration}
The configuration file is in [Netplex] syntax, and also uses many features
from this framework. See the documentation for [Netplex] which is available
as part of the [Ocamlnet] library package. There are also some explanations
here: {!Cmd_plasmad}.
The config file looks like:
{[
netplex {
controller {
... (* see plasmad documentation *)
};
namenodes {
clustername = "<name>";
node_list = "<nn_list>";
port = 2730;
};
service {
name = "Nfs3";
protocol {
name = "mount3";
address {
type = "internet";
bind = "0.0.0.0:2800"
}
};
protocol {
name = "nfs3";
address {
type = "internet";
bind = "0.0.0.0:2801"
}
};
processor {
type = "nfs";
nfs3 { };
mount3 { };
};
workload_manager {
type = "constant";
threads = 1;
};
};
]}
Parameters:
- [clustername] is the name of the PlasmaFS cluster.
- [node_list] is a text file containing the names of the namenodes, one
hostname per line.
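For illustration, the file given as [node_list] might look like this
(hostnames are hypothetical):
{[
nn1.example.org
nn2.example.org
]}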
It is not advisable to use the official NFS ports, or to register
the NFS ports with portmapper.
{2 How to shut down the daemon}
First, one should unmount all NFS clients. There is no way for an NFS
server to enforce unmounts (i.e. to force clients to write back all unsaved data).
The orderly way for shutting down the daemon is the command
{[
netplex-admin -sockdir <socket_directory> -shutdown
]}
[netplex-admin] is part of the [Ocamlnet] distribution. The
socket directory must be the configured socket directory.
It is also allowed to do a hard shutdown by sending SIGTERM signals to
the {b process group} whose ID is written to the pid file. There is no
risk of data loss in the server because of the transactional
design. However, clients may well be confused when the connections
are simply dropped.
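Assuming the pid file was written to [/var/run/nfs3d.pid] (the actual path
is whatever was passed with [-pid]), such a hard shutdown might look like:
{[
kill -TERM -- -$(cat /var/run/nfs3d.pid)
]}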
{1 plasma - command-line access to PlasmaFS files}
{2 Synopsis}
{[
plasma list <general options> [-1] pfs_file ...
plasma create <general options> [-rep n] pfs_file ...
plasma mkdir <general options> pfs_file ...
plasma delete <general options> pfs_file ...
plasma put <general options> [-rep n] [-chain] local_file pfs_file
plasma get <general options> pfs_file local_file
]}
General options:
{[
-cluster <name> -namenode <host>:<port>
]}
{2 Description}
The utility [plasma] allows one to directly access files stored in
PlasmaFS via the PlasmaFS-specific RPC protocol.
All [pfs_file] arguments refer to the file hierarchy of the
PlasmaFS cluster. For now, all such paths must be absolute,
and there is no symlink resolution.
{2 General options}
- [-cluster name]: Specifies the name of the PlasmaFS cluster.
This is a required option.
- [-namenode <host>:<port>]: Specifies the namenode to contact.
This option can be given several times; the system then searches
for the right namenode. At least one namenode must be given.
{2 [list] subcommand}
[list] lists files (like Unix [ls]).
There is only one special option:
- [-1]: Outputs one file per line, and nothing else. Without this
option, the output of [list] resembles that of [ls -l].
{2 [create] subcommand}
[create] creates a new file (which must not exist already).
Option:
- [-rep n]: Creates the file with [n] replicas. [n=0] means the
server default, which is also the default if there is no [-rep]
option.
{2 [mkdir] subcommand}
[mkdir] creates a new directory (which must not exist already).
{2 [delete] subcommand}
[delete] removes an existing file, or an existing and empty directory.
{2 [put] subcommand}
[put] creates a new file in PlasmaFS, and copies the contents of
[local_file] to it. [local_file] must be seekable for now.
Options:
- [-rep n]: Creates the file with [n] replicas. [n=0] means the
server default, which is also the default if there is no [-rep]
option.
- [-chain]: By default, the file is copied using star topology
(i.e. a block is independently copied to all datanodes holding
replicas). The [-chain] switch changes this to the chain topology
where a block is first copied to one datanode, and from there to
the other datanodes storing the replicas.
{2 [get] subcommand}
[get] downloads a file from PlasmaFS to the local filesystem.
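A typical session might look like this (cluster name, namenode host, and
paths are hypothetical; the port matches the configuration examples in
{!Cmd_plasmad}):
{[
plasma mkdir -cluster test -namenode nn1:2730 /data
plasma put -cluster test -namenode nn1:2730 -rep 3 input.dat /data/input.dat
plasma list -cluster test -namenode nn1:2730 -1 /data
plasma get -cluster test -namenode nn1:2730 /data/input.dat copy.dat
]}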
{1 plasma_admin - managing namenodes}
{2 Synopsis}
{[
plasma_admin add_datanode <nn_options> -size <blocks> <identity> ...
plasma_admin enable_datanode <nn_options> <identity> <dn_host>:<dn_port>
plasma_admin disable_datanode <nn_options> <identity>
plasma_admin list_datanodes <nn_options>
plasma_admin destroy_datanode <nn_options> <identity>
plasma_admin fsck -conf <namenode_config_file>
]}
where <nn_options>:
{[
-namenode <host>:<port> -cluster <name>
]}
{2 Description}
[plasma_admin] is used for making administrative changes to namenodes.
Right now, there are two types of operations:
- Managing the datanodes that are connected with the namenode cluster
- Doing a consistency check of the namenode database
{2 Managing datanodes}
Every datanode has a unique identity which is created when the
datanode is initialized (via {!Cmd_plasma_datanode_init}). The
identity is primarily stored on the disk of the datanode. The
namenodes maintain a table of known datanode identities. Each
identity can be disabled, enabled, or even be connected with a
running datanode:
- For a {i disabled} identity it is not known on which machine
the datanode server runs or might be running. Files can have
blocks that are stored on a disabled identity, but these blocks
are inaccessible as long as the identity remains disabled.
The namenode never tries to allocate new blocks for disabled
identities. This state is intended for temporarily removing
a datanode server from the PlasmaFS cluster, e.g. for
machine maintenance.
- An {i enabled} identity is usually connected with a running
datanode, but in certain circumstances it is not. Actually,
an enabled identity that is not associated with a server is an error
condition. This occurs in particular if the datanode crashes or
is otherwise unavailable. Operationally, this state is handled
in the same way as a disabled identity. The difference, however,
is that a reconnect to the datanode is attempted once it is
back up. (This function is not yet available in early PlasmaFS
releases, though.)
- An {i enabled and connected} identity is backed by a running
datanode server. It is fully operational.
The identity of a datanode is the permanent identifier that is stored
in the namenode database. At runtime of the cluster, the hostname of
the machine serving connected identities is also known to the
namenode, but it is not stored on disk. Because of this, it is
possible to relocate datanodes at runtime by disabling the identity,
moving the files storing the data for the identity to a different
machine, and re-enabling the identity there.
Note that the namenode config file also enumerates datanodes. This
list of nodes, only given as host and port (and lacking identity
strings), is only used for the automatic discovery of datanodes at
cluster startup time. When the namenode server is started, it connects
to the datanode servers listening on these ports, and automatically
sets the state of these identities to {i enabled and connected} if it
finds the identities in the database, and the identities are enabled.
The following subcommands all require that the namenodes are up and
running.
{3 [add_datanode] subcommand}
The [add_datanode] subcommand adds the identity given on the command line
to the database. It also stores [size] in the database, given as a number
of blocks. This [size] should match the size of the data file - however,
this cannot be checked at [add_datanode] time. Be careful to pass the
right [size].
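For example (identity, host, and cluster name are hypothetical; the size
must match what was passed to {!Cmd_plasma_datanode_init}):
{[
plasma_admin add_datanode -namenode nn1:2730 -cluster test -size 16384 \
    2c9c71b888e367ba75a36e6a6c46e2a8
]}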
{3 [enable_datanode] subcommand}
The [enable_datanode] subcommand sets the identity to enabled, and
tries to connect it to the datanode server listening on [<dn_host>] and
[<dn_port>]. If the connection cannot be established, the identity is
nevertheless enabled although it remains unconnected to a server.
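For example (again with hypothetical names):
{[
plasma_admin enable_datanode -namenode nn1:2730 -cluster test \
    2c9c71b888e367ba75a36e6a6c46e2a8 dn1.example.org:2728
]}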
{3 [disable_datanode] subcommand}
This subcommand disables the identity given on the command line.
{3 [list_datanodes] subcommand}
Lists the known identities and the associated states. Sample output:
{[
20a7df016a330c6d5c5459e409edc14b disabled not associated to datanode
5df72474031271d41cf15c1ca6ef6a4f enabled not associated to datanode
47aa6893145cb75ebe3fab17d0a6521f enabled not associated to datanode
2c9c71b888e367ba75a36e6a6c46e2a8 enabled 192.168.5.30:2728
3347e7d4719a0a1333313a736621123a enabled 192.168.5.40:2728
OK
]}
{3 [destroy_datanode] subcommand}
The identity is entirely removed from the namenode database. This
includes the block lists of the files, i.e. all information about the
blocks for the identity is lost.
{2 Managing the namenode database}
{3 [fsck] subcommand}
This command checks the namenode database for inconsistencies. This
especially includes the correctness of the blockmaps, i.e. the tables
that store whether a block is free or allocated. The blockmaps are
compared with the block lists attached to the inodes.
It is required that the namenode servers are down (i.e. not accessing the
database at the same time).
The [fsck] subcommand does not try to repair the blockmaps.
Options:
- [-conf file]: The [file] must be the config file of the namenode server.
Effectively, only the parameters from the [database] section are interpreted.
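For example (with a hypothetical config file path):
{[
plasma_admin fsck -conf /etc/plasma/namenode.conf
]}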
{1 plasma_datanode_init - initialize datanode}
{2 Synopsis}
{[
plasma_datanode_init -blocksize <blocksize> <directory> <blocks>
]}
{2 Description}
Creates two files in [directory]: [config] and [data]. The [data]
file is created as an empty sequence of [blocks] blocks, where every
block consists of [blocksize] bytes.
Also, a datanode identity is created (32 hex digits).
The identity and the blocksize are put into [config].
After the directory is prepared with this command, it can be referenced
in the [directory] parameter of a datanode config file.
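For example, the following call (with a hypothetical directory) prepares
a datanode with 16384 blocks of 1M (1048576 bytes) each, i.e. a [data]
file of 16 GiB:
{[
plasma_datanode_init -blocksize 1048576 /data/plasma-dn 16384
]}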
{1 plasmad - daemon for datanodes and namenodes}
{2 Synopsis}
{[
plasmad -conf file [-fg] [-pid file]
]}
{2 Description}
This is the daemon implementing datanode and namenode services. [plasmad]
is a collection of services which can be selectively enabled from the
configuration file. By choosing certain sets of services, one gets either
a datanode or a namenode server.
The configuration file is in [Netplex] syntax, and also uses many features
from this framework. See the documentation for [Netplex] which is available
as part of the [Ocamlnet] library package. A working subset is described
below.
{2 Options}
- [-conf file]: Reads the configuration from this file. See below for
details.
- [-fg]: Prevents the daemon from detaching from the terminal and putting
itself into the background.
- [-pid file]: Writes this pid file once the service process is forked.
{2 General configuration file layout}
A config file generally looks like:
{[
netplex {
controller {
socket_directory = "<socket_directory>";
max_level = "debug"; (* Log level, also "info", "notice", "err", ... *)
logging {
...
}
};
service {
name = "<service_name>";
...
};
<custom_section> {
...
};
}
]}
Without going too much into detail:
- {i Sections} have the form {[ <name> { ... } ]}
- {i Parameters} have the form {[ <name> = <value> ]}
- Sequences of sections/parameters are delimited with ";"
- Comments are between (* and *) (as in Ocaml)
- Parameter values can be "strings", or integers (123), or floats
(123.4), or bools (true/false)
The [<socket_directory>] is a place where the daemon puts runtime files
like Unix Domain sockets. Each instance of a daemon must have a separate
socket directory.
{2 Logging}
Log messages can go to stderr, to files, or to syslog. Please see the
documentation in [Netplex_log] for details. A simple logging specification
would be:
{[
logging { type = "stderr" }
]}
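Other log destinations follow the same pattern; for example, a file-based
configuration might look like this (the exact parameter names should be
checked against [Netplex_log]; they are only assumed here):
{[
logging { type = "file"; file = "/var/log/plasmad.log" }
]}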
{2 Config file for datanodes}
For a datanode the config file looks like:
{[
netplex {
controller {
... (* see above *)
};
datanode {
clustername = "<name>";
directory = "<data_dir>";
blocksize = <blocksize>; (* int *)
io_processes = <p>; (* int *)
shm_queue_length = <q>; (* int *)
sync_period = <s>; (* float *)
};
service {
name = "Dn_manager";
protocol {
name = "RPC";
address {
type = "internet";
bind = "0.0.0.0:2728"
};
address {
type = "local";
path = "<rpc_socket>";
}
};
processor {
type = "dn_manager";
};
workload_manager {
type = "constant";
threads = 1;
};
};
}
]}
Parameters:
- [clustername] is the name of the PlasmaFS cluster. All namenode and
datanode daemons must be configured for the same name.
- [directory] is a local directory where the datanode can store
blocks. The daemon expects two files in this directory:
[config] and [data]. These files can be created with the utility
{!Cmd_plasma_datanode_init}.
- [blocksize] is the block size in bytes. Should be in the range
65536 (64K) to 67108864 (64M). The size must be divisible by the
page size (4096). The block size of all datanodes must be the same.
- [io_processes] is the number of I/O processes to start. Effectively,
this is the number of parallel I/O requests the datanode server can
submit to the kernel at the same time. A low number like 8 or 16
suffices in typical deployments.
- [shm_queue_length] is the number of blocks the datanode server
can buffer in shared memory. These buffers are used to speed up
the communication between the main datanode process and the I/O
processes. A small multiple of [io_processes] should be good.
- [sync_period] says after how many seconds written blocks should be
synced to disk. The higher the value, the more efficient the
sync is, but the longer clients have to wait until the sync is done.
Values between 0.1 and 1.0 seem to be good.
- [rpc_socket]: The path to a Unix Domain socket where the datanode
can also be contacted in addition to the internet socket. The socket
can live in the socket directory.
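As an illustration, the [datanode] section might be filled in like this
(all values are hypothetical, but within the documented ranges):
{[
datanode {
  clustername = "test";
  directory = "/data/plasma-dn";
  blocksize = 1048576;
  io_processes = 8;
  shm_queue_length = 32;
  sync_period = 0.5;
};
]}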
{2 Config files for namenodes}
For a namenode the config file looks like:
{[
netplex {
controller {
... (* see above *)
};
database {
dbname = "<name_of_postgresql_database>";
(* maybe more options, see below *)
};
namenodes {
clustername="<cluster_name>";
node_list = "<nn_list>";
port = 2730;
rank_script = "ip addr show label 'eth*' | grep link/ether | awk '{print $2}'"; (* see below *)
inodecache { port = 2740 };
};
datanodes {
node_list = "<dn_list>";
port = 2728;
blocksize = <blocksize>;
};
service {
name = "Nn_manager";
protocol {
name = "RPC";
address {
type = "internet";
bind = "0.0.0.0:2730"
};
address {
type = "local";
path = "<manager_socket>";
};
};
processor {
type = "nn_manager";
};
workload_manager {
type = "constant";
threads = 1;
};
};
service {
name = "Nn_inodecache";
protocol {
name = "RPC";
address {
type = "internet";
bind = "0.0.0.0:2740"
};
address {
type = "container";
};
};
processor {
type = "nn_inodecache";
};
workload_manager {
type = "constant";
threads = 1;
};
};
}
]}
Parameters in [database]:
- The [database] section can include more parameters. See the function
{!Pfs_db.extract_db_config} for a complete list.
Parameters in [namenodes]:
- [clustername] is the name of the PlasmaFS cluster. All namenode and
datanode daemons must be configured for the same name.
- [nn_list] is a text file containing the names of the namenodes, one
hostname per line.
- The [rank_script] is quite a special parameter. Actually, one has to
specify either [rank] or [rank_script]. [rank] is simply a string,
and [rank_script] is a script writing this string to stdout.
Every namenode instance must be configured with a different rank
string. If there are two instances with the same string, the cluster
will not start up. The above script is for Linux, and extracts MAC
addresses from all [eth*] network interfaces. The rank string is
used in the coordinator election algorithm. The node with the
lexicographically smallest string wins.
- A complete list of parameters can be found here:
{!Nn_config.extract_node_config}
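For illustration, instead of [rank_script] a fixed rank string (here a
hypothetical value) can be set directly:
{[
rank = "00:1e:c9:3a:4f:10";
]}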
Parameters in [datanodes]:
- [dn_list] is a text file containing the names of the datanodes, one
hostname per line. These datanodes are auto-discovered at cluster
startup.
- [blocksize] is the block size in bytes. Should be in the range
65536 (64K) to 67108864 (64M). The size must be divisible by the
page size (4096). The block size of all nodes must be the same.
Other:
- [manager_socket]: The path to a Unix Domain socket where the namenode
can also be contacted in addition to the internet socket. The socket
can live in the socket directory.
{2 How to shut down the daemon}
The orderly way for shutting down the daemon is the command
{[
netplex-admin -sockdir <socket_directory> -shutdown
]}
[netplex-admin] is part of the [Ocamlnet] distribution. The
socket directory must be the configured socket directory.
It is also allowed to do a hard shutdown by sending SIGTERM signals to
the {b process group} whose ID is written to the pid file. There is no
risk of data loss in the server because of the transactional
design. However, clients may well be confused when the connections
are simply dropped.
{1 Plasma Documentation}
Plasma release: 0.1 "vorfreude". This is an alpha release to make
Plasma known to interested developers. This release contains:
- {{:#l_pfs_main} PlasmaFS: filesystem}
- {{:#l_pmr_main} PlasmaMapReduce: compute framework}
- {!Plasma_start}: Feature set, theory of operation
{1:l_pfs_main PlasmaFS Documentation}
PlasmaFS is the distributed transactional filesystem. It is implemented
in user space, and can be accessed via RPC calls, or via NFS.
- {!Plasmafs_start}: Feature set, theory of operation
- {!Plasmafs_deployment}: Deploying PlasmaFS
{2 PlasmaFS Commands}
- {!Cmd_plasma}: The [plasma] utility
- {!Cmd_plasmad}: The [plasmad] daemon for datanodes and namenodes
- {!Cmd_plasma_datanode_init}: The [plasma_datanode_init] utility for
initializing datanodes
- {!Cmd_plasma_admin}: The [plasma_admin] utility for doing namenode
administration
- {!Cmd_nfs3d}: The [nfs3d] daemon for the NFS bridge
{2 PlasmaFS Client Interfaces}
Client applications can only link with {b plasmasupport} and
{b plasmaclient}.
{3 [plasmasupport]: Support library}
{!modules:
Plasma_rng
}
{3 [plasmaclient]: RPC client}
{!modules:
Plasma_shm
Plasma_client
}
{2 PlasmaFS Internal Server Interfaces}
The following interfaces exist only within the server.
{3 [pfs_support]: Support library}
{!modules:
Pfs_db
Pfs_rpcapi_aux
Pfs_rpcapi_clnt
Pfs_rpcapi_srv
}
{3 [pfs_datanode]: Datanode server}
{!modules:
Dn_config
Dn_store
Dn_shm
Dn_io
Dn_manager
}
{3 [pfs_namenode]: Namenode server}
{!modules:
Nn_config
Nn_db
Nn_blockmap
Nn_datanode_ctrl
Nn_datastores
Nn_datastore_news
Nn_state
Nn_commit
Nn_elect
Nn_inodecache