Commit 5db2ce8b authored by gerd

docs, build


git-svn-id: https://gps.dynxs.de/private/svn/app-plasma/[email protected] 55289a75-7b90-4627-9e07-ffb4263930b2
parent e76fe1d8
......@@ -23,8 +23,8 @@ Documentation:
- README
- GPL
- Release documentation, especially what is not yet working
- .x files
- Explanations about the transaction model
- .x files - DONE
- Protocol introduction; Explanations about the transaction model
- Use of shared memory
- Mapred instructions
- Mapred primer
......@@ -229,6 +229,13 @@ Mapred:
------------
problem ticket system: we do not distinguish between read and write
permission.
lookup: flag "repeatable"
Check: Switch to "repeatable read"?
problem namenode: should behave better when postgres connection limit
is exceeded, e.g. wait until connections become available again - DONE
......
......@@ -27,17 +27,25 @@ in user space, and can be accessed via RPC calls, or via NFS.
Client applications can only link with {b plasmasupport} and
{b plasmaclient}.
{3 [plasmasupport]: Support library}
{3 [plasmaclient]: RPC client}
{!modules:
Plasma_rng
Plasma_client
Plasma_shm
}
{3 [plasmaclient]: RPC client}
These are the mappings of the XDR definition to Ocaml (as used in
the client):
{!modules:
Plasma_shm
Plasma_client
Plasma_rpcapi_aux
Plasma_rpcapi_clnt
}
{3 [plasmasupport]: Support library}
{!modules:
Plasma_rng
}
{2 PlasmaFS RPC protocol definition}
......@@ -64,6 +72,12 @@ The following interfaces exist only within the server.
{!modules:
Pfs_db
}
These are the mappings of the XDR definition to Ocaml (as used in
the server):
{!modules:
Pfs_rpcapi_aux
Pfs_rpcapi_clnt
Pfs_rpcapi_srv
......
......@@ -15,11 +15,11 @@ There are a number of servers involved:
processes.)
- Inodecache namenode server: This is a separate server for maximizing
responsiveness. It implements the public RPC program
{Pfs_nn_inodcache.inodecache},
{!Pfs_nn_inodecache.inodecache},
and the internal RPC program {!Pfs_nn_internal.notifications}.
- Main datanode server: Clients can here read and write blocks. This
server implements the public RPC program {Pfs_datanode.datanode}, and the
internal RPC program {Pfs_dn_internal.datanode_ctrl}.
server implements the public RPC program {!Pfs_datanode.datanode}, and the
internal RPC program {!Pfs_dn_internal.datanode_ctrl}.
- I/O processes of datanode: These are internal worker processes controlled
by the main datanode server. They implement the internal program
{!Pfs_dn_internal.datanode_io}.
......@@ -31,27 +31,32 @@ completely optional.
There is right now nothing that prevents clients from calling internal
RPC programs.
{2 [Coodination]: Finding the coordinator}
{2 [Coordination]: Finding the coordinator}
The first step of a client is to find the coordinator. The client may
have one or several namenode ports, however, only a certain port is
know one or several namenode ports, however, only a certain port is
the right one to send the namenode queries to. At cluster startup, the
namenodes elect a coordinator. Only the coordinator can respond to
client queries.
The program {!Pfs_nn_coord.coordination} includes
functions to find the coordinator:
- [find_coordinator]: This returns the host and port of the
coordinator. Every namenode server can return this information.
- [is_coordinator]: This just returns whether this is already the
namenodes elect a coordinator. Only the coordinator can actually
respond to client queries. However, all namenode servers create the
namenode socket, and RPC requests can be sent to them. The
non-coordinators will emit errors if they get a query that only the
coordinator is able to respond to.
The program {!Pfs_nn_coord.coordination} (reachable via the main
namenode port) includes functions to find the coordinator. All
namenode servers can respond to these queries:
- {!Pfs_nn_coord.find_coordinator}: This returns the host and port of the
coordinator.
- {!Pfs_nn_coord.is_coordinator}: This just returns whether this is already the
port of the coordinator.
From a client's perspective it is unpredictable which host becomes
the coordinator. Usually, clients just call [find_coordinator] to
get the information. Long-running clients may call [is_coordinator]
every time they establish a new TCP connection to a namenode to
validate whether this is still the coordinator.
From a client's perspective it is unpredictable which host becomes the
coordinator. Usually, clients just call
{!Pfs_nn_coord.find_coordinator} to get the information. Long-running
clients may call {!Pfs_nn_coord.is_coordinator} every time they
establish a new TCP connection to a namenode to validate whether this
is still the coordinator.
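As a rough sketch (in the same informal notation as the examples
further below; error handling omitted), a client might start like this:
{[
/* Ask any namenode for the coordinator (every namenode can answer): */
(host, port) = find_coordinator();
<connect to host:port>;

/* A long-running client, when it later opens a new TCP connection: */
if ( not is_coordinator()) {
  (host, port) = find_coordinator();
  <reconnect to host:port>;
}
]}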
{2 The [Filesystem] program}
......@@ -69,9 +74,9 @@ because it means once you get an inode ID at hand, it is a permanent
identifier for the same object. It is not possible that a parallel
accessing client changes the inode ID so it points to a different file.
Unlike in the normal Unix filesystem interface, the PlasmaFS protocol
Unlike the normal Unix filesystem interface, the PlasmaFS protocol
returns the inode ID's to the user, and the user can also access files
by this ID. This is even the primary method of doing this.
by this ID. This is even the primary method of doing so.
An inode can exist without filename - but only for the lifetime of a
transaction. When the transaction is committed, inodes without
......@@ -104,16 +109,26 @@ The sequence number can be thought as a version number of the contents,
and may e.g. be useful for quick checks whether content has changed
compared to some previous point in time.
{4 Operations}
- {!Pfs_nn_fsys.allocate_inode}: Create a new inode
- {!Pfs_nn_fsys.get_inodeinfo}: Read the {!Pfs_types.inodeinfo} struct
- {!Pfs_nn_fsys.update_inodeinfo}: Change the {!Pfs_types.inodeinfo} struct
- {!Pfs_nn_fsys.delete_inode}: Delete an inode
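The following sketch shows how these calls fit together (informal
notation as in the examples further below; the transaction calls are
explained in the next section, and the exact fields of
{!Pfs_types.inodeinfo} are not spelled out here):
{[
begin_transaction(tid);
inode = allocate_inode(tid, ii);     /* ii: the initial inodeinfo */
ii2 = get_inodeinfo(tid, inode);     /* read the current metadata */
update_inodeinfo(tid, inode, ii2);   /* write back a modified copy */
commit_transaction(tid);
]}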
{3 Transactions}
All metadata and data accesses are done in a transactional way. (There
are a few exceptions, but these are quite dangerous.) This means the
client has to open a transaction ([begin_transaction]) first, and
has to finish the transaction after doing the operations (either via
[commit_transaction] or [abort_transaction]). A transaction sees
immediately what other transactions have committed, i.e. we generally
have a "read committed" isolation level. (An important exception is
explained below.)
client has to open a transaction ({!Pfs_nn_fsys.begin_transaction})
first, and has to finish the transaction after doing the operations
(either via {!Pfs_nn_fsys.commit_transaction} or
{!Pfs_nn_fsys.abort_transaction}). A transaction sees immediately what
other transactions have committed, i.e. we generally have a "read
committed" isolation level. (An important exception is explained
below.) ({b Note}: While I write this I realize that we already have
a "repeatable read" isolation level for [inodeinfo]. It is likely that
this issue needs some cleanup.)
The client may open several transactions simultaneously on the same
TCP connection. A {i transaction ID} is used to identify the transactions.
......@@ -124,7 +139,7 @@ aborted. This protects server resources.
From the client's perspective the transactions look very much like
SQL transactions. There are some subtle differences, though:
- The way competing accesses to the same piece of data are handled.
Generally, PlasmaFS uses a pessimistic locking scheme - locks are
Generally, PlasmaFS uses a pessimistic concurrency scheme - locks are
acquired before an operation is tried. When the locks cannot be
acquired immediately, PlasmaFS aborts the operations and returns
the error [ECONFLICT]. It does not wait until the locks are free
......@@ -157,6 +172,14 @@ transactions continuously open, but these are only for reading data.
During a commit, another PostgreSQL transaction is used for writing
data. On PostgreSQL level, no conflicting updates can occur anymore.
{4 Operations}
- {!Pfs_nn_fsys.begin_transaction}: Start a transaction
- {!Pfs_nn_fsys.commit_transaction}: Make all changes permanent and
finish the transaction
- {!Pfs_nn_fsys.abort_transaction}: Undo all changes and finish the
transaction
{3 Directories}
PlasmaFS stores directories not as a special kind of file, but in the
......@@ -182,7 +205,14 @@ There is no symlink resolution in the PlasmaFS server yet. Other than
that, symlinks work already. (The NFS bridge supports symlink
resolution.)
{3 Example 1: Create file}
{4 Operations}
- {!Pfs_nn_fsys.lookup}: Find an inode by file path
- {!Pfs_nn_fsys.link}: Create a file path for an inode
- {!Pfs_nn_fsys.unlink}: Remove a file path for an inode
- {!Pfs_nn_fsys.list}: List the contents of a directory
{4 Example 1: Create file}
This transaction creates a new file by allocating a new inode, and
then linking this inode to a filename:
......@@ -195,7 +225,7 @@ link(tid, "/dir/filename", inode);
commit_transaction(tid);
]}
{3 Example 2: Rename file}
{4 Example 2: Rename file}
There is no rename operation. However, one can easily get exactly the
same effect by:
......@@ -264,6 +294,14 @@ becomes an issue, because data can leak from files a user must not
have access to. This will be fixed when we introduce access control
to PlasmaFS.
{4 Operations}
- {!Pfs_nn_fsys.allocate_blocks}: Allocate new or replacement blocks
for an inode
- {!Pfs_nn_fsys.free_blocks}: Free blocks of an inode
- {!Pfs_nn_fsys.get_blocks}: Get the blocks of an inode (i.e. get
where the blocks are stored)
{3 Example of block allocation}
Here, we allocate 10 blocks at block position 0 of the file. If there
......@@ -320,11 +358,310 @@ they cannot be immediately deallocated. Instead, the blocks enter a
special transitional state between "used" and "free". They are no longer
in the blocklist of the inode, but they cannot be reclaimed immediately
for other files. The blocks leave this special state when the last
transaction finishes that accesses these blocks.
transaction finishes accessing these blocks.
{3 What about EOF?}
As you see, the client extends files block by block. Well, not every
file has a length that is a multiple of the block size. How is that
solved?
There is an EOF position in the {!Pfs_types.inodeinfo} struct. This is
simply a 64 bit number. The convention is now that clients update and
respect this EOF position, i.e. when more data is appended to the file,
the EOF position is moved to the position where the data ends logically,
and readers discard the part of the last block that is beyond EOF.
Note that this is only a convention - it is not enforced. This means
we can have files whose EOF position is unrelated to where the last
block is (i.e. a position before or after this block). In such cases,
clients should give EOF precedence, and treat non-existing blocks as
null blocks.
{3 The [Datanode] program}
There is also some help for finding out what the last block is.
In {!Pfs_types.inodeinfo} the field [blocklimit] is the "block
EOF", i.e. the file has only blocks with index 0 to
[blocklimit-1]. This field is automatically maintained with
every block allocation or deallocation.
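For illustration, a reader could combine the EOF position and
[blocklimit] roughly as follows (a sketch only - the field name [eof]
is an assumption, and [blocksize] is the block size of the filesystem):
{[
ii = get_inodeinfo(tid, inode);
/* Blocks with index >= ii.blocklimit do not exist and are treated
   as null blocks. For a block with index k, the number of bytes
   before EOF is: */
valid = min(blocksize, ii.eof - k * blocksize);
/* valid <= 0 means the whole block is beyond EOF */
]}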
{3 The ticket system}
{2 The [Datanode] program}
{3 Reading data in a non-transactional way}
The {!Pfs_datanode.datanode} program provides the operations for
reading and writing blocks. The program supports several transport
methods - right now an inline method (called [DNCH_RPC]) and a method
where the block data are exchanged over POSIX shared memory (called
[DNCH_SHM]). See the next section for how to use shared memory -
let us first focus on [DNCH_RPC].
The port the [Datanode] program listens on is returned in the field
[node] of the {!Pfs_types.blockinfo} struct, so there is no additional
communication overhead for finding this out.
For writing data, one has to call {!Pfs_datanode.write} as in
{[
write(block, DNCH_RPC(data), ticket_id, ticket_verifier)
]}
Here, [block] is the block number, and the ticket numbers come
from the {!Pfs_types.blockinfo} struct (as returned by
{!Pfs_nn_fsys.allocate_blocks}). The notation [DNCH_RPC(data)] means
that the [DNCH_RPC] branch of the union is selected,
and [data] is the string field put into this branch. For writes,
this string must be exactly as long as one block.
For reading data, one has to call {!Pfs_datanode.read} as in
{[
r = read(block, DNCH_RPC, pos, len, ticket_id, ticket_verifier)
]}
Again, [block] is the block number. Because [read] supports partial
block reads, one can also pass a position and length within the
block ([pos] and [len], resp.). The ticket information comes from
{!Pfs_nn_fsys.get_blocks}. (Right now it is not evaluated - read
access is always granted.)
The [read] operation returns a value like [DNCH_RPC(data)].
The [Datanode] program can handle multiple requests simultaneously.
This means clients can send several requests in one go, without
having to wait for responses. To save resources, clients should avoid
creating several TCP connections to the same datanode.
The [Datanode] program tries to be as responsive to clients as
possible. This especially means all requests sent to it are
immediately interpreted, although this usually means they are only
buffered up until the I/O operation can actually be done. Right now,
there is even no upper limit for this buffering. Clients should try
not to send too many requests to [Datanode] at once, i.e. before the
responses arrive. A good scheme for a client is to limit the number of
outstanding requests, e.g. to 10. It is likely that the protocol
will be refined at some point, and a real limit will be set. Clients
can then query this limit.
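A simple client-side scheme along these lines (a sketch, with an
assumed limit of 10 outstanding requests):
{[
max_outstanding = 10;     /* assumed limit, not enforced by the server */
outstanding = 0;
while (<blocks left to read>) {
  while (outstanding < max_outstanding && <blocks left to submit>) {
    <submit the next read request>;
    outstanding = outstanding + 1;
  }
  <wait for one response and process it>;
  outstanding = outstanding - 1;
}
]}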
{4 Operations}
- {!Pfs_datanode.read}: Read a block
- {!Pfs_datanode.write}: Write a block
- {!Pfs_datanode.copy}: Copy a block
- {!Pfs_datanode.zero}: Fill zero bytes into a block
- {!Pfs_datanode.sync}: Synchronize with disk
{3 Using [Datanode] with shared memory}
In the interfaces of {!Pfs_datanode.read} and {!Pfs_datanode.write} it
is easy to request shared memory transport. Just use [DNCH_SHM]
instead of [DNCH_RPC], and put a {!Pfs_types.dn_channel_shm_obj}
struct into it. This struct has fields for naming the shared memory
file, and for selecting the part of this file that is used for
data exchange. The [read] call will then put the read data there,
and [write] will expect the block to write there. Of course, the
datanode server must have permission to access the shared
memory file (it always requests read/write access).
There are two difficulties, though. First, it is required that the
client opens a connection to the Unix Domain socket the datanode
server provides - the TCP port will not work here. Second, it may be
difficult for the client to manage the lifetime of the shared memory
file. For both problems, the datanode server has special support
RPC's.
The function {!Pfs_datanode.udsocket_if_local} checks whether it is
called from the same node as the datanode server runs on, and if so,
the path of the Unix Domain socket is returned. This "same node check"
is done by comparing the socket addresses of both endpoints of the
incoming call - if both addresses have the same IP address, it is
concluded that this is only possible on the same node (i.e. the
criterion is whether [getsockname] and [getpeername] return the same
IP address). Clients should call {!Pfs_datanode.udsocket_if_local}
on the usual TCP port of the datanode when they want to use the
shared memory transport method, and if they get the name of the
Unix Domain socket, they should switch to this connection.
The function {!Pfs_datanode.alloc_shm_if_local} can be used to
allocate shared memory so that the lifetime is bound to the current
connection. This means this shared memory will be automatically
deallocated when the client closes the connection to the datanode
server. This is very practical, as there is no easy way to bind the
lifetime of shared memory to the lifetime of another system resource
(say, a process, or a file descriptor).
{4 Operations}
- {!Pfs_datanode.udsocket_if_local}: Determine Unix Domain socket
- {!Pfs_datanode.alloc_shm_if_local}: Allocate shared memory
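Putting the pieces together, a client running on the same node as the
datanode server might proceed as follows (a sketch; the exact fields of
{!Pfs_types.dn_channel_shm_obj} are not spelled out here):
{[
path = udsocket_if_local();     /* called over the normal TCP port */
if (path == NULL) <fall back to DNCH_RPC over TCP>;
<connect to the Unix Domain socket at path>;

shm = alloc_shm_if_local();     /* lifetime is bound to this connection */
<resize and map shm; choose a window of one block length>;

/* shm_obj names the shared memory file and the chosen window: */
write(block, DNCH_SHM(shm_obj), ticket_id, ticket_verifier);
r = read(block, DNCH_SHM(shm_obj), pos, len, ticket_id, ticket_verifier);
]}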
{4 Speed of shared memory transport}
The overhead of the shared memory transport is quite low - in particular,
copying the data blocks on the path to or from the disk can be
avoided. This allows local I/O at full disk speed.
However, one should also realize that the latency of the data path is
higher than when directly writing to a file. To compensate for that
it is suggested to submit several requests at once to the datanode
server, and to keep the server busy.
For getting maximum performance one should avoid using shared
memory buffers that are not page-aligned. Also, one should avoid
writing directly to a buffer. It is better to get a file descriptor
for the buffer and to write to it via the Unix [write] system call.
The kernel can then play nice tricks with the page table to fully
avoid data copying.
In comparison to accessing local files, there is of course also the
overhead of the central namenode. For reading or writing large files
this overhead can be neglected.
All in all, the performance of the shared memory transport is good
enough that Plasma MapReduce stores all data files, even local ones,
in the distributed filesystem.
{2 The ticket system}
The tickets are created by the coordinator, but are checked by the
datanodes. This means there must be some additional communication
between the coordinator and the datanodes.
Before going into detail, let us first explain why we need this. Of
course, at some point PlasmaFS will allow restricting access to data,
and the ticket system is a way to extend the scope of an authorization
system to loosely attached subsystems like datanodes. Right now,
however, there are no access restrictions - every client can read and
write everything. It is nevertheless useful to already implement the
ticket system, at least for write access. The ticket system helps us
to enforce the integrity condition that blocks can only be written
from inside transactions. Clients could otherwise e.g. allocate
blocks, commit the transaction, and write the blocks later. This is
dangerous, however, because at this moment there is no guarantee that
the blocks are still associated with the file the client thinks they
are.
To avoid having the coordinator notify the datanodes
about the access permissions for each block separately, a
cryptographic scheme is used to lower the overhead. At transaction
start, the coordinator creates two numbers:
- [ticket_id]: This is the user-visible ID of the ticket
- [ticket_secret]: This is a random number
Both numbers are transmitted to each datanode. When the transaction is
finished, the datanode servers are notified that the tickets are
invalid. This communication path right now uses the normal datanode
port, and the coordinator calls the special
{!Pfs_dn_internal.datanode_ctrl} program on it.
There is also a timeout for every ticket - for catching the rare and
mostly hypothetical case that the datanode server is unreachable
at the moment the coordinator wants to revoke a ticket, but is
back up later.
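Schematically, the coordinator drives the ticket lifecycle with the
{!Pfs_dn_internal.datanode_ctrl} procedures (documented further below;
[st_id] corresponds to [ticket_id], [st_secret] to [ticket_secret]):
{[
/* At transaction start, pushed to every datanode: */
safetrans(st_id, st_tmo, st_secret);

/* When the transaction finishes: */
cancel_safetrans(st_id);

/* At coordinator startup, all old tickets are revoked: */
reset_all_safetrans();
]}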
The {!Pfs_datanode.read} and {!Pfs_datanode.write} calls now do not
take the [ticket_secret] as arguments, but a [ticket_verifier]. The
verifier is a message authentication code (MAC) for the combination of
[ticket_id], [block] number, and the permission (read-only or
read-write). This means the verifier is a cryptographic hash computed
from [ticket_id], [block], permission, and [ticket_secret]. Clients
cannot compute it because they do not know the secret.
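In other words, the verifier is computed roughly like this (a sketch;
the exact input encoding and the hash function are not specified here):
{[
ticket_verifier = MAC(ticket_secret, ticket_id ++ block ++ permission)
]}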
{2 Reading data in a non-transactional way}
Especially designed for the NFS bridge, there is a way of reading data
blocks outside transactions. Right now, this method does not require
any additional privilege, but in the future this will be changed,
because the client needs to be trusted.
It is possible to pass 0 for the [ticket_id] and [ticket_verifier] in
the {!Pfs_datanode.read} call. This returns the contents of the data
block outside a transaction. However, there is absolutely no guarantee
that this block is still entered into the blocklist of the file
when the [read] call responds.
The block can be used for something else when other transactions
change the file, free the block, and allocate it for a different
file. Such modifications can be done quite quickly - sometimes faster
than reading a single data block, so this is a real issue.
Because of this, the client has to check after [read] whether the
block still stores the data the client expects to be there. This check
can be done in an expensive manner by opening a new transaction,
issuing a {!Pfs_nn_fsys.get_blocks} request, and checking whether the
read block is still used at the same place in the file. If the block
is no longer there, the client has to start over, or to fall back to
the normal method of reading this block from inside a transaction.
There is a possible optimization that makes the non-transactional
method attractive: the check whether the block is still allocated
for the same file, and at the same position in the file, can be sped up. In
order to help here, the coordinator provides a special service called
the {i inode cache}. The inode cache is provided on an extra port and
is implemented by a special server program. The RPC program the client
has to call is {!Pfs_nn_inodecache.inodecache}.
The inode cache port can be determined by invoking
{!Pfs_nn_coord.find_inodecaches} on the normal coordinator port.
As already explained earlier, the coordinator maintains a sequence
number for every inode that is automatically increased when new blocks
are allocated or blocks are freed. The sequence number gives us a
quick criterion whether blocks have changed. Of course, it is possible
that block allocations have taken place that did not affect the block
we are interested in. However, we consider this as a case that does
not happen frequently and that we are not optimizing for. We only
interpret the case that the sequence number is still identical after
the [read], because then we can conclude that the block is valid.
The inode cache now provides a way to quickly check exactly that.
It has a function {!Pfs_nn_inodecache.is_up_to_date_seqno} that in many
cases immediately knows whether the sequence number is still the same.
Callers should be prepared, though, that this function sometimes
returns [false] even when the sequence number has not changed (because
this piece of information is not in the cache).
So, a client can read the blocks of a file by doing:
{[
/* Phase 1. Get the blocks */
begin_transaction(tid);
bl = get_blocks(tid, inode, index, len);
commit_transaction(tid);
/* Phase 2. Read the blocks */
data[0] = read(bl[0]);
if ( not is_up_to_date_seqno(bl[0].seqno)) <fall_back_to_alternate_method>;
data[1] = read(bl[1]);
if ( not is_up_to_date_seqno(bl[1].seqno)) <fall_back_to_alternate_method>;
...
data[n-1] = read(bl[n-1]);
if ( not is_up_to_date_seqno(bl[n-1].seqno)) <fall_back_to_alternate_method>;
]}
There are a few more things a client can do: First, it can cache the
[blocklist] array for a file. This reduces the frequency of the
expensive [get_blocks] call, especially if only one or a few blocks
are read every time this routine is called. In the
"<fall_back_to_alternate_method>" case, the blocklist would be deleted
from the blocklist cache.
Second, there is the possibility of omitting the [is_up_to_date_seqno]
calls for all blocks read in a sequence except the last. The last call
is the important one - it confirms that all blocks are valid up to this
point in time. The downside is that the risk becomes higher that
blocks are read in vain, i.e. blocks are read that turn out not to be
up to date.
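A sketch of this variant (same notation as the example above):
{[
data[0]   = read(bl[0]);
data[1]   = read(bl[1]);
...
data[n-1] = read(bl[n-1]);
/* one check at the end approves the whole sequence: */
if ( not is_up_to_date_seqno(bl[n-1].seqno)) <fall_back_to_alternate_method>;
]}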
For the NFS bridge this non-transactional method of reading files
turned out to be a relevant optimization. This is mainly the case
because the NFS protocol does not know the concept of a transaction,
and because NFS requests each block individually. There is
practically no chance to do several reads in one go, and no chance
for getting a natural speedup by combining several reads in a single
transaction. The non-transactional read method helps because we often
only need one extra [is_up_to_date_seqno] after every [read], and
[is_up_to_date_seqno] is very cheap.
/* $Id$ -*- c -*- */
/** {1:datanode [Datanode]} */
/** Datanode access.
*/
#include "pfs_types.x"
#ifndef PFS_DATANODE_X
......@@ -7,63 +12,105 @@
program Datanode {
version V1 {
/** {2 [null] } */
void null(void) = 0;
/** {2 [identity] } */
longstring identity(longstring) = 1;
/* Returns the identity of this node (an ID which is assigned anew
/** Returns the identity of this node (an ID which is assigned anew
when the datanode is initialized). The arg is the clustername.
If the node belongs to the wrong cluster, this RPC must return
SYSTEM_ERR.
[SYSTEM_ERR].
*/
/** {2 [size] } */
hyper size(void) = 2;
/* Returns the number of blocks. The blocks have numbers from 0
to size-1
/** Returns the number of blocks. The blocks have numbers from 0
to [size-1]
*/
/** {2 [blocksize] } */
int blocksize(void) = 3;
/* Returns the blocksize */
/** Returns the blocksize */
/** {2 [clustername] } */
longstring clustername(void) = 4;
/* Returns the clustername */
/** Returns the clustername */
/** {2:read [read] } */
dn_channel_rd_data read
(dn_channel_rd_req, hyper, int, int, hyper, hyper) = 5;
/* Reads a block, or a part of it:
read(req, block, pos, len, st_id, st_vfy)
/** [read(req, block, pos, len, st_id, st_vfy)]:
Reads a block, or a part of it. [req] defines how the data
is passed back to the caller (see docs for [dn_channel_rd_req]
in {!Pfs_types}). The [block] is the block number of this
datanode. [pos] and [len] select a substring of this block.
[st_id] and [st_vfy] are the safetrans ticket returned by
[get_blocks].
Right now this ticket is not checked.
*/
/** {2:write [write] } */
void write(hyper, dn_channel_wr_data, hyper, hyper) = 6;
/* Writes a block. It is only possible to write a block completely.
Call: write(block, contents, st_id, st_vfy): Writes [contents]
to [block]. [contents] must have the length [blocksize].
[st_id] is the safetrans ID, and [st_vfy] is the verifier
as returned by the namenode.
/** [write(block, contents, st_id, st_vfy)]:
Writes a block. It is only possible to write a block completely.
The [block] is the block number of this datanode.
In [contents] the data to write is passed. (See the docs
for [dn_channel_wr_data] in {!Pfs_types} for details.)
The data in [contents] must have the length [blocksize].
[st_id] is the safetrans ID, and [st_vfy] is the verifier
as returned by the namenode.
The safetrans ticket {i is} checked!
*/
/** {2:copy [copy] } */
void copy(hyper, longstring, longstring, hyper,
hyper, hyper, hyper, hyper) = 7;
/* Copies a block, possibly to a remote system:
copy(block, dest_node, dest_identity, dest_block, st_id, st_vfy,
dest_st_id, dest_st_vfy).
If dest_identity is equal to the own identity, this is a local
copy. Otherwise, dest_node is interpreted as "host:port", and the
/** [copy(block, dest_node, dest_identity, dest_block, st_id, st_vfy,
dest_st_id, dest_st_vfy)]:
Copies a block, possibly to a remote system. [block] identifies
the block on this datanode. [dest_node] is the datanode server
the block is written to. [dest_identity] is the
identity of the destination server. [dest_block] is the
block number of the destination server.
If [dest_identity] is equal to the own identity, this is a local
copy. Otherwise, [dest_node] is interpreted as "host:port", and the
block is written to the remote host.
st_id, st_vfy: as in [read]
dest_st_id, dest_st_vfy: as in [write]
[st_id], [st_vfy]: as in [read]
[dest_st_id], [dest_st_vfy]: as in [write]
*/
/** {2:zero [zero] } */
void zero(hyper, hyper, hyper) = 8;
/* Fills a block with zeros: block, st_id, st_vfy */
/** [zero(block, st_id, st_vfy)]:
Fills a block with zeros
*/
/** {2:sync [sync] } */
void sync(void) = 9;
/* Waits until the next sync is done */
/** Waits until the next sync cycle is done */
/** {2:alloc_shm_if_local [alloc_shm_if_local] } */
longstring_opt alloc_shm_if_local(void) = 10;
/* If the client is on the same node, this RPC allocates a new
/** If the client is on the same node, this RPC allocates a new
POSIX shm object, and returns the path of this object.
The object has zero size, and is created with mode 666.
If the client is not on the same node, the RPC returns NULL.
......@@ -75,8 +122,10 @@ program Datanode {
insecure.
*/
/** {2:udsocket_if_local [udsocket_if_local] } */
longstring_opt udsocket_if_local(void) = 11;
/* If the client is on the same node, this RPC may return the
/** If the client is on the same node, this RPC may return the
name of a Unix Domain socket to contact instead.
*/
......
/* $Id$ */
/* $Id$ -*- c -*- */
/** Internal stuff.
*/
#ifndef PFS_DN_INTERNAL_X
#define PFS_DN_INTERNAL_X
#include "pfs_types.x"
/* The Datanode_ctrl program is only invoked by the namenode */
/** {1:datanode_ctrl [Datanode_ctrl]} */
/** The [Datanode_ctrl] program is running on each datanode, but
only invoked by the coordinator to push and revoke safetrans tickets.
*/
program Datanode_ctrl {
version V1 {
/** {2 [null] } */
void null(void) = 0;
/** {2 [reset_all_safetrans] } */
void reset_all_safetrans(void) = 1;
/* Reset all safetrans */
/** Revokes all safetrans tickets. This is called when the coordinator
starts up.
*/
/** {2 [cancel_safetrans] } */
void cancel_safetrans(hyper) = 2;
/* Cancel the safetrans of this st_id */
/** Cancel the safetrans ticket with this [st_id] */
/** {2 [safetrans] } */
void safetrans(hyper, hyper, hyper) = 3;
/* safetrans(st_id, st_tmo, st_secret) */
/** [safetrans(st_id, st_tmo, st_secret)]: Enables all safetrans
tickets with ID [st_id]. The secret [st_secret] is used
for securing the ticket system.
*/
} = 1;
} = 0x8000d002;
/* Datanode_io is internally used by the datanode: */
/** {1:datanode_io [Datanode_io]} */
/** The [Datanode_io] program is running in the I/O processes of the
datanodes
*/
program Datanode_io {
version V1 {
/** {2 [null] } */
void null(void) = 0;
/** {2 [read] } */