EnduraData EDpCloud Cross Platform File Replication and Transfer Solutions(eddist.cfg)

EnduraData XML Configuration

eddist.cfg


NAME

eddist.cfg - Configuration file for EDpCloud's File Synchronization, Content Distribution and file replication Suite.

SYNOPSIS

EDpCloud uses eddist.cfg as its main configuration file. EDpCloud allows users to sync files and folders (unstructured data) to one or more remote locations. It can be installed on physical or virtual machines and can be used to synchronize data, as a backup solution (data protection and data recovery), and to share data between multiple servers and geographic locations. eddist.cfg can be tailored for granular control of your file and folder synchronization. This document explains how to do that.

The eddist.cfg file is an XML configuration file for EnduraData's File sync, file replication and Content Distribution Suite. The same syntax is also used for the EnduraData Remote Directory Synchronization Suite. The configuration can be edited using EnduraData's GUI or a text editor such as vi, notepad, etc. The elements and keywords that constitute the file sync configuration are discussed here. Certain elements are required, others are optional.

The configuration file allows the user to define what content to synchronize, replicate and distribute, where to send it and where to store it. A configuration is a logical way to organize and manage content distribution and replication between multiple systems and geographic regions, and it can be as simple or as complex as you want it to be. You can create configurations for localhost to localhost, localhost to one or more remote hosts, many remote hosts to a single host or many remote hosts to many remote hosts.

You will find a few examples in this document to help you get started. The following sections serve as a tutorial for configuring EDpCloud for data recovery, data protection, distributing data between business processes and syncing files between systems and geographic sites.

The etc directory contains a few examples that illustrate the syntax of the configuration file.

APPLYING THE CONFIGURATION

Once you have created a configuration you need to:

a. Verify that the configuration is valid, by executing edverify
b. Copy the configuration to $ED_BASE_DIR/etc/eddist.cfg on all nodes.
c. For Windows users, execute edrestartall
d. For Linux/Mac/Unix users execute edpcloud.sh startall

Take a look in the bin directory and you will find a wealth of commands that can be used.

note: $ED_BASE_DIR is where you installed EDpCloud. For Unix, please use /usr/local/enduradata/edpcloud. For Windows, please use c:\enduradata\edpcloud

SECURITY WARNING

$ED_BASE_DIR must be readable and writable only by the owner because it contains sensitive information.

TERMS AND DEFINITIONS

EnduraData's File synchronization, replication and Content Distribution Suite uses a few terms that we will define here:

configuration: A logical organization of sync and content distribution network. An Enterprise may have one or more configurations. Each configuration has (a) one or more senders (b) one or more receivers and (c) one or more links. A configuration is identified by its name.

sender: A sender (source) is a node or a device that sends data to one or more servers. The sender has a lot of parameters that we will define in subsequent sections. A sender is identified by its hostname and by the name of the parent link.

receiver: A receiver (destination, target) is a node that receives data from one or more senders. The receiver has a lot of parameters that we will define in subsequent sections. A receiver is identified by its hostname and by the name of the parent link. Duplicate receivers are not allowed in the same link. See counter examples for information. The maximum number of receivers depends on the version of the software you licensed. The personal and home editions are limited to one receiver only.

The sender and receiver host names must be resolvable using DNS, NIS or the hosts files. If this cannot be done, you may use IP addresses or create aliases under the etc directory.

link: A link is a logical grouping of one and only one data sender (source) and one or multiple data receivers (remotes, targets, destinations). All receivers in a link have one common sender. The link has a name which is unique across the configuration. A link is identified by its name. A configuration has one or multiple links. The maximum number of file replication links depends on the version of the software you licensed. The personal and home editions are limited to one link only.

A sender can belong to one or more links. A receiver can also belong to one or more links.

Senders and receivers can be identified with hostnames, IP or fully qualified names. A sender must be able to connect to a receiver and a receiver must be able to connect to a sender.

Example:

link1: sender1 sends data to receiver1
link2: sender1 sends data to receiver2
link3: sender1 sends data to receiver1, receiver2, receiver3

Please remember:

You can use different links to control what is sent, what is accepted, where it is sent and when it is sent.

FILE SYNC CONFIGURATION FORMAT AND SYNTAX

Each section in eddist.cfg inherits its defaults from the parent nodes (except for the name and hname). Therefore you can define some global parameters and they will be shared by all sections of your file synchronization configuration. You can override the global parameters by specifying new ones inside the new sections. Sibling parameters are not inherited, but parent parameters are inherited.

The configuration parameters use the format: paramname="value".

The following is an example of a simple configuration:
 
	<?xml version="1.0" encoding="UTF-8"?>
	<config name="simple" password="linkpassword">
	  <link name="tokyo">
	    <sender hostname="tokyo.enduradata.com" />
	    <receiver hostname="london.enduradata.com"
	              storepath="/home/reports/london" />
	  </link>
	</config>

In this example, the configuration's name is simple. The configuration has only one link, identified by name="tokyo". The link has one sender identified by hostname="tokyo.enduradata.com" and one receiver identified by hostname="london.enduradata.com". Notice the other required parameter: storepath. This parameter is described in a later section. The tokyo link synchronizes data from tokyo.enduradata.com to london.enduradata.com. The synchronized data on london.enduradata.com will be stored under /home/reports/london. Make sure the hostnames can be resolved; otherwise use IP addresses. Also please make sure you can perform reverse lookup. If you cannot perform reverse lookup, then read about myaliases in the doc directory.
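The structure and the inheritance rule described above can be explored with a short Python sketch. This is an illustration only, not part of EDpCloud: effective_params is a hypothetical helper, and treating "hostname" like "hname" (not inherited) is an assumption.

```python
import xml.etree.ElementTree as ET

# Same content as the simple configuration shown above.
SAMPLE = """<config name="simple" password="linkpassword">
  <link name="tokyo">
    <sender hostname="tokyo.enduradata.com" />
    <receiver hostname="london.enduradata.com"
              storepath="/home/reports/london" />
  </link>
</config>"""

# Attributes that identify a section and are therefore not inherited.
# "hostname" is listed on the assumption that it behaves like "hname".
NOT_INHERITED = {"name", "hname", "hostname"}


def effective_params(*sections):
    """Merge attributes from the outermost section to the innermost one."""
    merged = {}
    for section in sections[:-1]:
        merged.update({k: v for k, v in section.attrib.items()
                       if k not in NOT_INHERITED})
    merged.update(sections[-1].attrib)  # the section's own values win
    return merged


config = ET.fromstring(SAMPLE)
for link in config.findall("link"):
    sender = link.find("sender")
    for receiver in link.findall("receiver"):
        params = effective_params(config, link, receiver)
        print(link.get("name"), sender.get("hostname"), "->", params)
```

In this sketch the receiver ends up with the password defined at the config level, together with its own hostname and storepath.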

name="link_name": This is a required parameter for the config and link sections. The value assigned to name designates the name of the link or of the configuration. Link names must start with an alphabetic character. The name all is reserved and should not be used as a link name. The valid characters in a link name are only in the sets: [a-z][0-9].
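The link naming rules can be checked before a configuration is deployed. The following Python sketch is hypothetical (is_valid_link_name is not an EDpCloud command) and assumes that "[a-z][0-9]" means lowercase letters and digits only, with a letter in first position.

```python
import re

# Assumption: a lowercase letter first, then lowercase letters or digits.
LINK_NAME = re.compile(r"[a-z][a-z0-9]*")


def is_valid_link_name(name):
    """Return True if name follows the link naming rules described above."""
    return name != "all" and LINK_NAME.fullmatch(name) is not None


for candidate in ("tokyo", "o2chicago", "all", "2fast"):
    print(candidate, is_valid_link_name(candidate))
```

edverify remains the authoritative check for a configuration; this sketch only mirrors the rules stated in the text.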

Sender and receiver host names

hostname="hostname_or_ip" or hname="hostname": This is a required parameter for both the source and the receiver. It is critical that you use the fully qualified hostname. To be certain, use the name returned by the hostname command. Please see the alias parameter for additional information. With IPv6, addresses sometimes have to be replaced with hostnames, and sometimes the network interface must be specified. Therefore, if you run into problems with IPv6, try using hostnames rather than addresses. This is valid for all replication utilities and configurations.

Additional aliases for localhost can be put in $ED_BASE_DIR/etc/eddist.cfg

Filtering based on file and directory name patterns.

You can use regular expressions to filter what is sent from the source and what is accepted by the target. Simple regular expressions can be given directly with the include parameter. But we recommend putting a list of regular expressions in includes (what to include) and in excludes (what to exclude).

include="regular expression": This is a regular expression that lists what patterns are included in the files and folders synchronized from tokyo.enduradata.com to london.enduradata.com. Readers not familiar with regular expressions will need to read a regular expressions tutorial. The default for this parameter is everything or ".*". If no include parameter is used, the includes file in etc is used.

The following are some examples of some regular expressions:

^/home/.*: means everything under /home
(^/home/.*)|(^/data/.*): means everything under /home or everything under /data.

By default EDpCloud uses the patterns in $edpcloud/etc/includes.

include="regular_expressions_to_include": This is a required parameter for both the source and the receiver. This is a POSIX regular expression. The include parameter has expressions that will have to be matched by file (directory, link, ...) names in order to be included in the list to be synchronized.

exclude="regular_expressions_to_exclude": This is a required parameter for both the source and the receiver. This is a POSIX regular expression. The exclude pattern has a list of name patterns to be excluded from synchronization. The default exclude value is nothing.

The data distribution suite examines both the include and the exclude regular expressions. Data is sent by the local node and accepted by the remote node only if it is matched by the include pattern locally and remotely, and is not matched by the exclude patterns of either the local or the remote system.
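The combined rule can be sketched in Python as follows. This is an illustration, not EDpCloud's code: is_replicated is a hypothetical helper, Python's re module is not a POSIX regex engine (the simple patterns used here behave the same way), and unanchored "search" matching is an assumption.

```python
import re


def matches(pattern, path):
    """True if a pattern is set and found in path (search semantics assumed)."""
    return bool(pattern) and re.search(pattern, path) is not None


def is_replicated(path, local, remote):
    """local and remote are dicts with optional 'include' and 'exclude' keys."""
    for side in (local, remote):
        include = side.get("include", ".*")  # default: everything
        exclude = side.get("exclude")        # default: nothing
        if not matches(include, path) or matches(exclude, path):
            return False
    return True


sender = {"include": r"(^/home/.*)|(^/data/.*)", "exclude": r"\.tmp$"}
receiver = {"include": r"^/home/.*"}

print(is_replicated("/home/a/report.txt", sender, receiver))  # both sides include it
print(is_replicated("/data/prices.csv", sender, receiver))    # receiver does not include /data
print(is_replicated("/home/a/scratch.tmp", sender, receiver)) # sender excludes *.tmp
```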

incfile="filename_containing_patterns_to_include": This is the name of a file that has a list of regular expressions to be ORed and included in the synchronization. Use this if you have a huge list of expressions or if a workflow application generates a list of patterns or file names for you. There is one expression per line. All lines are ORed using '|' to form a larger expression. Use incfile for complex or very long expressions. Please note that the file name is relative to the etc directory. Do not use absolute file names.

Example incfile content:

  ^/home/.*
  ^/usr/.*
  ^/disk1/daily/.*|^/disk1/pricing/3pm/.*|/disk2/pricing/4pm/.*
  ^/govsecurities/3mo/.*
  ^/libor/banque_de_paris/.*
  ^/libor/banque_de_lyon/.*
  ^/libor/ustreas_notes/.*
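How the lines of such a file combine into one expression can be sketched in Python. pattern_from_lines is a hypothetical helper written for illustration; skipping blank lines is an assumption.

```python
import re


def pattern_from_lines(lines):
    """OR the non-empty lines of an incfile or exfile into one expression."""
    parts = [line.strip() for line in lines if line.strip()]
    return "|".join(parts)


incfile_lines = [
    "^/home/.*",
    "^/usr/.*",
    "^/govsecurities/3mo/.*",
]

combined = pattern_from_lines(incfile_lines)
print(combined)
print(bool(re.search(combined, "/usr/lib/report.txt")))
print(bool(re.search(combined, "/var/log/messages")))
```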

exfile="excludefile": This is the name of a file that has a list of regular expressions to be ORed for exclusion. Use this if you have a huge list of expressions or if a work flow application generates a list of patterns or file names for you. exfile has the same format as incfile. EDpCloud comes with exclude patterns in a file called excludes. If at first your system is not replicating some files then examine the patterns installed by default in $edpcloud/etc/excludes. Please note that the file is a relative file name within etc directory. Do not use absolute file names.

filterlinks="0|1": This parameter if set to 1, instructs the system to apply includes/excludes to symbolic links. The default is 0.

Redoing excludes in the sender side

redoexcludes="0|1": When set, the sender reprocesses the include and exclude regular expressions a second time (in case administrators changed the includes/excludes after a job was launched). The default is 0. Setting it to 1 adds to the run time. This parameter must be set in the receiver section.

Destination path where data will be stored.

storepath="destination directory name": This is the top level directory where anything received will be stored. In the previous configuration example, any data sent from /home will be stored under /home/reports/london/home. The receiver will not work unless this directory already exists, so create it before you start sending content to the receiving node, unless you set up an environment variable called ED_CREATE_STOREPATH. storepath and rootpath are interchangeable. storepath may include macro patterns that are resolved by the receiving nodes. Patterns are enclosed within percent signs. The supported macros are listed under "Using macros with the destination path". They are supported in the enterprise versions of the software but not in the personal or home editions. Macro names are case sensitive and are resolved at run time on the receiver side.

mustexist="top directory name": This is the top level directory that must exist on the receiver. This optional parameter is useful when using removable media such as USB drives. It avoids filling the root file system if a drive has been removed while the option to force the creation of the storepath is on (via the createpath file in etc). Replication will fail if the top directory does not exist.

Using openssl certificates for authentication

EDpCloud can use SSL certificates throughout if the ED_USE_CERT environment variable or the registry variable usecerts is set to Y or 1, or if the ED_BASE_DIR/etc/always_use_ssl_certs file is present (used for defaults after upgrades). You can also use SSL certs only for the remote transport if usesslcert="1" is set in eddist.cfg.

It is advised to test without SSL first. eddist.log, ed_receiver*log and ed_sender*log will have useful log information in case of failure.

When using SSL certificates, edpcloud expects them under ED_BASE_DIR/etc/certs. The following certificates information files are expected:

edpcloud/etc/certs/ca/ca_cert.pem : The certificate authority chain directory.
edpcloud/etc/certs/public/public_cert_hostname.pem : The public key for hostname (hostname can be an IP). For IPV6, replace ":" and "%" with "."
edpcloud/etc/certs/private/private_cert_hostname.pem : The private key for hostname (hostname can be an IPv4 or IPv6 address but for IPv6 hostnames work better). For IPV6, replace ":" and "%" with "."

If private_cert.pem or public_cert.pem (without a hostname or IP) exists, it will be used when the cert file for the hostname/IP is not found. But it is advisable to label the certs with the hostname or IP.

The ca_cert.pem must be common between the sender and the receivers.
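The naming rules above can be sketched in Python. cert_label and cert_paths are hypothetical helpers written for illustration; they only build the file names described in this section and do not check that the files exist.

```python
import posixpath


def cert_label(host):
    """Label used in certificate file names; ':' and '%' become '.'."""
    return host.replace(":", ".").replace("%", ".")


def cert_paths(base_dir, host):
    """Return the certificate file names expected for host under base_dir."""
    certs = posixpath.join(base_dir, "etc", "certs")
    label = cert_label(host)
    return {
        "ca": posixpath.join(certs, "ca", "ca_cert.pem"),
        "public": posixpath.join(certs, "public", f"public_cert_{label}.pem"),
        "private": posixpath.join(certs, "private", f"private_cert_{label}.pem"),
    }


base = "/usr/local/enduradata/edpcloud"
for kind, path in cert_paths(base, "fe80::b82f:c958:7e25:ffa8%eno1").items():
    print(kind, path)
```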

Example of using ssl certificates

EDpCloud will look for the certificates under the following directories. The file names may include the host name or IP as follows (the name or IP can be omitted):

/usr/local/enduradata/edpcloud/etc/certs/ca/ca_cert.pem
/usr/local/enduradata/edpcloud/etc/certs/public/public_cert_10.0.200.1.pem
/usr/local/enduradata/edpcloud/etc/certs/public/public_cert_10.0.200.2.pem
/usr/local/enduradata/edpcloud/etc/certs/public/public_cert_10.0.200.3.pem
/usr/local/enduradata/edpcloud/etc/certs/public/public_cert_www.example.com.pem
/usr/local/enduradata/edpcloud/etc/certs/private/private_cert_10.0.200.1.pem

If the DNS responses are slow, efficiency can be improved by adding the hostnames to the hosts files.

Removing patterns from file names and directories

Before storing a file or directory, the original file name can be changed using the following keywords:

strip="path to strip from file and link names|path2|...": This is an optional parameter. If specified, the given path or paths are removed from the original file name. Multiple paths can be separated by a pipe (|) delimiter.

For example if strip="/backups", then "/backups/home/jj/interference/signals.txt" becomes "/home/jj/interference/signals.txt".

For example, if strip="c:\d1|c:\d2", then both c:\d1 and c:\d2 will be removed from the destination path. If nothing is left in the path, an error is raised.

The final file sync storage name will be also modified by the value of storepath.

This parameter is useful when restoring data from a remote site previously stored in a storepath other than the ROOT directory.

If you use strip, and you are experiencing sync problems, re-examine your include/exclude regular expressions.

stripdir="1": This is a dangerous parameter if storepath is "/" under Unix. It strips the entire directory path from the original file name and piles the files in storepath. So if you are replicating /dir1/f1 and /dir2/f1, one of them will be overwritten. Some users rely on this feature to extract data from filers and instruments; for everyone else it is mostly a hazard. When stripdir is set to 1, only the file name is preserved without the leading path. So if your storepath is something like /incoming and your remote file is /data/pricing/cmo.xls, the new file will be stored in /incoming/cmo.xls.
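The effect of strip, stripdir and storepath on the stored name can be sketched in Python. destination_path is a hypothetical helper, not EDpCloud's implementation. It assumes Unix-style paths, that strip removes a leading path only, and that the first matching strip entry is used.

```python
import posixpath


def destination_path(original, storepath, strip=None, stripdir=False):
    """Sketch of how strip, stripdir and storepath could shape the stored name."""
    name = original
    if stripdir:
        # Keep only the file name, without any leading directories.
        name = posixpath.basename(name)
    elif strip:
        for prefix in strip.split("|"):
            if name.startswith(prefix):
                name = name[len(prefix):]
                break
    if not name.strip("/"):
        raise ValueError("nothing left of the path after stripping")
    return posixpath.join(storepath, name.lstrip("/"))


print(destination_path("/backups/home/jj/interference/signals.txt", "/",
                       strip="/backups"))
print(destination_path("/data/pricing/cmo.xls", "/incoming", stripdir=True))
print(destination_path("/home/a.txt", "/home/reports/london"))
```

With stripdir, /dir1/f1 and /dir2/f1 both map to the same destination, which is the overwrite hazard described above.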

Using macros with the destination path

EDpCloud resolves the following macros at run time. These macros can be embedded as part of the destination directory (storepath). This can be useful when creating snapshots and when replicating multiple sources to a single destination.

%ip% : When found in a storepath, the receiver will substitute %ip% with the sender's IP.
%hostname% : When found in a storepath, the receiver will substitute %hostname% with the sender's hostname.
%home% : When found in a storepath, and if a user name (loginname parameter) is supplied in the sender entry of the link you are replicating over, the receiver will substitute %home% with that user's home directory, provided the user exists on the remote system. This macro applies only to Unix/Linux/Mac systems as of this writing. If you choose to use this pattern, it should be the first element in the storepath.
%sender%: This pattern is substituted with the hname of the sender entry.
%receiver%: This pattern is substituted with the hname of the receiver entry.
%link%: This pattern is substituted with the name of the link entry.
%weekday%: This pattern is substituted with current week day (sunday-monday).
%day%: This pattern is substituted with current day of the month.
%date%: This pattern is substituted with current date (yyyymmdd format).
%month%: This pattern is substituted with current month(1-12).
%year%: This pattern is substituted with current year.
%dayofyear%: This pattern is substituted with day of the year (1-365)
%hour%: This pattern is substituted with current hour(0-23)
%minute%: This pattern is substituted with current minute(0-59)
%second%: This pattern is substituted with current second(0-59).

Example of macros with storepath

storepath="%home%/backup/%ip%/%link%" will resolve to storepath="/home/ika/backup/10.0.19.10/atlanta" if your home dir was /home/ika, the sender's IP was 10.0.19.10 and your link name was "atlanta". The storepath must exist on UNIX, LINUX, AIX and MAC or replication will fail.
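Macro substitution can be sketched in Python as follows. expand_macros is a hypothetical helper and not EDpCloud's resolver. It assumes that numeric macros are written without zero padding and that unknown macro names are left unchanged; neither detail is confirmed by this document.

```python
import re
from datetime import datetime


def expand_macros(template, values, now):
    """Replace %name% macros; names are case sensitive, unknown ones are kept."""
    known = dict(values)
    known.update({
        "year": str(now.year),
        "month": str(now.month),              # 1-12, no zero padding assumed
        "day": str(now.day),
        "date": now.strftime("%Y%m%d"),       # yyyymmdd
        "hour": str(now.hour),
        "minute": str(now.minute),
        "second": str(now.second),
        "dayofyear": str(now.timetuple().tm_yday),
    })
    return re.sub(r"%(\w+)%",
                  lambda m: known.get(m.group(1), m.group(0)),
                  template)


when = datetime(2024, 3, 7, 14, 5, 9)
print(expand_macros("/backup/%link%/%year%/%month%/%day%",
                    {"link": "atlanta"}, when))
```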

Cloud providers and third party transport and file sync

Normally EDpCloud uses its own transport layer. But some customers may have requirements to use their own transport layer or their cloud partner transport layer. Please check the manual for ed_senderprovider for more details.

cloudprovider="cloudentryname" This is the entry from third party rclone section when replicating to google, amazon, wasabi and other cloud providers. Check rclone.conf.example in edpcloud etc directory. See ed_senderprovider manual page for more info about this keyword.

cloudparams="Additional parameters to pass to ed_senderprovider": This specifies what additional parameters to pass to third party cloud provider utilities when using cloudprovider="providername". This can also be specified using $edpcloud/etc/rc.params

Alternate paths and destinations

Although EDpCloud retries sending the data, alternate destinations are useful if the destination server cannot be reached.

alt="host1|host2|host3": Replicates data to host1, host2 or host3 if the connection to the receiver's hname fails. This parameter applies to the receiver only.

IPv6

eth="IPv6ethernetinterface": This parameter specifies the ethernet interface to use on Linux and other operating systems when using IPv6 link local or link global addresses.

 
	<?xml version="1.0" encoding="UTF-8"?>
	<config name="simple" password="0xAdD4D001:foo">
	  <link name="o2" password="foo" workers="2" eth="enp2s0">
	    <sender hostname="londonipv6" alias="10.0.200.125" />
	    <receiver hostname="londonipv6" minfileage="0" rowsize="2048"
	              storepath="/tmp/ipv6" />
	  </link>
	</config>

 
	<?xml version="1.0" encoding="UTF-8"?>
	<config name="simple" password="0xAdD4D001:foo">
	  <link name="ipv6" password="foo" workers="4" eth="eno1">
	    <sender hostname="fe80::b82f:c958:7e25:ffa8" />
	    <receiver hostname="fe80::b82f:c958:7e25:ffa8"
	              storepath="/tmp/ipv6" />
	  </link>
	</config>

Important notices

If your DNS or name resolution cannot resolve the names or recognize the IPv6 addresses or hostnames, add the names and IPv6 addresses to $ED_BASE_DIR/etc/eddist.cfg (one per line; see the myaliases doc).

Filtering on minimum and maximum file sizes

minfsize="minsizeinbytes": Files must be at least minsizeinbytes bytes in size.

maxfsize="maxsizeinbytes": Files must be at most maxsizeinbytes bytes in size.

Using both minfsize and maxfsize replicates only the files whose size is between minfsize and maxfsize bytes.

Both minfsize and maxfsize can be specified for each receiver.
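The size filter can be sketched in Python. size_allowed is a hypothetical helper; treating both limits as inclusive follows the wording "at least" and "maximum" above, and using None for "not set" is a choice made for this sketch.

```python
def size_allowed(size, minfsize=None, maxfsize=None):
    """Return True if a file of `size` bytes passes the size filter."""
    if minfsize is not None and size < minfsize:
        return False
    if maxfsize is not None and size > maxfsize:
        return False
    return True


# Only files between 1 KiB and 1 MiB pass when both limits are set.
for size in (512, 1024, 500_000, 2_000_000):
    print(size, size_allowed(size, minfsize=1024, maxfsize=1_048_576))
```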

Controlling how replication is done based on file size

alwayscopy="1|0": If set, this parameter tells edpcloud to always copy the entire file even if it did not change.

alwayscopylarge="minsize": If set to a value other than 0, this parameter tells edpcloud to always copy the entire file if its size is larger than minsize bytes.

alwayscopysmall="maxsize": If set to a value greater than 0, this parameter tells edpcloud to always copy the entire file if its size is less than maxsize bytes.
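The three parameters can be read as one decision, sketched here in Python. copy_whole_file is a hypothetical helper; the strict comparisons follow the words "larger than" and "less than" above, and a value of 0 is treated as "not set".

```python
def copy_whole_file(size, alwayscopy=0, alwayscopylarge=0, alwayscopysmall=0):
    """Return True if the whole file would be copied under these settings."""
    if alwayscopy:
        return True
    if alwayscopylarge and size > alwayscopylarge:
        return True
    if alwayscopysmall and size < alwayscopysmall:
        return True
    return False


print(copy_whole_file(10, alwayscopysmall=4096))           # small file
print(copy_whole_file(10_000_000, alwayscopylarge=1_000))  # large file
print(copy_whole_file(50_000))                             # nothing set
```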

Dealing with large files and with limited storage on the remote

Replicating large files requires up to 2*(largest file size)*(number of parallel transfers) of free space on the remote. This space is used to rebuild the file. To reduce this amount of space use usetemp keyword as follows.

usetemp="0|1": The default value is 1. If set, a temp file is created first and then renamed, so the original content is not clobbered unless the copy is successful. If not set when using alwayscopylarge or alwayscopy, the original file is directly overwritten, which is useful when storage space is very limited and the files are very large. Otherwise, it is better to have additional storage equal to 200 percent of the largest file that needs to be replicated, so that both the temp file and the original file can co-exist until the sync is done and the rename takes place.
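The free space figure quoted above, 2*(largest file size)*(number of parallel transfers), is easy to compute when planning a receiver. The Python sketch below only restates that formula; worst_case_free_space is a hypothetical name.

```python
def worst_case_free_space(largest_file_bytes, parallel_transfers):
    """Upper bound on free space needed on the remote, per the formula above."""
    return 2 * largest_file_bytes * parallel_transfers


GIB = 1024 ** 3
needed = worst_case_free_space(10 * GIB, 4)
print(needed // GIB, "GiB")  # 10 GiB files, 4 parallel transfers
```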

Dealing with large files and with continuous file modifications

When very large files are being modified in real time, or when any file is being modified continuously and rapidly, it may be useful to throttle replication to reduce bandwidth use. Otherwise, each time the file is replicated, the file system immediately detects that it changed again, marks it dirty and queues it for replication once more.

minfileage="doublevalue": The doublevalue is a minimum age in seconds. The file will not be replicated until doublevalue seconds have elapsed since its last modification. Example: with minfileage="0.001", files will not be replicated unless the last modification of the file was done at least 1 msec ago. Example: with minfileage="60", files will not be replicated unless the last modification of the file was done at least one minute ago.
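The age test can be sketched in Python. is_old_enough and file_is_old_enough are hypothetical helpers; using the file's modification time from os.path.getmtime is an assumption about how "last modification" is measured.

```python
import os
import time


def is_old_enough(mtime, minfileage, now):
    """True if at least minfileage seconds have passed since mtime."""
    return (now - mtime) >= minfileage


def file_is_old_enough(path, minfileage):
    """Apply the same test to a real file using its modification time."""
    return is_old_enough(os.path.getmtime(path), minfileage, time.time())


# A file modified 30 seconds ago is held back when minfileage is 60.
print(is_old_enough(mtime=1000.0, minfileage=60, now=1030.0))
print(is_old_enough(mtime=1000.0, minfileage=60, now=1060.0))
```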

File locking

Two parameters can be used to lock files before sending them. By default no file locking is used.

minlock="minutes": Attempt to lock a file before sending it. minlock indicates for how many minutes EDpCloud will keep trying to lock the file.

seclock="seconds": Attempt to lock a file before sending it. seclock indicates for how many seconds EDpCloud will keep trying to lock the file.

Please note that these parameters have a big impact on performance. When set, outside applications that check the file locking status will block until edpcloud releases the lock. It is important to know that these locks work only when external applications respect the file locking requests.

Server name aliases in the sender section

This is useful when creating a small configuration that will work on a large number of sources to one or more replication destinations. This is also useful if the incoming address is translated due to NAT addressing or firewall. For example a sender IP can be 1.2.3.4 but before it gets to the receiver, it gets translated into something different (like 5.6.7.8).

alias="aliasregularexpressions": This parameter indicates that the sender can also be recognized by the receiver under another alias. It is a way of letting the remote node accept connections from an entire domain or from IP patterns. This is a regular expression. Using this parameter, you can make a single configuration to use with many senders across an enterprise while referring to the sender's hostname as localhost only. The alias keyword is supported only in the enterprise edition of the software.

For example, suppose london's host name resolves to 10.0.200.125 from the hosts file but the incoming address is translated to 10.0.200.246. The connection is rejected because 10.0.200.246 is not authorised to send data to chicago.

Because we know that 10.0.200.246 is legitimate for incoming connections from london, we add it to the alias in the sender section. Please note that the passwords still have to match and the SSL certificates must pass authentication (if SSL certs are enabled).

	<link name="o2chicago" password="foo" workers="4" usesslcert="0">
	  <sender hostname="london" alias="10.0.200.246" />
	  <receiver hostname="chicago" archive="1"
	            storepath="/backup/london/backup/%year%/%month%/%day%"
	            archivedir="/backup/london/archive/%year%/%month%/%day%" />
	</link>
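The role of the alias in the NAT case can be sketched in Python. sender_accepted is a hypothetical helper, not the receiver's actual logic. It assumes that the alias expression must match the whole incoming address and it ignores the password and certificate checks that still apply.

```python
import re


def sender_accepted(incoming_ip, resolved_ip, alias=None):
    """True if the incoming address is the sender's address or matches its alias."""
    if incoming_ip == resolved_ip:
        return True
    return alias is not None and re.fullmatch(alias, incoming_ip) is not None


# london resolves to 10.0.200.125 but arrives as 10.0.200.246 after NAT.
print(sender_accepted("10.0.200.246", "10.0.200.125"))
print(sender_accepted("10.0.200.246", "10.0.200.125", alias="10.0.200.246"))
print(sender_accepted("10.0.200.17", "10.0.200.125", alias=r"10\.0\.200\..*"))
```

Because the alias is a regular expression, an unescaped dot matches any character, so escaping dots (as in the last call) keeps the pattern strict.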

History, file archives and versions

history="0|1": This parameter controls whether we log what was sent or not.

rhistory="0|1": This parameter tells the receiver to save history of what was synchronized. The default is 1.

versions="maxversionnumbers": This parameter controls the ability to go back and restore various versions of the file. DO NOT CONFUSE THIS PARAMETER WITH THE XML VERSION 1.0 IN THE HEADER. This parameter has the number of versions of a file to keep around for snapshot purposes and for recovering deleted files. Depending on available storage on the receiver side, you can keep as many versions of a file as you want, as long as you have enough storage for all of them. Older versions of the files are kept in names that match patterns {oldfilename}.enduradata_snapshot.nnn where nnn is the sequence of number 0 to maxversions. When you reach the maximum number of snapshots to keep around, the extension nnn is rotated and the older version is overwritten. To find out the file age, use creation time for sorting. This will allow you to recover older versions of your files and to recover from deletes. The enduradata_snapshot file pattern can be used for cleaning up of older versions using a simple command like find -name "*.enduradata_snapshot.*" -exec some_command {} \;

If versions is negative, edpcloud will keep all previous versions. This is the preferred method: it is more storage efficient but has a little more computational overhead. A negative versions value is an archive.

archive="0|1": This parameter controls the ability to go back and restore various versions of the file. This parameter is the same as a negative version. Set this parameter to any character and all versions of the file will be archived. The archive file name is derived using the original file name and the MD5 checksum of the file. All files will be in a directory called "enduradata_snapshot" unless archivedir is used.

archiveincxpr="archiveincludepatternfile": archive pattern include file contains regular expressions. If a file name matches the regular expressions then the old version of the file is archived.

archiveexcxpr="archiveexcludepatternfile": archive pattern exclude file contains regular expressions. If a file name matches the regular expressions then the file is excluded from the archive

archiveincfile="archiveincludefile": a file that contains archive pattern as regular expressions. If a file name matches one of the regular expressions then the file is included in the archive

archiveexcfile="archiveexcludefile": a file that contains archive pattern as regular expressions. If a file name matches one of the regular expressions then the file is excluded from the archive

By default, if file archiveexcludes is found under edpcloud etc directory, its content will be used to filter out files to exclude from archival. By default, if file archiveincludes is found under edpcloud etc directory, its content will be used to include files that match the regular expressions in the includes file.

archivedir="dirname": dirname is where the archive files are put. If no dirname is specified then the files will be in enduradata_snapshot that is a sibling of the original file. archivedir can have the following macros (as described in storepath above): %sender%, %receiver%, %link%, %date%, %year%, %month%, %weekday%, %day%, %hour%, %minute%,%dayofyear%, %second%, %ip%, %hostname%.

A catalog of archived files is located in enduradata_farchive_list under archivedir. It shows the original file name and its original md5 and the time of change of the current file in storepath.

If archivedir is a real path then the file will be archived in that path. A good example is to put archives in a directory parallel to storepath. Example archivedir="/home/archives" for Linux or archivedir="d:\archivedata". These files tend to grow and you may want to set a schedule to clean them up when not needed.

Archive files may be safely compressed using gzip, zip or any other compression program that adds a ".gz" or ".zip" extension to the file name; a duplicate archive file will not be created. A file is archived only if file_md5 or file_md5.gz or file_md5.zip is not found in the archivedir. A list of the file archives and the modification times is available in the archive dir as well.

Bandwidth throttling

bwmax="maxbytespersecond": This parameter throttles the bandwidth. It is expressed in bytes per second and is the maximum bandwidth to use when sending file data to a remote node. The limit applies per receiver. If you have more than one receiver per link, the maximum bandwidth for the link is the sum of the bwmax values of its receivers. You can specify a different bandwidth for each receiver or the same for all receivers.

Example

bwmax="65536" will give you approximately 64 Kbytes per second for each receiver in the link.
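Because the limit applies per receiver, the total for a link is the sum over its receivers. A small Python sketch of that arithmetic follows; link_bandwidth and the receiver names are hypothetical.

```python
def link_bandwidth(receiver_bwmax):
    """Total bytes per second for a link, given bwmax per receiver."""
    return sum(receiver_bwmax.values())


receivers = {"london": 65536, "chicago": 65536, "tokyo": 131072}
total = link_bandwidth(receivers)
print(total, "bytes/s =", total // 1024, "Kbytes/s")
```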

Resource throttling

throttle="milliseconds": After transmitting a payload, the thread that just sent the data sleeps for the time specified by throttle. This value is in milliseconds. Use this to control how data trickles towards the receiver. This can also be controlled in real time using edjob.

Dealing with millions of small files being modified in a fraction of second

waitjournal="milliseconds": When a large number of small files are modified within a few milliseconds, journaling needs to be given priority. The transport layer can continue to work, but it yields briefly to the journaling so that data continues to be journaled and a balance is kept between journaling and synchronization. Set waitjournal to a value greater than 1.

Failures and retries

EDpCloud continues to retry to send the data to the remote servers. The number of retries can be controlled as follows.

maxfailures="maximumfailures": This parameter sets the maximum number of failures allowed, that is, the maximum number of times EDpCloud will retry sending your data to a remote site. Once maxfailures is reached, the system stops trying to send the data. The default value is 256.

Example

maxfailures="0": never give up trying to send the data.
maxfailures="32": stop trying to send data to the remote site after 32 failures.

snapfailures="minfailures": (WINDOWS ONLY): This parameter indicates the minimum number of failures that must happen before a VSS snapshot is taken and used to replicate the files that failed to replicate minfailures times. This number should be at least 10 to avoid wasting resources. It should also be well below maxfailures. The default is 20.

minstatfailure="nn": If set to an integer greater than zero, then edstat will not show intermittent failures unless they exceed nn. The default is 5. This is useful when replicating files that are large or that change while being replicated.
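The retry settings might be combined as in the following sketch. Placing maxfailures and minstatfailure on the receiver element is an assumption made for illustration, as are the chosen values and the config name and password.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <sender hostname="orders.companya.com" />
    <!-- Assumed placement: give up after 64 failures and let edstat
         report failures only once they exceed 10 -->
    <receiver hostname="sales.companyb.com" storepath="/backup"
        maxfailures="64" minstatfailure="10" />
  </link>
</config>
```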

Management queues optimization

maxqlen="maxnumpayloads": This is the number of payloads (a payload contains many files) that are queued and made available for the transport layer to send to the remote location. The default value is the number of workers times 6. An optimal value is between (2*numworkers) and (numworkers*6).

Example

maxqlen="10"

File permissions, ownership by users and group id

EDpCloud replicates all the meta data. However, when replicating from Windows to Unix, file ownership, group IDs and permissions can be controlled as follows:

owneruid="userid": This parameter is used to allow userid to take ownership of the files.

ownergid="groupid": This parameter is used to indicate that the files will be part of the UNIX group groupid.

ownermode="mode": This parameter is used to set the permission mode of the files. Note that the mode value is in octal, e.g. 777, 700, ...

These parameters also allow Linux/Unix to override the original meta data.

owneruid, ownergid and ownermode can be used on UNIX to override the meta data from the source and change the ownership and mode of the files on the receiver.

Both userid and groupid must exist in the password files. Valid values are numbers rather than the group or user names.
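A minimal sketch of overriding ownership on a UNIX receiver follows. The numeric IDs 1001 and the mode 750 are hypothetical; the IDs must exist on the receiving system.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <sender hostname="orders.companya.com" />
    <!-- Hypothetical IDs: received files are owned by uid 1001, gid 1001,
         with octal mode 750 -->
    <receiver hostname="sales.companyb.com" storepath="/backup"
        owneruid="1001" ownergid="1001" ownermode="750" />
  </link>
</config>
```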

Run in scheduled mode, in real time mode or both

isscheduled="0|1": This parameter applies to the receiver and tells the system that the current sender will accept files manually or scheduled. The default is 1.

isrealtime="0|1": This parameter applies to the receiver and tells the system that the current sender will accept file changes in real time. The default is 1.
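As a sketch, a receiver that should take part in manual or scheduled runs but not in real time replication could be declared as follows. The host names, store path and config name and password are illustrative values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <sender hostname="orders.companya.com" />
    <!-- Scheduled or manual transfers only; real time changes are not accepted -->
    <receiver hostname="sales.companyb.com" storepath="/backup"
        isscheduled="1" isrealtime="0" />
  </link>
</config>
```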

Controlling the number of parallel outgoing streams

workers="number_of_threads": This parameter is the number of streams to use for sending data to the link. The default is 1. This parameter is valid only for the receiver. This parameter must be tuned to get better performance. A high number may lead to disk contention and network congestion. Adjust it and measure your performance rather than setting it and forgetting about it.
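The following sketch uses four outgoing streams for one receiver and sizes the payload queue within the range suggested for maxqlen (between 2*4 = 8 and 4*6 = 24). Putting maxqlen on the same element as workers is an assumption, and the values are starting points to measure against rather than recommendations.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <sender hostname="orders.companya.com" />
    <!-- Four parallel streams; queue length chosen between 8 and 24 -->
    <receiver hostname="sales.companyb.com" storepath="/backup"
        workers="4" maxqlen="16" />
  </link>
</config>
```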

File and directory deletions

By default EDpCloud does not propagate the deletes. Use the following if you want deletes to be replicated as well.

deletes="1|0|Y|n|y|N": This parameter tells the receiver whether it should propagate deletes. When set to 1, a delete that happens on the sender side is also applied on the receiver side. The default value is 0 (deletes are not propagated). Use this parameter with care: directories that were deleted on the sender side are deleted recursively on the receiver side as well. This parameter applies to the real time product only.

forcedelete="1|0": This parameter tells the receiver to force the delete of a protected file or directory. When set to 1, the receiver will first change the permissions on the remote file or directory before attempting to delete it.
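A sketch of a receiver that mirrors deletes, including those of protected files, is shown below. The host names and store path are illustrative; test this on non-critical data first, since directory deletes are recursive.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <sender hostname="orders.companya.com" />
    <!-- Deletes on the sender are applied here; protected entries are
         made deletable before removal -->
    <receiver hostname="sales.companyb.com" storepath="/backup"
        deletes="1" forcedelete="1" />
  </link>
</config>
```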

Replicating Posix ACLs and dealing with metadata failures

acls="y": This parameter is used to replicate ACLs. The default is no for Linux, Mac and UNIX. Windows ACLs are replicated by default.

nostatfailure="1|0": If set to 1, then failures to set meta data on the remote (such as chmod, chown, chgrp or ACLs) will be ignored. This is useful if someone removes a file or a directory on the remote while it is being replicated.
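The two settings could be combined as in this sketch, which replicates ACLs to a Linux or UNIX receiver while tolerating meta data failures. Placing both attributes on the receiver element is an assumption made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <sender hostname="orders.companya.com" />
    <!-- Assumed placement: replicate ACLs and ignore chmod, chown,
         chgrp or ACL failures on the remote -->
    <receiver hostname="sales.companyb.com" storepath="/backup"
        acls="y" nostatfailure="1" />
  </link>
</config>
```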

Managing payload size and priorities

maxpayloadbytes="maxsize": If set to a value greater than 0, this parameter tells EDpCloud to batch files into payloads whose total size (the sum of all file sizes) is no more than maxsize bytes.

procpriority="intvalue": This is the priority of the ed_sender transport process on the sender side. Under Linux and UNIX flavors, this is equivalent to nice values. Acceptable values for Windows: 0 (none), 1 (run only when idle), 2 (below normal), 3 (normal), 4 (above normal) or 5 (highest). The default value is 0 (unchanged). Acceptable values for Linux and other Unix flavors range from -20 (highest priority) to 19 (lowest priority). See the man page for nice under all Unix flavors.

qstorelimit="intvalue": This parameter indicates the fill percentage of the journaling queue that will cause EDpCloud to give more priority to journaling than to the replication itself. The value should be between 0 and 100. When the number of blocks that need to be journaled reaches qstorelimit, EDpCloud will yield the CPU to the journaling threads to allow them to catch up.

Controlling payload size for each stream

rowsize="nnnn": This is an integer between 1 and 16K. It determines how many entries are processed by each thread as a bulk transfer. The default is 1024 for Linux and UNIX flavors and 256 for Windows. This parameter may be adjusted internally by the system on failures.

Yielding the CPU to lower priority and reduce CPU load

backoff="nnn": When set to a value higher than 0, each thread for the particular link will sleep for nnn seconds to let other links transfer data with a higher priority. This value applies to the sender only and applies to all streams and worker threads on the sender.
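For a low priority link, backoff can be paired with procpriority as in the sketch below. It assumes a Linux sender, where procpriority is a nice value, and that both attributes go on the sender element; the values 10 and 5 are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <!-- Assumed placement: run ed_sender at nice 10 and let each thread
         of this link sleep 5 seconds so other links go first -->
    <sender hostname="orders.companya.com" procpriority="10" backoff="5" />
    <receiver hostname="sales.companyb.com" storepath="/backup" />
  </link>
</config>
```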

Serialization of renames and deletes

serialrename="0|1": When set to 0, this parameter unserializes renames and deletes. The default is 1: deletes and renames are processed one at a time (only one thread processes synchronization). To avoid any conflicts, make sure you also set workers to 1. Do not set serialrename to 0 unless you have experimented with it and tested it with your workload.

Post and Pre processing and integration with workflow

Administrators may need to run some commands, for workflow purposes, before a file is sent (sender side) or before it is stored (receiver side). Administrators may also need to run other commands after a file is sent and after it is received. They can use the post/pre keywords to specify the commands to run. The commands must be located in the directory postpre under the base directory (same level as the etc directory).

The user may specify whether the sender or receiver should wait for the post or pre commands to finish before proceeding to the next task. They can use prewait or postwait for this purpose. However, postwait and prewait may fail to wait, since the main worker thread may exit before the pre/post processing is done. For this reason, we suggest using the default value of 1 for postwait and prewait (wait for post/pre processing to finish).

The post and pre directives can be used to create automated processing like ETL (Extract, Transform, Load) and to distribute task execution while moving data to desired locations or aggregating data from desired locations.

post/pre commands

pre="precommand": This is the command to run before a file is sent (sender side) or stored (receiver side).

post="postcommand": This is the command to run after a file is sent (sender side) or stored (receiver side).

prewait="1": Wait for the pre command to finish.

prewait="0": Do not wait for the pre command to finish.

postwait="1": Wait for the post command to finish.

postwait="0": Do not wait for the post command to finish.

post/pre commands arguments

Both precommand and postcommand are passed a post/pre processing file as their first argument. The file contains the list of files that will be synced (pre) or were synced (post) and to which the post/pre commands should be applied. The files are stored in a directory named postpre/postfiles under the base directory. The post/pre commands should dispose of these files once they are processed.

Pre processing file content

The pre processing files have the following format. Variables are separated using a pipe. The content of the pre processing file is as follows:

Post processing file content

The post processing files have the following format. Variables are separated using a pipe. The content of the post processing file is as follows:

When running the post commands, the command must examine the status variable and take action accordingly.

Example of post/pre processing
 
	<?xml version="1.0" encoding="UTF-8"?>
	<config name="simple" password="0xAdD4D001:foo">
	  <link name="link0">
	    <sender hostname="orders.companya.com" post="notify_clients" postwait="1" />
	    <receiver hostname="sales.companyb.com" post="run_report1" postwait="0"
	        storepath="/backup" />
	  </link>
	</config>

In this case the receiver will run run_report1 and will not wait for it to finish. The sender will run notify_clients but it will wait for it to finish.

Encryption

EnduraData distinguishes between two types of encryption. The first is message and communication encryption, which applies to all communications between senders and receivers. The second is file encryption. The user may decide to leave the files encrypted once they are sent (in which case they must remember the encryption keys if they want to get their data back), or they may decide to decrypt the files once they arrive at their destination. A user can use communication encryption, file encryption or both. When using communication encryption alone, all data transmitted is encrypted by the sender and decrypted by the receiver if their keys match.

Encryption is controlled by a combination of parameters in eddist.cfg and files (edkeys, edfilekeys). Because of export restrictions, only AES 128 bit is available by default. Stronger encryption is available only in certain countries such as the US. Please check with your sales representative first.

Communication encryption

To encrypt communications between servers, create a file called edkeys in your etc directory:

edkeys file format

By default, all communications are encrypted using the edpasswd file as a key. Thus it is better not to use an edkeys file at all.

You can use regular expressions for the link, the sender and the receiver. The lines are evaluated last in first out.

For every link, every sender, every receiver use foo as a key to encrypt communication.

Example of multiple lines in edkeys

Encrypting files in flight or on disk

File encryption is configured by setting an encryption key and an encryption/decryption directive using: key, encrypt or decrypt.

key="secretkey"
encrypt="1|0"
decrypt="1|0"

The encrypt and decrypt parameters work in conjunction with the key parameter. encrypt and decrypt are mutually exclusive.

EDpCloud can be configured to use encryption in a multitude of ways. Users have the following choices:

a. Encrypt communications between servers: use key="secretkey" for the sender and receiver or create file edkeys under etc directory.
b. Encrypt all files before they leave the server and leave them encrypted on the remote server: Use fkey="keyname" on the sender side. And specify encrypt="1". fkey and encrypt keywords must be set together in order to encrypt.
c. Encrypt all files before they leave but decrypt them before storing them.

Users can either decrypt on restore or use edcryptor to decrypt in place (see the HTML documentation).

Example 1 of file encryption
 
	<?xml version="1.0" encoding="UTF-8"?>
	<config name="simple" password="0xAdD4D001:foo">
	  <link name="link0">
	    <sender hostname="orders.companya.com"
	        key="mysecretkey" encrypt="1" />
	    <receiver hostname="sales.companyb.com" decrypt="0"
	        storepath="/backup" />
	  </link>
	</config>

This example will encrypt files using mysecretkey before they leave orders.companya.com. The data will remain encrypted on the remote machine (sales.companyb.com), because the receiver sets decrypt="0".

It is critical that the user remembers the encryption key and protects the directory where the configuration exists. If you lose the key, you will not be able to access the encrypted data.

Example 2 of file encryption

This example will encrypt files using mysecretkey before they leave orders.companya.com. The data will be decrypted on the remote machine (sales.companyb.com).
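A minimal sketch matching this description can be derived from Example 1 by switching the receiver to decrypt="1". Repeating key on the receiver is an assumption made here so that the receiver has the key it needs to decrypt.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="0xAdD4D001:foo">
  <link name="link0">
    <!-- Files are encrypted before they leave the sender -->
    <sender hostname="orders.companya.com"
        key="mysecretkey" encrypt="1" />
    <!-- Assumption: the receiver carries the same key in order to decrypt -->
    <receiver hostname="sales.companyb.com"
        key="mysecretkey" decrypt="1"
        storepath="/backup" />
  </link>
</config>
```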

Example of backing up to a local drive (1 to 1 )

The following is a simple configuration that backs up the content of /home to /backup.
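A minimal sketch of such a configuration is given below. It reuses the include pattern style found elsewhere in this document; the config name and password are illustrative values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <link name="link0">
    <!-- Send everything under /home from this machine -->
    <sender hostname="localhost" include="^/home/.*" />
    <!-- Store it on the same machine under /backup -->
    <receiver hostname="localhost" storepath="/backup" />
  </link>
</config>
```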

Example of distributing content to many hosts (1 to many)

The following example has two links.

link1: sends data from localhost to localhost. Only content under /data1 and under /var/www/realm1 is sent. It is stored under /home/dest/link1.
link2: sends data from hostname targua to the following hosts: localhost, mumu, feddev0, mac11, sol13 and www2. Notice how link1 uses a different password than link2.

Each receiver can use include and exclude to specify what content to accept and what content to reject.

Example of many to one content collectors and consolidation

A simple configuration that can collect and aggregate content from many hosts to a single host (data protection/data recovery for example) is shown below. Although this configuration is simple, it allows all hosts whose fully qualified name matches *.enduradata.com to send content to collector.enduradata.com for backup, data analytics or other decision making tools.

Example of Windows meets Linux, Mac or Unix

The following configuration shows a combined configuration for Windows and *NIX* platforms.

Counter example of a bad configuration

The following configuration will fail because localhost occurs twice as a receiver. Duplicate receivers within the same link are not allowed. Use edverify to detect this kind of error.

 
	<?xml version="1.0" encoding="UTF-8"?>
	<config name="simple" password="foo">
	  <link name="link0">
	    <sender hostname="localhost"
	        include="^/home/.*"
	        exclude="^.*core$" versions="1" />
	    <receiver hostname="localhost"
	        include="^/home/.*" storepath="/backup" versions="1" />
	    <receiver hostname="localhost"
	        include="^/home/.*" storepath="/media/doc/o" versions="1" />
	  </link>
	</config>

Example of a correct way to configure the scenario above
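Since duplicate receivers are only disallowed within the same link, one way to fix the scenario is to give each localhost receiver its own link. The sketch below does that; the link names link0 and link1 are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config name="simple" password="foo">
  <!-- First copy: /home to /backup -->
  <link name="link0">
    <sender hostname="localhost"
        include="^/home/.*"
        exclude="^.*core$" versions="1" />
    <receiver hostname="localhost"
        include="^/home/.*" storepath="/backup" versions="1" />
  </link>
  <!-- Second copy: /home to /media/doc/o, in a separate link -->
  <link name="link1">
    <sender hostname="localhost"
        include="^/home/.*"
        exclude="^.*core$" versions="1" />
    <receiver hostname="localhost"
        include="^/home/.*" storepath="/media/doc/o" versions="1" />
  </link>
</config>
```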

Now you can send data to all links and all receivers using: edq -n dirname

Another bad configuration example

The following configuration will fail because host zulu is the same as localhost. In this case all content would be sent to /backup twice: once for the localhost instance and once for the zulu instance.

Example of replicating to Google drive or Google cloud storage using rclone as a transport

The data will be stored under /home2/dest

In this case the data is replicated to Google cloud storage.

The data will be stored in bucket1/%weekday% (i.e. bucket1/Monday, bucket1/Tuesday, ...).

The data will be stored in Google drive under a folder called gdrive-public.

Example:

rctransport.conf for the example above

The following is an example of a configuration file for use with EDpCloud.

Section gd1 below is referenced by link l2 in eddist.cfg above using the keyword cloudprovider="gd1". It will store data in bucket1/weekdayname (i.e. bucket1/Friday).

Section gdrive2 below is used by eddist.cfg to store data in Google Drive using cloudprovider="gdrive2". The data will be located in the Google Drive directory gdrive-public.

Notice that gd1 uses the bucket permission policy, while gdrive2 uses the Google Drive credentials. Check the links above the sections and the documentation for your transport layer for how to configure them.

Users can also use the EDpCloud transport by mounting the buckets or by using a VM on the cloud provider.

FILES

$ED_BASE_DIR/etc/eddist.cfg

This is the configuration file.

$ED_BASE_DIR/etc/edpasswd

This is the password file. See the edpasswd manual for more information.

$ED_BASE_DIR/etc/managementhosts.allow

This is the list of hosts allowed to manage replication. The format is one host or IP per line.

$ED_BASE_DIR/etc/dbdirective

File with SQL commands to be executed by eddist. When present, this file contains SQL that eddist will execute. Remove the file after it is executed.

$ED_BASE_DIR/etc/myaliases

This file has one hostname or IP per line. This file lists other IPs or hostnames by which this host is also known. They are synonymous with "localhost" and local IP.

SEE ALSO

edintro(8) edresume(8) edpause(8) edjob(8) edstat(8) ed_senderprovider(8) eddist.cfg(5) edfsmonitor.cfg(5) edscheduler.cfg(5)

ENVIRONMENT

SUPPORT

For more information contact <support@enduradata.com>

AUTHORS

A. A. El Haddi, elhaddi@ieee.org
A. Taouil, ataouil@enduradata.com
S. Dimitri, sdimitri@enduradata.com