Commit 3f707d1a authored by shubham

hadoop readme update

parent d6d68791
@@ -25,23 +25,52 @@ The profiler and the hadoop itself requires password-less SSH access into other
1. Take note of the hostnames or IP addresses of the cluster machines.
2. Run `ssh-keygen` to generate an SSH key pair. Leave the filename blank when prompted to accept the default location.
3. Run `ssh-copy-id <hostname or IP address>`. This adds the public key to the host's `~/.ssh/authorized_keys` file.
4. Run `ssh <hostname or IP>` to test if SSH is working.
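For example, the key-distribution steps might look like the following sketch, where `node1` and `node2` are placeholder hostnames for your own cluster machines:

```bash
# Generate a key pair once on the machine that will run the profiler and the Hadoop jobs.
# Press Enter at the filename prompt to accept the default location (~/.ssh/id_rsa).
ssh-keygen

# Copy the public key to each cluster machine (placeholder hostnames shown).
ssh-copy-id node1
ssh-copy-id node2

# Verify that password-less login works; no password prompt should appear.
ssh node1 hostname
ssh node2 hostname
```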
### Profiler config
Your profiler program files should be installed in a home directory containing a `programs/` and a `plot_logs/` directory.
Go to the profiler home directory and then into the `./profiler/programs/` directory. We need to edit a config file by adding the IP addresses of the cluster machines:
1. Edit `cluster-profiler.config` to add hostnames within the property tag which has `compute.nodes` as the property name.
2. Add the hostnames and/or IP addresses within the `<values> </values>` tag.
The finished config file should have the hostnames/IPs within the `<value>` tags.
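The exact file layout is not reproduced here; based on the property name and value tags described above, a filled-in entry could look roughly like this sketch (the surrounding tag structure and the hosts `node1`/`10.0.0.12` are illustrative assumptions):

```xml
<!-- Sketch only: tag layout assumed from the description above; node1 and 10.0.0.12 are placeholder hosts. -->
<property>
    <name>compute.nodes</name>
    <values>
        <value>node1</value>
        <value>10.0.0.12</value>
    </values>
</property>
```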
## Running the Jar Deployment Script
### Clean Jar Hadoop script
Running the jar with the label `clean` takes one or more BAM files as input and outputs multiple BAM files.
Go to the folder containing the Hadoop deployment shell scripts.
In the `clean.sh` script, edit the following variables (a sketch follows the list):
1. Change the variable `input_dir` to the HDFS folder with the raw BAM files.
2. Change the variable `output_dir` to the HDFS folder where the output of the Map Reduce process will go.
3. The variable `profiler_dir` is the home directory of the profiler setup (as discussed in the previous section).
4. `property_file` is the configuration file, which should also be present in the same folder. The file should be titled `./config.properties`.
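As referenced above, a minimal sketch of the edited variables in `clean.sh`, assuming placeholder HDFS and local paths:

```bash
# Sketch only: all paths are placeholders; adjust to your own cluster layout.
input_dir=/user/hadoop/bam_raw          # HDFS folder containing the raw BAM files
output_dir=/user/hadoop/bam_clean       # HDFS folder for the Map Reduce output
profiler_dir=/home/hadoop/profiler      # profiler home directory from the previous section
property_file=./config.properties       # configuration file in the same folder as clean.sh
```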
Before running the Hadoop job, ensure that these HDFS directories exist using the `hadoop fs` commands.
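For example, using the placeholder paths from the sketch above, the input directory can be checked and created with standard `hadoop fs` commands:

```bash
# Check whether the input directory exists in HDFS (exit code 0 if it does).
hadoop fs -test -d /user/hadoop/bam_raw && echo "input directory present"

# Create the directory if it is missing, then list its contents.
hadoop fs -mkdir -p /user/hadoop/bam_raw
hadoop fs -ls /user/hadoop/bam_raw
```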
Run the script giving a log folder as an argument:
`> bash ./clean.sh log_dir_clean`
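Once the job completes, the cleaned BAM files can be inspected in the output directory (placeholder path again):

```bash
# List the cleaned BAM files written by the clean step.
hadoop fs -ls /user/hadoop/bam_clean
```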
### Mark-duplicate Hadoop script
Running the jar with the label `mark-duplicate` (or `md`) takes multiple BAM files as input and outputs multiple BAM files.
Go to the folder containing the Hadoop deployment shell scripts.
This time the script is `markdup.sh`:
The rest of the steps remain the same:
1. Change `input_dir` to the HDFS folder with the multiple BAM files. **Note:** This should be the output directory of the previous clean step.
2. Change `output_dir` to the HDFS folder where the output of this Map Reduce process will go.
3. The variable `profiler_dir` is the home directory of the profiler setup (as discussed in the previous section).
4. `property_file` is the configuration file, which should also be present in the same folder. The file should be titled `./config.properties`.
Before running the Hadoop job, ensure that these HDFS directories exist using the `hadoop fs` commands.
Run the script giving a log folder as an argument:
`> bash ./markdup.sh log_dir_md`
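Putting the two steps together, an end-to-end run might look like the sketch below; the HDFS paths and log folder names are placeholders carried over from the earlier sketches:

```bash
# Sketch only: placeholder paths and log folder names.

# Step 1: clean the raw BAM files.
bash ./clean.sh log_dir_clean

# Step 2: mark duplicates, using the clean step's output directory as input.
# In markdup.sh, input_dir should point at /user/hadoop/bam_clean (the clean step's output).
bash ./markdup.sh log_dir_md

# Inspect the final output in HDFS (placeholder path for the markdup output_dir).
hadoop fs -ls /user/hadoop/bam_md
```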