# Hadoop Deployment of Gesall Jars
There are three jar files, each handling a different stage of the processing pipeline:
the `alignment` jar, the `clean` jar, and the `md` jar.
## About the Jar files
Given raw FastQ files, these jars need to be deployed sequentially in the following order (a rough end-to-end sketch follows the list):
1. Alignment jar: input is multiple FastQ files (logical files), output is multiple BAM files.
2. Clean jar: input is one or more BAM files, output is multiple BAM files.
3. Mark Duplicate (md) jar: both inputs and outputs are BAM files.
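As a rough sketch, the pipeline amounts to three sequential `hadoop jar` invocations chained through HDFS directories. Everything below (jar filenames, driver classes, HDFS paths) is a placeholder, not the actual Gesall invocation:

```bash
#!/usr/bin/env bash
# Hypothetical end-to-end run; substitute the real jar names,
# driver classes, and HDFS directories for your deployment.
hadoop jar alignment.jar AlignDriver /data/fastq_in     /data/aligned_bams
hadoop jar clean.jar     CleanDriver /data/aligned_bams /data/clean_bams
hadoop jar md.jar        MdDriver    /data/clean_bams   /data/deduped_bams
```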
The steps for running the alignment jar are relatively more involved; there are certain pre-processing steps, which are outlined in `README_align.md`.
For the clean and mark-duplicate jars, scripts are provided, namely `clean.sh` and `md.sh`.
These scripts are responsible for specifying the HDFS input and output directories, setting up the profiler, starting the Hadoop execution, collecting the profiler plots, and organizing the log files.
But before running the scripts, we need to do a few setup steps:
## Before Running the scripts
### Setting up SSH
Both the profiler and Hadoop itself require password-less SSH access into the other machines of the cluster. One way is to simply generate an SSH key pair and add it to each machine; this can be done through a script as well (a sketch follows the list below).
1. Take note of the hostnames or IP addresses of the cluster machines.
2. Run `ssh-keygen` to generate an SSH key pair. Accept the default filename when prompted (leave the field blank).
3. Run `ssh-copy-id <hostname or IP address>`. This appends the public key to the host's `~/.ssh/authorized_keys` file.
4. Run `ssh <hostname or IP>` to see if SSH is working.
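A minimal sketch of such a script, assuming `node1`, `node2`, and `node3` stand in for your actual cluster machines:

```bash
#!/usr/bin/env bash
# Generate the key pair once (press Enter at the prompts to accept the defaults),
# then push the public key to every machine in the cluster.
ssh-keygen -t rsa

# Placeholder hostnames; replace with the machines noted in step 1.
for host in node1 node2 node3; do
    ssh-copy-id "$host"   # appends the public key to ~/.ssh/authorized_keys
    ssh "$host" 'echo "password-less SSH to $(hostname) works"'   # step 4 check
done
```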
### Profiler config
Go to the profiler home directory and into the `./profiler/programs/` directory. We need to edit a config file, adding the IP addresses of the cluster machines:
1. Edit `cluster-profiler.config` to add the hostnames. #TODO
## Clean Jar Hadoop script
Open the `clean.sh` script:
the `input_dir` is the HDFS folder with the raw BAM files;
the `output_dir` is the HDFS folder where the output of the MapReduce job will go.
Before running the Hadoop job, ensure that the input directory exists, using the `hadoop fs` commands.
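For example, a few `hadoop fs` commands that do this; the HDFS paths here are placeholders for your own layout:

```bash
# Check that the input directory exists and holds the raw BAM files
hadoop fs -ls /user/gesall/clean_input

# If it does not, create it and upload the BAMs from the local filesystem
hadoop fs -mkdir -p /user/gesall/clean_input
hadoop fs -put local_bams/*.bam /user/gesall/clean_input/

# A standard MapReduce job fails if its output directory already exists,
# so remove any stale one left over from a previous run
hadoop fs -rm -r /user/gesall/clean_output
```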
## Mark-Duplicate Hadoop script
TODO
# Hadoop Deployment for Gesall Alignment Jar
## Pre-processing
## Hadoop Configuration