Commit e3303aea by Abhishek Roy

Move code from internal SVN to public Git

### About
This project contains the different software layers of the Gesall big data platform for genome data analysis.
#### Code Layout
1. Runtime and Storage layers are in `hdfs.*` packages.
2. Data Partitioning schemes (with MapReduce wrappers) are in `program.{alignment|clean|md}.latest` packages.
3. Error Diagnosis programs are in `correctness.*` packages.
### Building
##### Eclipse IDE
1. Import the code from the `gesall-core` repository into Eclipse.
2. Add the `gesall-htsjdk` and `gesall-picard` Eclipse projects as dependencies in `Project->Properties->Java Build Path->Projects`.
3. Add all the external JAR files from `gesall-libs` in `Project->Properties->Java Build Path->Libraries`.
#### Exporting code as JAR files
##### Eclipse IDE
1. Use the `File->Export->Runnable JAR` option with library handling set to `Extract required libraries into generated JAR`.
2. This creates a self-contained fat JAR file.
3. The Apache Hadoop JAR files in `gesall-libs` should match the Hadoop version running on the deployment cluster.
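
One way to check item 3 before exporting is sketched below. It is only a sketch: the function names are made up for this example, it assumes the JARs follow the usual `hadoop-<module>-<version>.jar` naming with a plain numeric version, and it assumes the first line of `hadoop version` output has the form `Hadoop <version>`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the Gesall code base.

# Print the version embedded in a JAR file name,
# e.g. hadoop-common-2.7.3.jar -> 2.7.3.
jar_version() {
    basename "$1" .jar | sed 's/.*-\([0-9][0-9.]*\)$/\1/'
}

# Read `hadoop version` output on stdin and print the version number.
parse_hadoop_version() {
    head -n 1 | awk '{print $2}'
}

# Print every hadoop-*.jar under directory $1 whose version differs from $2.
check_hadoop_jars() {
    find "$1" -name 'hadoop-*.jar' | while read -r jar; do
        have=$(jar_version "$jar")
        [ "$have" = "$2" ] || echo "MISMATCH: $jar is $have, expected $2"
    done
}
```

On a cluster node this could be run as `check_hadoop_jars /path/to/gesall-libs "$(hadoop version | parse_hadoop_version)"`; no output means no mismatch was found. Vendor-suffixed versions are not handled by this pattern.
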
##### Command line
1. Example `ant` build files are in the `ant-build` directory.
2. These build files were generated using the `Export->Ant buildfiles` option in Eclipse, but references to specific versions of Hadoop libraries were removed.
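
The generated build files contain absolute paths from the machine they were exported on (for example `/Users/aroy/workspace`), so they need adapting before use. A minimal sketch of one way to do that follows. The helper name `localize_build`, the build file name `picard_md.xml`, and the output name `build-local.xml` are placeholders, and the new prefix is assumed not to contain `|` or `&`.

```shell
#!/bin/sh
# Hypothetical helper: print build file $1 with the original workspace
# prefix replaced by the local workspace directory $2.
localize_build() {
    sed "s|/Users/aroy/workspace|$2|g" "$1"
}

# Example (file names are placeholders):
#   localize_build ant-build/picard_md.xml "$HOME/workspace" > build-local.xml
#   ant -f build-local.xml create_run_jar
```

Because the Hadoop library references were removed, `<zipfileset>` entries for the Hadoop JARs that match the cluster may need to be added to the localized file before running `ant`.
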
### License
Our code is released under the MIT license.
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project default="create_run_jar" name="Create Runnable Jar for Project gesall-rad">
<!--this file was created by Eclipse Runnable JAR Export Wizard-->
<!--ANT 1.7 is required -->
<target name="create_run_jar">
<jar destfile="/Users/aroy/workspace/gesall-rad/bin/picard_md.jar" filesetmanifest="mergewithoutmain">
<manifest>
<!-- <attribute name="Main-Class" value="hdfs.clean.bam.bloom.md.MarkDuplicatesMain"/> -->
<attribute name="Main-Class" value="hdfs.clean.mdfix.MDFixMain"/>
<attribute name="Class-Path" value="."/>
</manifest>
<fileset excludes="picard_md_lib/*" dir="/Users/aroy/workspace/gesall-rad/bin"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-libs/apache/cmdline/commons-exec-1.2.jar"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-libs/json/json-20140107.jar"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-htsjdk/dist/snappy-java-1.0.3-rc3.jar"/>
<fileset dir="/Users/aroy/workspace/gesall-htsjdk/bin"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-htsjdk/lib/testng/testng-5.5-jdk15.jar"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-htsjdk/lib/commons-jexl-2.1.1.jar"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-htsjdk/lib/snappy-java-1.0.3-rc3.jar"/>
<fileset dir="/Users/aroy/workspace/gesall-picard/bin"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-picard/lib/testng/testng-5.5-jdk15.jar"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-picard/lib/ant/bcel-5.2.jar"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-libs/jopt/jopt-simple-4.8.jar"/>
<zipfileset excludes="META-INF/*.SF" src="/Users/aroy/workspace/gesall-libs/guava-18.0/guava-18.0.jar"/>
</jar>
</target>
</project>
{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural
\f0\fs24 \cf0 \
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural
\b \cf0 HADOOP
\b0 \
\
hadoop fsck /user/aroy/bam/test/na12878_vsmall_nb.bam -files -blocks -racks\
\
Native library paths in yarn/hadoop-env changed to lib/native, not just lib\
\
\
\b MR Rounds
\b0 \
\
https://issues.apache.org/jira/browse/HADOOP-2735\
Set java.io.tmpdir\
\
\b MAVEN\
\
{\field{\*\fldinst{HYPERLINK "http://preilly.me/2013/05/10/how-to-install-maven-on-centos/"}}{\fldrslt http://preilly.me/2013/05/10/how-to-install-maven-on-centos/}}\
\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural
\b0 \cf0 mvn clean dependency:copy-dependencies package -DskipTests\
\
\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural
\b \cf0 CLUSTER\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural
\b0 \cf0 \
csshX --login aroy aroy@compute-0-[3-12]\
\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural
\b \cf0 Alignment
\b0 \
\
split -l 85000000 reads_shuffled_merge.fastq split\
\
\b MarkDuplicates
\b0 \
\
http://sourceforge.net/p/picard/wiki/Main_Page/#q-why-does-a-picard-program-use-so-many-threads\
A: Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taken place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities of bases with Q >= 15.\
\
\
cat coverage.txt | tr -t : ' ' | tr -t '\\t' ' ' | cut -d ' ' -f3 | grep -n 771822\
\
Use the same snappy-java version Picard uses, i.e. 1.0.3-rc3\
\
\
Snappy disabled in code\
-Djava.io.tmpdir=/path/to/tmpdir\
-Dsnappy.disable=true\
Maybe check the snappy version\
}
\ No newline at end of file
<?xml version="1.0"?>
<configuration>
<property>
<name>hdfs.address</name>
<value>yeeha.cs.umass.edu:8080</value>
<description>HDFS</description>
</property>
</configuration>