Python APIs created for this project

Annotation module

This module annotates genomic regions with RNA types.

AnnoMax.overlap(bed1, bed2)

This function tests whether two Bed objects on the same chromosome overlap.

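Parameters:
  • bed1, bed2 – two Bed objects defined by xplib.Annotation.Bed (BAM2X)
Returns:

boolean – True if the two regions overlap, False otherwise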

Example:

>>> from xplib.Annotation import Bed
>>> from AnnoMax import overlap
>>> bed1=Bed(["chr1",10000,12000])
>>> bed2=Bed(["chr1",9000,13000])
>>> print overlap(bed1,bed2)
True
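
The overlap test itself is not shown in this documentation. Below is a minimal sketch of one common way to test two regions for overlap, assuming half-open [start, stop) coordinates as used in BED files. The Region tuple and the regions_overlap name are hypothetical stand-ins, not the xplib or AnnoMax API.

```python
from collections import namedtuple

# Hypothetical stand-in for xplib.Annotation.Bed; only chr/start/stop are modeled.
Region = namedtuple("Region", ["chr", "start", "stop"])

def regions_overlap(a, b):
    """Return True if two half-open regions on the same chromosome share a base."""
    if a.chr != b.chr:
        return False
    # Two intervals overlap when each one starts before the other one ends.
    return a.start < b.stop and b.start < a.stop

print(regions_overlap(Region("chr1", 10000, 12000), Region("chr1", 9000, 13000)))
```

Under the half-open assumption, regions that merely touch (one ends where the other starts) are reported as non-overlapping.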

AnnoMax.Subtype(bed1, genebed, typ)

This function determines whether a region falls in an intron, exon, or UTR of a transcript, using the annotation from a BED12 file.

Parameters:
  • bed1

    A Bed object defined by xplib.Annotation.Bed (BAM2X)

  • genebed – A Bed12 object representing a transcript, defined by xplib.Annotation.Bed, with exon/intron/utr information from a BED12 file
Returns:

str – RNA subtype: “intron”, “exon”, “utr3”, “utr5”, or “.”
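
How the subtype is derived from a BED12 record is not spelled out here. The sketch below shows one plausible reading: exons are the BED12 blocks, the coding region is thickStart to thickEnd, and exonic bases outside the coding region are UTR. The function name subtype_at, the use of a single position rather than a whole region, and the meaning of “.” (position outside the transcript) are all assumptions, not the AnnoMax implementation.

```python
def subtype_at(pos, tx_start, strand, thick_start, thick_end,
               block_sizes, block_starts):
    """Classify a 0-based position against one BED12 transcript record."""
    for size, offset in zip(block_sizes, block_starts):
        exon_start = tx_start + offset
        exon_end = exon_start + size              # half-open exon interval
        if exon_start <= pos < exon_end:
            # thick_start == thick_end marks a transcript without a coding region.
            if thick_start == thick_end or thick_start <= pos < thick_end:
                return "exon"
            left_of_cds = pos < thick_start
            # On the minus strand, the genomic left side is the 3' end.
            return "utr5" if (strand == "+") == left_of_cds else "utr3"
    tx_end = tx_start + block_starts[-1] + block_sizes[-1]
    if tx_start <= pos < tx_end:
        return "intron"                           # inside the transcript, between exons
    return "."                                    # assumption: "." means no subtype

# Three exons: [100,150), [200,250), [300,350); coding region [120,320).
print(subtype_at(170, 100, "+", 120, 320, [50, 50, 50], [0, 100, 200]))
```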

Example:

>>> from xplib.Annotation import Bed
>>> from xplib import DBI
>>> from AnnoMax import Subtype
>>> bed1=Bed(["chr13",40975747,40975770])
>>> a=DBI.init("../../Data/Ensembl_mm9.genebed.gz","bed")
>>> genebed=a.query(bed1).next()
>>> print Subtype(bed1,genebed)
intron

AnnoMax.optimize_annotation(c_dic, bed, ref_detail)

This function will select an optimized annotation for the bed region from the genes in c_dic.

It selects the annotation by priority, in this order: exon/utr of coding transcript > small RNA > exon of lincRNA > small RNA > exon/utr of nc transcript > intron of mRNA > intron of lincRNA. Genes on the same strand as the read (ProperStrand) always have higher priority than those on the opposite strand (NonProperStrand). Repeat elements have the lowest priority (except rRNA_repeat, according to the annotation files).
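
A priority rule of this kind can be expressed as a sort key. The sketch below is illustrative only: the category labels, the candidate tuple layout, and the pick_annotation name are hypothetical. “small RNA” is ranked at its first position in the list above, and the special handling of rRNA_repeat is left out because its rank is not stated here.

```python
# Hypothetical category labels, ordered from highest to lowest priority.
PRIORITY = [
    "coding_exon_utr",
    "small_RNA",
    "lincRNA_exon",
    "noncoding_exon_utr",
    "mRNA_intron",
    "lincRNA_intron",
]

def pick_annotation(candidates):
    """Pick the best (category, name, proper_strand) tuple from candidates.

    Repeats rank below everything else; otherwise same-strand hits beat
    opposite-strand hits, and remaining ties are broken by PRIORITY.
    """
    def key(candidate):
        category, _name, proper_strand = candidate
        is_repeat = category == "repeat"
        rank = len(PRIORITY) if is_repeat else PRIORITY.index(category)
        # False sorts before True, so non-repeats and same-strand hits come first.
        return (is_repeat, not proper_strand, rank)
    return min(candidates, key=key)

hits = [("mRNA_intron", "geneA", True),
        ("coding_exon_utr", "geneB", False),
        ("repeat", "L1", True)]
print(pick_annotation(hits))
```

In this example the same-strand intron hit wins over the opposite-strand exon hit, because strand is compared before category.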

AnnoMax.annotation(bed, ref_allRNA, ref_detail, ref_repeat)

This function builds on overlap(), optimize_annotation(), and Subtype() to annotate the RNA type, name, and subtype of any genomic region. It first finds the genes with maximum overlap with bed, then uses optimize_annotation() to select an optimized annotation for bed, in the following steps:

  • Find hits (genes) with overlaps larger than Perc_overlap of the bed region length and build dic
  • Find hits (genes) with overlaps between (Perc_max * max_overlap, max_overlap) and build P_dic (for ProperStrand), N_dic (for NonProperStrand).
  • Find an annotation for the bed region among the hits.
Parameters:
  • bed

    A Bed object defined by xplib.Annotation.Bed (in BAM2X).

  • ref_allRNA – the DBI.init object (from BAM2X) for bed6 file of all kinds of RNA
  • ref_detail

    the DBI.init object for bed12 file of lincRNA and mRNA with intron, exon, UTR

  • ref_repeat

    the DBI.init object for bed6 file of mouse repeat

Returns:

list of str – [type, name, subtype, strandcolumn]

Example:

>>> from xplib.Annotation import Bed
>>> from xplib import DBI
>>> from AnnoMax import annotation
>>> bed=Bed(["chr13",40975747,40975770])
>>> ref_allRNA=DBI.init("all_RNAs-rRNA_repeat.txt.gz","bed")
>>> ref_detail=DBI.init("Data/Ensembl_mm9.genebed.gz","bed")
>>> ref_repeat=DBI.init("Data/mouse.repeat.txt.gz","bed")
>>> print annotation(bed,ref_allRNA,ref_detail,ref_repeat)
['protein_coding', 'gcnt2', 'intron', 'ProperStrand']
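
The first two filtering steps described for annotation() can be sketched as follows. This is not the AnnoMax code: the split_hits name, the (name, strand, overlap_bp) tuple layout, and the default threshold values are placeholders chosen for illustration; the actual values of Perc_overlap and Perc_max are not given in this documentation.

```python
def split_hits(bed_length, bed_strand, hits, perc_overlap=0.7, perc_max=0.85):
    """Filter hits by overlap and split them by strand.

    hits is a list of (name, strand, overlap_bp) tuples. The threshold
    defaults are placeholders, not the values used by AnnoMax.
    """
    # Step 1: drop hits that cover too little of the query region.
    kept = [h for h in hits if h[2] > perc_overlap * bed_length]
    if not kept:
        return [], []
    # Step 2: keep only hits whose overlap is close to the best overlap.
    max_overlap = max(h[2] for h in kept)
    near_max = [h for h in kept if h[2] > perc_max * max_overlap]
    # Split by strand relative to the query region.
    proper = [h for h in near_max if h[1] == bed_strand]
    non_proper = [h for h in near_max if h[1] != bed_strand]
    return proper, non_proper

example_hits = [("a", "+", 100), ("b", "-", 90), ("c", "+", 75), ("d", "+", 50)]
proper, non_proper = split_hits(100, "+", example_hits)
print(proper, non_proper)
```

The two returned lists play the role of P_dic and N_dic; choosing the final annotation among them is the job of the priority rule in optimize_annotation().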

“annotated_bed” data class

class data_structure.annotated_bed(x=None, **kwargs)

Stores, compares, and clusters genomic regions with RNA annotation information. Used by the program Select_stronginteraction_pp.py.

Cluster(c)

Store the cluster information of this object.

Parameters: c – cluster index

Example:

>>> a=annotated_bed(chr="chr13",start=40975747,end=40975770)
>>> a.Cluster(3)
>>> print a.cluster
3

Note

a.cluster holds the count information when a becomes a cluster object in Select_stronginteraction_pp.py.

Update(S, E)

Update the lower and upper bounds of the cluster after segments are added using Union-Find.

Parameters:
  • S – start location of the newly added genomic segment
  • E – end location of the newly added genomic segment

Example:

>>> a=annotated_bed(chr="chr13",start=40975747,end=40975770)
>>> a.Update(40975700,40975800)
>>> print a.start, a.end
40975700 40975800
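
To show how bounds tracking can sit alongside Union-Find, here is a small self-contained sketch. The ClusterNode class and union function are hypothetical and are not taken from data_structure.py. The sketch assumes that updating means widening the bounds with min/max, which is consistent with the example above but not confirmed by it.

```python
class ClusterNode(object):
    """Hypothetical Union-Find element that also tracks cluster bounds."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.parent = self          # each segment starts as its own root
        self.count = 1              # number of segments in this cluster

    def find(self):
        # Path halving keeps later lookups short.
        node = self
        while node.parent is not node:
            node.parent = node.parent.parent
            node = node.parent
        return node

    def update(self, s, e):
        # Widen the bounds so they cover the newly added segment.
        self.start = min(self.start, s)
        self.end = max(self.end, e)

def union(a, b):
    """Merge the clusters of a and b and return the surviving root."""
    root_a, root_b = a.find(), b.find()
    if root_a is root_b:
        return root_a
    root_b.parent = root_a
    root_a.count += root_b.count
    root_a.update(root_b.start, root_b.end)
    return root_a

a = ClusterNode(40975747, 40975770)
b = ClusterNode(40975700, 40975800)
root = union(a, b)
print(root.start, root.end, root.count)
```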

__init__(x=None, **kwargs)

Initialization example:

>>> line="chr13  40975747        40975770        +       ATTAAG...TGA    protein_coding  gcnt2   intron"
>>> a=annotated_bed(line)
or
>>> a=annotated_bed(chr="chr13",start=40975747,end=40975770,strand='+',type="protein_coding")

__lt__(other)

Compare self with other when the two regions do not overlap.

Parameters: other – another annotated_bed object
Returns: boolean, or None if the two regions overlap.

Example:

>>> a=annotated_bed(chr="chr13",start=40975747,end=40975770)
>>> b=annotated_bed(chr="chr13",start=10003212,end=10005400)
>>> print a<b
False
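
The ordering rule can be illustrated with a stripped-down class. This is a sketch only: Interval is a hypothetical stand-in for annotated_bed, and ordering by chromosome name and then start position is an assumption about how non-overlapping regions are compared.

```python
class Interval(object):
    """Hypothetical stand-in for annotated_bed with only chr/start/end."""

    def __init__(self, chr, start, end):
        self.chr, self.start, self.end = chr, start, end

    def overlap(self, other):
        return (self.chr == other.chr
                and self.start < other.end and other.start < self.end)

    def __lt__(self, other):
        if self.overlap(other):
            return None              # ordering is undefined for overlapping regions
        # Assumed ordering: chromosome name first, then start position.
        return (self.chr, self.start) < (other.chr, other.start)

a = Interval("chr13", 40975747, 40975770)
b = Interval("chr13", 10003212, 10005400)
print(a < b)
```

Returning None for overlapping regions means callers need to check for overlap before relying on the result in a sort.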

__str__()

Return the cluster information as a string (chr, start, end, type, name, subtype, cluster), so that the object can be passed directly to print.

Example:

>>> line="chr13  40975747        40975770        +       ATTAAG...TGA    protein_coding  gcnt2   intron"
>>> a=annotated_bed(line)
>>> a.Cluster(3)
>>> a.Update(40975700,40975800)
>>> print a
chr13  40975700        40975800        protein_coding  gcnt2   intron  3

overlap(other)

Test whether this region overlaps other.

Parameters: other – another annotated_bed object
Returns: boolean

“RNAstructure” class

class RNAstructure.RNAstructure(exe_path=None)

Interface class for RNAstructure executable programs.

DuplexFold(seq1=None, seq2=None, dna=False)

Use the “DuplexFold” program to calculate the minimum binding energy between two input sequences.

Parameters:
  • seq1, seq2 – two DNA/RNA sequences as strings, or names of existing fasta files
  • dna – boolean; specifies that DNA parameters are to be used
Returns:

minimum binding energy (unit: kcal/mol)

Example:

>>> from RNAstructure import RNAstructure
>>> RNA_prog = RNAstructure(exe_path="/home/yu68/Software/RNAstructure/exe/")
>>> seq1 = "TAGACTGATCAGTAAGTCGGTA"
>>> seq2 = "GACTAGCTTAGGTAGGATAGTCAGTA"
>>> energy=RNA_prog.DuplexFold(seq1,seq2)
>>> print energy
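
Because the inputs may be given either as strings or as fasta file names, a wrapper of this kind typically writes string inputs to temporary fasta files before calling the executable. The sketch below builds such a command line without running it. The helper names are hypothetical, and the argument order (two sequence files, then a ct output file) and the -d flag for DNA parameters are assumptions that should be checked against the command-line help of the installed RNAstructure version.

```python
import os
import tempfile

def write_fasta(seq, name, folder):
    """Write one sequence to a FASTA file and return its path."""
    path = os.path.join(folder, name + ".fa")
    with open(path, "w") as handle:
        handle.write(">%s\n%s\n" % (name, seq))
    return path

def duplexfold_command(exe_path, seq1, seq2, folder, dna=False):
    """Build (but do not run) a DuplexFold command line."""
    fasta1 = write_fasta(seq1, "seq1", folder)
    fasta2 = write_fasta(seq2, "seq2", folder)
    ct_file = os.path.join(folder, "duplex.ct")
    # Assumed argument order: <seq file 1> <seq file 2> <ct file> [options]
    cmd = [os.path.join(exe_path, "DuplexFold"), fasta1, fasta2, ct_file]
    if dna:
        cmd.append("-d")            # assumed flag for DNA parameters
    return cmd

workdir = tempfile.mkdtemp()
cmd = duplexfold_command("/path/to/RNAstructure/exe",
                         "TAGACTGATCAGTAAGTCGGTA",
                         "GACTAGCTTAGGTAGGATAGTCAGTA", workdir)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.call, after which the energy would be read from the ct file the program writes.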

Fold(seq=None, ct_name=None, sso_file=None, Num=1)

Use the “Fold” program to predict the secondary structure and return it in dot format.

Parameters:
  • seq – one DNA/RNA sequence as a string, or the name of an existing fasta file
  • ct_name – name of the ct file to write; if not given, the ct file is stored in a temporary location (default: None)
  • sso_file – a single strand offset file; for the format, see http://rna.urmc.rochester.edu/Text/File_Formats.html#Offset
  • Num – index of the predicted structure to return (the Num-th structure)
Returns:

RNA sequence and the dot format of its secondary structure.

Example:

>>> from RNAstructure import RNAstructure
>>> RNA_prog = RNAstructure(exe_path="/home/yu68/Software/RNAstructure/exe/")
>>> seq = "AUAUAAUUAAAAAAUGCAACUACAAGUUCCGUGUUUCUGACUGUUAGUUAUUGAGUUAUU"
>>> sequence,dot=RNA_prog.Fold(seq)
>>> assert(seq==sequence)

__init__(exe_path=None)

Initialize the object.

Parameters: exe_path – the folder path of the RNAstructure executables

Example:

>>> from RNAstructure import RNAstructure
>>> RNA_prog = RNAstructure(exe_path="/home/yu68/Software/RNAstructure/exe/")

scorer(ct_name1, ct_name2)

Use the “scorer” program to compare a predicted secondary structure to an accepted structure. It calculates two quality metrics: sensitivity and PPV (positive predictive value).

Parameters:
  • ct_name1 – The name of a CT file containing predicted structure data.
  • ct_name2 – The name of a CT file containing accepted structure data. This file can store only one structure.
Returns:

sensitivity, PPV, number of the best predicted structure.

Example:

>>> ct_name1 = "temp_prediction.ct"
>>> ct_name2 = "temp_accept.ct"
>>> from RNAstructure import RNAstructure
>>> RNA_prog = RNAstructure(exe_path="/home/yu68/Software/RNAstructure/exe/")
>>> sensitivity, PPV, Number = RNA_prog.scorer(ct_name1,ct_name2)
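
For reference, the two metrics can be computed from sets of base pairs as shown below: sensitivity is the fraction of accepted pairs that were predicted, and PPV is the fraction of predicted pairs that are accepted. This sketch counts exact matches only; the score_pairs name is hypothetical, and the scorer program may apply its own matching rules, so its numbers can differ.

```python
def score_pairs(predicted, accepted):
    """Return (sensitivity, ppv) for two collections of (i, j) base pairs."""
    predicted, accepted = set(predicted), set(accepted)
    shared = predicted & accepted               # pairs present in both structures
    sensitivity = len(shared) / float(len(accepted)) if accepted else 0.0
    ppv = len(shared) / float(len(predicted)) if predicted else 0.0
    return sensitivity, ppv

predicted_pairs = [(1, 10), (2, 9), (3, 8), (4, 7)]
accepted_pairs = [(1, 10), (2, 9), (5, 12)]
sens, ppv = score_pairs(predicted_pairs, accepted_pairs)
print(sens, ppv)
```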

RNAstructure.dot2block(dot_string, name='Default')

Convert the dot format of an RNA secondary structure into several linked blocks.

Parameters:
  • dot_string – the dot format of RNA secondary structure
  • name – name of the RNA
Returns:

A list of all stems; each stem is a dictionary with ‘source’ and ‘target’ keys.

Example:

>>> from RNAstructure import dot2block
>>> stems = dot2block("(((((...)))...(((...)))..))","RNA_X")
>>> print stems
[{'source': {'start': 2, 'chr': 'test', 'end': 4}, 'target': {'start': 8, 'chr': 'test', 'end': 10}}, {'source': {'start': 14, 'chr': 'test', 'end': 16}, 'target': {'start': 20, 'chr': 'test', 'end': 22}}, {'source': {'start': 0, 'chr': 'test', 'end': 1}, 'target': {'start': 25, 'chr': 'test', 'end': 26}}]
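
One way to obtain stems like these is to pair brackets with a stack and then merge pairs that stack directly on one another. The sketch below is not the RNAstructure.dot2block source; dot_to_stems is a hypothetical name, and the 0-based inclusive coordinates are inferred from the example output above.

```python
def dot_to_stems(dot_string, name="Default"):
    """Group the base pairs of a dot-bracket string into stems."""
    stack, pairs = [], []
    for i, symbol in enumerate(dot_string):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))      # pairs come out in closing order
    stems = []
    for left, right in pairs:
        last = stems[-1] if stems else None
        # Extend the current stem when this pair stacks directly outside it.
        if (last and last["source"]["start"] == left + 1
                and last["target"]["end"] == right - 1):
            last["source"]["start"] = left
            last["target"]["end"] = right
        else:
            stems.append({"source": {"chr": name, "start": left, "end": left},
                          "target": {"chr": name, "start": right, "end": right}})
    return stems

stems = dot_to_stems("(((((...)))...(((...)))..))", "RNA_X")
for stem in stems:
    print(stem["source"], stem["target"])
```

The sketch assumes balanced brackets; an unmatched “)” would raise an IndexError when the stack is empty.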