Home
Discussion and Lectures
Intro to the Workshop
Schedule
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
Closing thoughts
Data Reduction
Files and Filetypes
Prepare dataset
Preprocessing raw data
Indexing a Genome
Alignment with Star
Generating counts tables
Data analysis
Annotation from BioMart
Prepare R for data analysis
Differential Expression Analysis
Pathway Analysis
Prerequisites
CLI
R
Support
Cheat Sheets
Software and Links
Scripts
ETC
CAT website
Github page
Report Errors

Alignment to Read Counts & Visualization in IGV

Initial Setup
Mapping vs Assembly
Aligners/Mappers
Mapping against the genome vs transcriptome
Counting reads as a proxy for gene expression
Alignments
Running STAR on the experiment
Quality Assurance - Mapping statistics as QA/QC.

Initial Setup

This document assumes preproc htstream has been completed. To catch up to where we are:

cd /share/workshop/$USER/rnaseq_example

refcheck=$(egrep "DONE: Genome generation" References/star.overlap100.gencode.M35/Log.out)
if [[ ! -z $refcheck ]]
then
  ln -s /share/workshop/original_dataset/References/star.overlap100.gencode.M35 /share/workshop/$USER/rnaseq_example/References/.
fi

Mapping vs Assembly

Assembly seeks to put together the puzzle without knowing what the picture is.

The focus is on the pieces, how they fit together.

Mapping (or alignment to a reference) tries to put the puzzle pieces directly onto an image of the picture._

The focus is on the puzzle, regions of the puzzle that contain certain characteristics (ex. what background) that will help you place the piece onto the puzzle.
In mapping the question is more, given a small chunk of sequence, where in the genome did this sequence most likely come from.
The goal then is to find the match(es) with either the “best” edit distance (smallest difference), or all matches with edit distance less than max edit distance. Main issues are:
Large search space
Regions of similarity (aka repeats)
Gaps (INDELS)
Complexity (RNA, splicing, transcripts)

Alignment concepts

Multimappers:
- Reads that align equally well to more than one reference location.
- Generally, multimappers are discounted in variant detection, and are often discounted in counting applications (like RNA-Seq … would “cancel” out anyway).
- Note: multimapper “rescue” in some algorithms (RSEM, eXpress).
Duplicates:
- Reads or read pairs arising from the same original library fragment, either during library preparation (PCR duplicates) or during sequencing (optical duplicates).
- Generally, duplicates can only be detected reliably with paired-end sequencing. If PE, they’re discounted in variant detection, and discounted in counting applications (like RNA-Seq).
Clipping vs Splicing
- soft-clipped: bases in 5’ and 3’ of the read are NOT part of the alignment.
- hard-clipped: bases in 5’ and 3’ of the read are NOT part of the alignment AND those bases have been removed from the read sequence in the BAM file. The ‘real’ sequence length would be length(SEQ)+ count-of-hard-clipped-bases
- From biostars

alignment_figure3

Inner length, insert size, fragment length

alignment_figure4
From This Biostars answer

Considerations when mapping

Placing reads in regions that do not exist in the reference genome (reads extend off the end of linearized fragments) [ mitochondrial, plasmids, structural variants, etc.].
Sequencing errors and genetic variation: alignment between read and true source in genome may have more differences than alignment with some other copy of repeat.
What if the closest fully sequenced genome is too divergent?
Placing reads in repetitive regions: Some algorithms only return 1 mapping; If multiple: map quality = 0
Algorithms that use paired-end information => might prefer correct distance over correct alignment.

In RNAseq data, you must also consider effect of splice junctions, reads may span an intron.

alignment_figure1

Aligners/Mappers

Many alignment algorithms to choose from. Examples include:

Spliced Aligners
- STAR
- HiSAT2 (formerly Tophat [Bowtie2])
- GMAP/GSNAP
- SOAPsplice
- MapSplice
Aligners that can ’clip’
- bwa-mem
- Bowtie2 in local mode

Pseudo-aligners (salmon and kalisto)

Quasi-mapping
Probabilistic
Map to transcripts, not genome
Does transcript quantifications (or gene)
Blazing FAST and can run on most laptops
Experience suggests differences between “traditional” mappers are in the low abundance genes.

Mapping against the genome vs transcriptome

May seem intuitive to map RNAseq data to transcriptome, but it is not that simple.

Transcriptomes are rarely complete.
Which transcript of a gene should you map to? canonical transcript (which is that)?
Shouldn’t map to all splice variants as these would show up as ‘multi-mappers’.

More so, a aligner will try to map every read, somewhere, provided the alignment meets its minimum requirements.

Need to provide a mapper with all possible places the read could have arisen from, which is best represented by the genome. Otherwise, you get mis-mapping because its close enough.

Genome and Genome Annotation

Genome sequence fasta files and annotation (gff, gtf) files go together! These should be identified at the beginning of analysis.

Genome fasta files should include all primary chromosomes, unplaced sequences and un-localized sequences, as well as any organelles. Should not contain any contigs that represent patches, or alternative haplotypes.
If you expect contamination, or the presence of additional sequence/genome, add the sequence(s) to the genome fasta file.
Annotation file should be GTF (preferred), and should be the most comprehensive you can find.
- Chromosome names in the GTF must match those in the fasta file (they don’t always do).

Counting reads as a proxy for gene expression

The more you can count (and HTS sequencing systems can count a lot) the better the measure of copy number for even rare transcripts in a population.

Most RNA-seq techniques deal with count data. Reads are mapped to a reference genome, transcripts are detected, and the number of reads that map to a transcript (or gene) are counted.
Read counts for a transcript are roughly proportional to the gene’s length and transcript abundance (in whole transcript methods).

Technical artifacts should be considered during counting

Mapping quality
Map-ability (uniqueness), the read is not ambiguous

Options are (STAR, HTSEQ, featureCounts)

The HTSEQ way

Given a sam/bam file with aligned sequence reads and a list of genomic feature (genes locations), we wish to count the number of reads (fragments) than overlap each feature.
- Features are defined by intervals, they have a start and stop position on a chromosome.
- For this workshop and analysis, features are genes which are the union of all its exons. You could consider each exon as a feature, for alternative splicing.
Htseq-count has three overlapping modes
- union
- intersection-strict
- intersection-nonempty

alignment_figure2 from the HTSeq Paper

Star Implementation

Counts coincide with Htseq-counts under default parameters (union and tests all orientations). Need to specify GTF file at genome generation step or during mapping.

Output, 4 columns
- GeneID
- Counts for unstranded
- Counts for first read strand
- Counts for second read strand

N_unmapped 213761 213761 213761 N_multimapping 132491 132491 132491 N_noFeature 309643 2619257 322976 N_ambiguous 116750 892 47031 ENSMUSG00000102693 0 0 0 ENSMUSG00000064842 0 0 0 ENSMUSG00000051951 0 0 0 ENSMUSG00000102851 0 0 0 ENSMUSG00000103377 0 0 0 ENSMUSG00000104017 0 0 0

Choose the appropriate column given the library preparation characteristics and generate a matrix expression table, columns are samples, rows are genes.

What does stranded and unstranded mean? Which is better and why? Stranded vs Unstraded

Alignments

We are now ready to try an alignment:

cd /share/workshop/$USER/rnaseq_example

Then run the star commands

mkdir tmp
STAR \
--runThreadN 4 \
    --genomeDir /share/workshop/$USER/rnaseq_example/References/star.overlap100.gencode.M35 \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts \
    --outFileNamePrefix tmp/mouse_110_WT_C.htstream_ \
    --readFilesCommand zcat \
    --readFilesIn  01-HTS_Preproc/mouse_110_WT_C/mouse_110_WT_C_R1.fastq.gz  01-HTS_Preproc/mouse_110_WT_C/mouse_110_WT_C_R2.fastq.gz

In the command, we are telling star to count reads on a gene level (‘–quantMode GeneCounts’), the prefix for all the output files will be mouse_110WT_C.htstream, the command to unzip the files (zcat), and finally, the input file pair.

Now let’s take a look at an alignment in IGV.

We first need to index the bam file, will use ‘samtools’ for this step, which is a program to manipulate SAM/BAM files. Take a look at the options for samtools and ‘samtools index’.

module load samtools
samtools
samtools index

We need to index the BAM file:

cd /share/workshop/$USER/rnaseq_example/HTS_testing
samtools index mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam

IF for some reason it didn’t finish, is corrupted or you missed the session, you can copy over a completed copy.

cp /share/biocore/workshops/2023-June-mRNASeq/HTS_testing/mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam* /share/workshop/$USER/rnaseq_example/HTS_testing

Transfer mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam and mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam.bai (the index file) to your computer using scp,FileZilla or winSCP.

Windows users can use WinSCP or FileZilla, both of which are GUI based.

Mac/Linux users can use scp. In a new shell session on my laptop. NOT logged into tadpole. Replace [your_username] with your username.

mkdir ~/rnaseq_workshop
cd ~/rnaseq_workshop
scp [your_username]@tadpole.genomecenter.ucdavis.edu:/share/workshop/mrnaseq_workshop/[your_username]/rnaseq_example/HTS_testing/mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam* .

Its ok if the mkdir command fails (“File exists”) because we aleady created the directory earlier.

Now we are ready to use IGV.

Go to the IGV page at the Broad Institute.

index_igv1

And then navigate to the download page, IGV download

index_igv2

Here you can download IGV for your respective platform (Window, Mac OSX, Linux), but we are going to use the web application they supply, IGV web app.

index_igv3

The first thing we want to do is load the Mouse genome. Click on “Genomes” in the menu and choose “Mouse (GRCm39/mm39)”.

index_igv4

Now let’s load the alignment bam and index files. Click on “Tracks” and choose “Local File …”.

index_igv5

Navigate to where you transferred the bam and index file and select them both.

index_igv6

Now your alignment is loaded. Any loaded bam file aligned to a genome is called a “track”.

index_igv7

Lets take a look at the alignment associated with the gene Fn1, and if for some reason it doesn’t find HBB (web IGV can be fickle) go to position chr1:71,610,633-71,628,073. If you don’t see any reads, this likely means your in the wrong genome, double check that it says mm39 in the top left.

index_igv8

index_igv9

You can zoom in by clicking on the plus sign (top right) or zoom out by clicking the negative sign (click it twice). You also may have to move around by clicking and dragging in the BAM track window.

You can also zoom in by clicking and dragging across the number line at the top. That section will highlight, and when you release the button, it will zoom into that section.

index_igv10

index_igv11

Reset the window by searching for Fn1 again, and then zoom in 2 steps.

index_igv12

See that the reads should be aligning within the exons in the gene. This makes sense, since RNA-Seq reads are from exons. Play with the settings on the right hand side a bit and selecting reads.

index_igv13

index_igv14

index_igv15

index_igv16

index_igv17

Running STAR on the experiment

work through Running STAR on all samples
work through QA/QC of the experiment
complete the questions at the end

We can now run STAR across all samples on the real data using a SLURM script, star.slurm, that we should take a look at now.

cd /share/workshop/$USER/rnaseq_example  # We'll run this from the main directory
wget https://raw.githubusercontent.com/ucsf-cat-bioinformatics/2024-08-RNA-Seq-Analysis/master/software_scripts/scripts/star.sh
less star.sh

#!/bin/bash

## assumes star version 2.7.11b
## assumes STAR is available on the Path

start=`date +%s`
echo $HOSTNAME

outpath='02-STAR_alignment'
[[ -d ${outpath} ]] || mkdir ${outpath}

REF="References/star.overlap100.gencode.M35"

for sample in `cat samples.txt`
do
  [[ -d ${outpath}/${sample} ]] || mkdir ${outpath}/${sample}
  echo "SAMPLE: ${sample}"

  call="STAR
       --runThreadN 8 \
       --genomeDir $REF \
       --outSAMtype BAM SortedByCoordinate \
       --readFilesCommand zcat \
       --readFilesIn 01-HTS_Preproc/${sample}/${sample}_R1.fastq.gz 01-HTS_Preproc/${sample}/${sample}_R2.fastq.gz \
       --quantMode GeneCounts \
       --outFileNamePrefix ${outpath}/${sample}/${sample}_ \
       > ${outpath}/${sample}/${sample}-STAR.stdout 2> ${outpath}/${sample}/${sample}-STAR.stderr"

  echo $call
  eval $call
done

end=`date +%s`
runtime=$((end-start))
echo $runtime

When you are done, type “q” to exit.

After looking at the script, lets run it.

bash star.sh  # moment of truth!

Quality Assurance - Mapping statistics as QA/QC.

Once your jobs have finished successfully.

Use a script of ours, star_stats.sh to collect the alignment stats. Don’t worry about the script’s contents at the moment; you’ll use very similar commands to create a counts table in the next section. For now:

cd /share/workshop/$USER/rnaseq_example # We'll run this from the main directory
wget https://raw.githubusercontent.com/ucsf-cat-bioinformatics/2024-08-RNA-Seq-Analysis/master/software_scripts/scripts/star_stats.sh
bash star_stats.sh

#!/bin/bash

echo -en "sample_names" > names.txt
echo -en "total_in_feature\t" > totals.txt
cat 02-STAR_alignment/*/*ReadsPerGene.out.tab | head -4 | cut -f1 > stats.txt
cat samples.txt | while read sample; do
    echo ${sample}
    echo -en "\t${sample}" >> names.txt
    head -4 02-STAR_alignment/${sample}/${sample}_ReadsPerGene.out.tab | cut -f4 > temp1
    paste stats.txt temp1 > temp2
    mv temp2 stats.txt
    tail -n +5 02-STAR_alignment/${sample}/${sample}_ReadsPerGene.out.tab | cut -f4 | \
        perl -ne '$tot+=$_ }{ print "$tot\t"' >> totals.txt
done
echo -en "\n" >> names.txt
cat names.txt stats.txt totals.txt > temp1
mv temp1 summary_star_alignments.txt
rm stats.txt
rm names.txt
rm totals.txt

Transfer summary_star_alignments.txt to your computer.

Or in case of emergency, download this copy: summary_star_alignments.txt

Questions:

Look through the files in an output directory and check out what is present and discuss what each of them mean. (for example: cd /share/workshop/$USER/rnaseq_example/02-STAR_alignment/mouse_110_WT_C )
Come up with a brief command you might use to check that all of the sample alignments using STAR have a reasonable output and/or did not produce any errors.
Open summary_star_alignments.txt in excel (or excel like application), and review. The table that this script creates (“summary_star_alignments.txt”). Are all samples behaving similarly? Discuss …
If time, find some other regions/genes with high expression using IGV with your group. (Looking at genes the paper references is a great place to start)

☰ Menu

Bulk RNA-Seq Analysis