☰ Menu

      Bulk RNA-Seq Analysis

Home
Discussion and Lectures
Intro to the Workshop
Schedule
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
Closing thoughts
Data Reduction
Files and Filetypes
Prepare dataset
Preprocessing raw data
Indexing a Genome
Alignment with Star
Generating counts tables
Data analysis
Annotation from BioMart
Prepare R for data analysis
Differential Expression Analysis
Pathway Analysis
Prerequisites
CLI
R
Support
Cheat Sheets
Software and Links
Scripts
ETC
CAT website
Github page
Report Errors

Alignment to Read Counts & Visualization in IGV

  1. Initial Setup
  2. Mapping vs Assembly
  3. Aligners/Mappers
  4. Mapping against the genome vs transcriptome
  5. Counting reads as a proxy for gene expression
  6. Alignments
  7. Running STAR on the experiment
  8. Quality Assurance - Mapping statistics as QA/QC.

Initial Setup

This document assumes preproc htstream has been completed. To catch up to where we are:

cd /share/workshop/$USER/rnaseq_example

refcheck=$(egrep "DONE: Genome generation" References/star.overlap100.gencode.M35/Log.out)
if [[ ! -z $refcheck ]]
then
  ln -s /share/workshop/original_dataset/References/star.overlap100.gencode.M35 /share/workshop/$USER/rnaseq_example/References/.
fi

Mapping vs Assembly

Assembly seeks to put together the puzzle without knowing what the picture is.

Mapping (or alignment to a reference) tries to put the puzzle pieces directly onto an image of the picture._

Alignment concepts

alignment_figure3

alignment_figure4
From This Biostars answer

Considerations when mapping

In RNAseq data, you must also consider effect of splice junctions, reads may span an intron.

alignment_figure1


Aligners/Mappers

Many alignment algorithms to choose from. Examples include:

Pseudo-aligners (salmon and kalisto)


Mapping against the genome vs transcriptome

May seem intuitive to map RNAseq data to transcriptome, but it is not that simple.

More so, a aligner will try to map every read, somewhere, provided the alignment meets its minimum requirements.

Genome and Genome Annotation

Genome sequence fasta files and annotation (gff, gtf) files go together! These should be identified at the beginning of analysis.


Counting reads as a proxy for gene expression

The more you can count (and HTS sequencing systems can count a lot) the better the measure of copy number for even rare transcripts in a population.

Technical artifacts should be considered during counting

Options are (STAR, HTSEQ, featureCounts)

The HTSEQ way

alignment_figure2 from the HTSeq Paper

Star Implementation

Counts coincide with Htseq-counts under default parameters (union and tests all orientations). Need to specify GTF file at genome generation step or during mapping.

N_unmapped 213761 213761 213761 N_multimapping 132491 132491 132491 N_noFeature 309643 2619257 322976 N_ambiguous 116750 892 47031 ENSMUSG00000102693 0 0 0 ENSMUSG00000064842 0 0 0 ENSMUSG00000051951 0 0 0 ENSMUSG00000102851 0 0 0 ENSMUSG00000103377 0 0 0 ENSMUSG00000104017 0 0 0

Choose the appropriate column given the library preparation characteristics and generate a matrix expression table, columns are samples, rows are genes.

What does stranded and unstranded mean? Which is better and why? Stranded vs Unstraded


Alignments

  1. We are now ready to try an alignment:
cd /share/workshop/$USER/rnaseq_example
  1. Then run the star commands
mkdir tmp
STAR \
--runThreadN 4 \
    --genomeDir /share/workshop/$USER/rnaseq_example/References/star.overlap100.gencode.M35 \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts \
    --outFileNamePrefix tmp/mouse_110_WT_C.htstream_ \
    --readFilesCommand zcat \
    --readFilesIn  01-HTS_Preproc/mouse_110_WT_C/mouse_110_WT_C_R1.fastq.gz  01-HTS_Preproc/mouse_110_WT_C/mouse_110_WT_C_R2.fastq.gz

In the command, we are telling star to count reads on a gene level (‘–quantMode GeneCounts’), the prefix for all the output files will be mouse_110WT_C.htstream, the command to unzip the files (zcat), and finally, the input file pair.

Now let’s take a look at an alignment in IGV.

  1. We first need to index the bam file, will use ‘samtools’ for this step, which is a program to manipulate SAM/BAM files. Take a look at the options for samtools and ‘samtools index’.
module load samtools
samtools
samtools index

We need to index the BAM file:

cd /share/workshop/$USER/rnaseq_example/HTS_testing
samtools index mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam

IF for some reason it didn’t finish, is corrupted or you missed the session, you can copy over a completed copy.

cp /share/biocore/workshops/2023-June-mRNASeq/HTS_testing/mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam* /share/workshop/$USER/rnaseq_example/HTS_testing
  1. Transfer mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam and mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam.bai (the index file) to your computer using scp,FileZilla or winSCP.

Windows users can use WinSCP or FileZilla, both of which are GUI based.

Mac/Linux users can use scp. In a new shell session on my laptop. NOT logged into tadpole. Replace [your_username] with your username.

mkdir ~/rnaseq_workshop
cd ~/rnaseq_workshop
scp [your_username]@tadpole.genomecenter.ucdavis.edu:/share/workshop/mrnaseq_workshop/[your_username]/rnaseq_example/HTS_testing/mouse_110_WT_C.htstream_Aligned.sortedByCoord.out.bam* .

Its ok if the mkdir command fails (“File exists”) because we aleady created the directory earlier.

  1. Now we are ready to use IGV.

Go to the IGV page at the Broad Institute.

index_igv1

And then navigate to the download page, IGV download

index_igv2

Here you can download IGV for your respective platform (Window, Mac OSX, Linux), but we are going to use the web application they supply, IGV web app.

index_igv3

  1. The first thing we want to do is load the Mouse genome. Click on “Genomes” in the menu and choose “Mouse (GRCm39/mm39)”.

index_igv4

  1. Now let’s load the alignment bam and index files. Click on “Tracks” and choose “Local File …”.

index_igv5

Navigate to where you transferred the bam and index file and select them both.

index_igv6

Now your alignment is loaded. Any loaded bam file aligned to a genome is called a “track”.

index_igv7

  1. Lets take a look at the alignment associated with the gene Fn1, and if for some reason it doesn’t find HBB (web IGV can be fickle) go to position chr1:71,610,633-71,628,073. If you don’t see any reads, this likely means your in the wrong genome, double check that it says mm39 in the top left.

index_igv8

index_igv9

You can zoom in by clicking on the plus sign (top right) or zoom out by clicking the negative sign (click it twice). You also may have to move around by clicking and dragging in the BAM track window.

You can also zoom in by clicking and dragging across the number line at the top. That section will highlight, and when you release the button, it will zoom into that section.

index_igv10

index_igv11

Reset the window by searching for Fn1 again, and then zoom in 2 steps.

index_igv12

  1. See that the reads should be aligning within the exons in the gene. This makes sense, since RNA-Seq reads are from exons. Play with the settings on the right hand side a bit and selecting reads.

index_igv13

index_igv14

index_igv15

index_igv16

index_igv17


Running STAR on the experiment

  1. We can now run STAR across all samples on the real data using a SLURM script, star.slurm, that we should take a look at now.
cd /share/workshop/$USER/rnaseq_example  # We'll run this from the main directory
wget https://raw.githubusercontent.com/ucsf-cat-bioinformatics/2024-08-RNA-Seq-Analysis/master/software_scripts/scripts/star.sh
less star.sh
#!/bin/bash

## assumes star version 2.7.11b
## assumes STAR is available on the Path

start=`date +%s`
echo $HOSTNAME

outpath='02-STAR_alignment'
[[ -d ${outpath} ]] || mkdir ${outpath}

REF="References/star.overlap100.gencode.M35"

for sample in `cat samples.txt`
do
  [[ -d ${outpath}/${sample} ]] || mkdir ${outpath}/${sample}
  echo "SAMPLE: ${sample}"

  call="STAR
       --runThreadN 8 \
       --genomeDir $REF \
       --outSAMtype BAM SortedByCoordinate \
       --readFilesCommand zcat \
       --readFilesIn 01-HTS_Preproc/${sample}/${sample}_R1.fastq.gz 01-HTS_Preproc/${sample}/${sample}_R2.fastq.gz \
       --quantMode GeneCounts \
       --outFileNamePrefix ${outpath}/${sample}/${sample}_ \
       > ${outpath}/${sample}/${sample}-STAR.stdout 2> ${outpath}/${sample}/${sample}-STAR.stderr"

  echo $call
  eval $call
done

end=`date +%s`
runtime=$((end-start))
echo $runtime

When you are done, type “q” to exit.

  1. After looking at the script, lets run it.
bash star.sh  # moment of truth!

Quality Assurance - Mapping statistics as QA/QC.

  1. Once your jobs have finished successfully.

Use a script of ours, star_stats.sh to collect the alignment stats. Don’t worry about the script’s contents at the moment; you’ll use very similar commands to create a counts table in the next section. For now:

cd /share/workshop/$USER/rnaseq_example # We'll run this from the main directory
wget https://raw.githubusercontent.com/ucsf-cat-bioinformatics/2024-08-RNA-Seq-Analysis/master/software_scripts/scripts/star_stats.sh
bash star_stats.sh
#!/bin/bash

echo -en "sample_names" > names.txt
echo -en "total_in_feature\t" > totals.txt
cat 02-STAR_alignment/*/*ReadsPerGene.out.tab | head -4 | cut -f1 > stats.txt
cat samples.txt | while read sample; do
    echo ${sample}
    echo -en "\t${sample}" >> names.txt
    head -4 02-STAR_alignment/${sample}/${sample}_ReadsPerGene.out.tab | cut -f4 > temp1
    paste stats.txt temp1 > temp2
    mv temp2 stats.txt
    tail -n +5 02-STAR_alignment/${sample}/${sample}_ReadsPerGene.out.tab | cut -f4 | \
        perl -ne '$tot+=$_ }{ print "$tot\t"' >> totals.txt
done
echo -en "\n" >> names.txt
cat names.txt stats.txt totals.txt > temp1
mv temp1 summary_star_alignments.txt
rm stats.txt
rm names.txt
rm totals.txt

Transfer summary_star_alignments.txt to your computer.

Or in case of emergency, download this copy: summary_star_alignments.txt

Questions:

  1. Look through the files in an output directory and check out what is present and discuss what each of them mean. (for example: cd /share/workshop/$USER/rnaseq_example/02-STAR_alignment/mouse_110_WT_C )
  2. Come up with a brief command you might use to check that all of the sample alignments using STAR have a reasonable output and/or did not produce any errors.
  3. Open summary_star_alignments.txt in excel (or excel like application), and review. The table that this script creates (“summary_star_alignments.txt”). Are all samples behaving similarly? Discuss …
  4. If time, find some other regions/genes with high expression using IGV with your group. (Looking at genes the paper references is a great place to start)