1 Davis Auto Aligner Documentation

This Auto Aligner script is intended to perform bulk read mapping alignment and quantification of multiple high-throughput RNA-seq data samples. It is written in the Python programming language, and is executed within the conda python environment system on Linux/Unix or any Linux-based distribution. For help setting up the conda environment, refer to the appropriate section of this guide. The aligner script performs two critical tasks: adapter trimming and sequence read alignment. It does so by calling the relevant tool for each sample and, using a for loop, applies the same process to each sample within the directory the script is executed in. This documentation provides instruction for the use of the script in several use cases, as well as a detailed explanation of the script components. Lastly, there is guidance for setting up the conda environment, installing the necessary packages, and set up prior to beginning any alignment.

1.1 Quick Start

To use the script, open python in the command line and execute the script by simply writing python and pointing to the script file path:

python /file/path/DavisAutoAligner.py

Before executing, you must confirm the following within the script:

  1. Define OutputDir with your Output Directory

  2. Comment/Un-comment contamination filtering section:

a. Update `commandTrim` argument with either `clean1`/`clean2` (contamination filtering) or `currentFile`/`pairedFile`(no contamination filtering)
  1. Check your commandTrim parameters, such as quality score, length, etc.

  2. Then check your commandAlign parameters, add any additional options as needed for your use case. Make sure your Genome file path is correct

I prefer to set up a detected directory to the current alignment, with a directory housing all the fastq files and a directory for all output files, like so:

rnaseq
    ├── fastq
    │   ├── sample1.R1.fastq
    │   ├── sample1.R2.fastq
    │   ├── sample2.R1.fastq
    │   ├── sample2.R2.fastq
    │   ├── sample3.R1.fastq
    │   └── sample3.R2.fastq
    ├── Output
    

Then, navigate into the fastq directory and execute the script there. See an example workflow:

$ pwd
/home/you
$ mkdir project && ls -l
  drwxr-xr-x 1 you you 0 Nov 20 2024 project
  -rwxr-xr-x 1 you you 3407 Jul 10 2023 Aligner.py
$ cd project
$ mkdir fastq ##download your fastq files into this directory
$ mkdir Output && ls -l
  drwxr-xr-x 1 you you 4096 Nov 20 2024 fastq
  drwxr-xr-x 1 you you 0 Nov 20 2024 Output
$ cd fastq
$ python /home/you/Aligner.py

1.2 The Aligner Script Explained

This section will provide step-by-step insight into how the aligner works.

Below is the aligner script without annotations:

#!/home/miniconda3/envs/bin/python
import os
outputDir=""
for fastq in os.scandir():
    if fastq.path.endswith(".fastq.gz") and "R2" in fastq.path:
        pass
    if fastq.path.endswith(".fastq.gz") and "R1" in fastq.path:
        currentFile= fastq.path
        pairedFile=currentFile.replace("R1","R2")
        prefix=currentFile[currentFile.find("/")+1:currentFile.rfind(".")]
        fileDir=outputDir+ "/" +str(prefix)
        os.mkdir(fileDir) 
        print(fileDir)
        fileQC=fileDir+ "/QC"
        os.mkdir(fileQC)
        commandQC = "fastqc -o "+fileQC+ " "+str(currentFile)
        os.system(commandQC)
        print(commandQC)
        commandQC2 = "fastqc -o "+fileQC+ " "+str(pairedFile)
        os.system(commandQC2)
        print(commandQC2)
        # bbmap filtering
        commandbbduk = "bbduk.sh in1="+str(currentFile)+" in2="+str(pairedFile)+" out1=clean1.fastq.gz out2=clean2.fastq.gz ref='/home/ref/humanrRNA.fa.gz','/home/ref/polyT.fa.gz','/home/ref/rRNA.fa','/home/ref/sheep_satellite.fa'"
        clean1 = "clean1.fastq.gz"
        clean2 = "clean2.fastq.gz"
        os.system(commandbbduk)
        print(commandbbduk)
        #running trim galore
        commandTrim="trim_galore --paired --cores # --quality 20 --illumina --fastqc --length 15 "+ str(clean1)+ " "+str(clean2)
        os.system(commandTrim) 
        #aligning via STAR
        for search in os.scandir():
            if "val_1" in search.name and ".fq" in search.name:
                val1= search.name
            if "val_2" in search.name and ".fq" in search.name:
                val2= search.name
        commandAlign="STAR --runThreadN # --genomeDir /home/Genome --readFilesIn "+str(val1)+" "+str(val2)+" --readFilesCommand zcat --quantMode GeneCounts --genomeLoad LoadAndKeep --outFileNamePrefix " +str(prefix) 
        print(commandAlign)
        os.system(commandAlign)
        #organize remaining files
        for file in os.scandir():
            print(file.path)
            if file.path.endswith(".txt"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".out"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".bam"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".zip"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".fq.gz"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".html"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".tab"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith("clean1.fastq.gz"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith("clean2.fastq.gz"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".mate1"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
            if file.path.endswith(".mate2"):
                fileName=file.path[file.path.rfind("/")+1:]
                newDir=fileDir+"/"+fileName
                os.rename(file.path, newDir)
    #final QC
    commandmultiQC = "multiqc "+fileDir+" -o "+fileQC
    os.system(commandmultiQC)

To start, given that we are using python in the conda environment, you may encounter an error running the script if the incorrect python version is called as a default. As such, a shebang at the top of the script to point to the anacoda python distribution can be helpful to avoid or resolve this error. This is an example shebang:

#!/home/yourusername/anaconda/bin/python

You want to find the filepath for python in the conda python distribution installed, typically this form:

#!/home/yourusername/miniconda3/envs/bin/python

After that, we import our necessary package:

import os

This package allows us to use operating system functionality, as such allowing us to execute all command line arguments we generate in the script.

Next, define your output directory. If left undefined, the script will error out.

outputDir="/home/rnaseq/output"

After this set-up, the aligner begins searching the current directory using os.scandir(). The current directory should be a folder of several fastq files. The defined for loop is designed to have an action for each fastq in the directory. For a paired end read search, the parameter is a file name based on “R1” and “R2” naming scheme. The aligner can be updated for a single end read search, where a singular if statement defines the currentFile and applies the alignment process to it.

The current script assumes a paired-end read, thus, as each file is assessed, the if statements determine if it is R1 - forward read, or R2 - reverse read. We define these files as currentFile and pairedFile respectively. It is crucial that the correct two paired files are identified.

Note that this loop is searching for fastq.gz files, which are compressed fastq files that is possibly the form that will be downloaded and found in your directory. Confirm whether your files are compressed or uncompressed. If uncompressed, then the search would simply be updated to “.fastq” since none of the file names will end with “.gz”.

for fastq in os.scandir():
    if fastq.path.endswith(".fastq.gz") and "R2" in fastq.path:
        pass
    if fastq.path.endswith(".fastq.gz") and "R1" in fastq.path:
        currentFile= fastq.path
        pairedFile=currentFile.replace("R1","R2")
        prefix=currentFile[currentFile.find("/")+1:currentFile.rfind(".")]
        fileDir=outputDir+ "/" +str(prefix)
        os.mkdir(fileDir) 
        print(fileDir)

If you want to unzip all the fastq files, in the command line, you can run the following command:

$ gunzip *.fastq.gz

It will take some time for this to run, especially if there are a lot of files in the folder. These fastq files even when compressed are a couple gigabytes.

Once the two paired files are found and assigned, then we use the existing sample nomenclature to name the output folder in the output directory defined by User. We create the directory with os.mkdir(). The progression of the script is followed both with the on-screen outputs of the respective tools, as well as the print() information within the loop.

To ensure pre- and post- quality comparison, a FastQC is run on both samples prior to quality and adapter trimming. The results are organized in a sub folder within the respective output folder named “QC”, created again using os.mkdir().

  fileQC=fileDir+ "/QC"
  os.mkdir(fileQC)
  commandQC = "fastqc -o "+fileQC+ " "+str(currentFile)
  os.system(commandQC)
  print(commandQC)
  commandQC2 = "fastqc -o "+fileQC+ " "+str(pairedFile)
  os.system(commandQC2)
  print(commandQC2)

The command line arguments are passed via os.system and require generating the argument as a String variable. In this case, the arguments are commandQC and commandQC2 for the “R1” and “R2” files. Options can be added directly to the string in commandQC if it was necessary, such as --dup_length or -t depending on what is being accomplished. Find out more about FastQC options by running fastqc --help in the command line. Be wary of spaces when making edits to your argument. This can be a common source of errors.

For commandQC, we are running the “fastqc” package and asking it to output the results (-o) to the specified filepath, which we defined as fileQC, on the currentFile which is the “R1” of this sample. commandQC2 is identical but applied to the “R2” pairedFile.

After quality control, we offer contamination filtering using the BBMap Suite. This suite offers several tools, including alignment and quality trimming. However, we found utility using their BBDuk.sh tool for removal of potential contamination in samples. Attempting to remove potential contaminants from low-quality RNA-seq data can help improve alignment percentage, especially with small libraries.

For the purposes of this use case, the BBDuk.sh tool requires defining your input files, naming your output files and pointing to a reference source for “contaminants”. In this aligner, the contamination removed were human ribosomal RNA, polyT sequences, sheep ribosomal RNA, and sheep satellite DNA.

For more information about BBDuk and the BBMap Suite, visit their guide on the Joint Genome Institute website.

  # bbmap filtering
  commandbbduk = "bbduk.sh in1="+str(currentFile)+" in2="+str(pairedFile)+" out1=clean1.fastq.gz out2=clean2.fastq.gz ref='/home/ref/humanrRNA.fa.gz','/home/ref/polyT.fa.gz','/home/ref/rRNA.fa','/home/ref/sheep_satellite.fa'"
  clean1 = "clean1.fastq.gz"
  clean2 = "clean2.fastq.gz"
  os.system(commandbbduk)
  print(commandbbduk)

After commandbbduk runs using os.system(), clean1 and clean2 are the variables associated with the “R1” and “R2” samples respectively. Without contamination filtering, currentFile and pairedFile remain the appropriate variables, and must be passed into the quality trimming step. If contamination filtering is not needed, the above code chunk can be commented out of the script, but be sure to update the variables in the quality trimming argument.

Quality and adapter trimming is performed using the wrapper package Trim Galore. Trim Galore combines cutadapt and fastqc to provide quality and adapter trimming with simultaneous QC.

  #running trim galore
  commandTrim="trim_galore --paired --cores # --quality 20 --illumina --fastqc --length 15 "+ str(clean1)+ " "+str(clean2)
  os.system(commandTrim) 

In commandTrim, we have set some important options. Firstly, this is a --paired read, as such we are passing two files in, clean1 and clean2 (or currentFile/pairedFile without contamination filtering). Passing a single file for a single end read without updating the “–paired” option can be a source of error, and vice versa. The --quality and --length parameters are important for defining the stringency of what reads are included after trimming. Quality scores are based on Phred 33, the higher the quality score threshold the more bases get removed. Phred 64 can be used with the --phred64 option. With length, the longer the length the more reads get removed. With this aligner, every sample gets a one-size fits all treatment, as such revisiting the pre-QC for certain problem samples can help determine if this step needs to be adjusted and a single re-alignment needs to be done. Quality trimming is performed first, followed by adapter trimming. Trim galore will try to auto-detect the adapter if not defined, in this case we defined the --illumina adapter. You can also define the adapter using -a , such as in the miRNA alignment use case, using the QIAseq miRNA adapter. commandTrim is passed to the command line using os.system().

For more information about the Trim Galore package and possible arguments, visit the GitHub of FelixKreuger of the Babraham Institute.

The most important part of the aligner script is the alignment process itself. We begin by defining a for loop, which creates a nested loop, that searches the current directory. As a refresher, the current directory is filled with several fastq files, but now it has several output files from Trim Galore and BBDuk. More importantly, it now has the trimmed output files for the samples being worked on in the current iteration. Fortunately, cutadapt names the output in an easy to locate manner, as such using os.scandir() again, this nested for loop identifies both output files and attaches them to new and separate variables to pass into the alignment command. These are “val1” and “val2” identified by their unique identifier “val_1” and “val_2” in their file name.

The STAR aligner is the preferred alignment tool in this workflow. The purpose of alignment is to determine where on the genome our reads originated from. The STAR aligner performs this task using a strategy that accounts for spliced alignments, hence it’s name STAR (Spliced Transcripts Alignment to a Reference). In addition, STAR doubles as a quantification tool, providing counts for the number of reads it aligns to a specific gene.

The simple argument for STAR alignment is: STAR --genomeDir /Genome --readFilesIn File1 File2 --quantMode GeneCounts --outFileNamePrefix Sample1

The additional arguments help with some efficiency. The --runThreadsN argument sets the number of cores available. By default, it is 1, thus if your machine or server has several cores available, then consider defining this option for higher processing power. The --readFilesCommand zcat is used to decompress .gz files if they are passed through STAR. This will occur if the original fastq was compressed, since Trim Galore will preserve file compression. Thus, if in the initial search, the parameters were changed from .fastq.gz to just .fastq, then there is no need for --readFilesCommand zcat. The --genomeLoad LoadAndKeep argument helps with memory sharing so that the Genome doesn’t need to be continuously loaded into memory every time STAR is run. Since this aligner is a loop, the first iteration allows the Genome to be loaded and each iteration will go slightly faster.

#aligning via STAR
for search in os.scandir():
    if "val_1" in search.name and ".fq" in search.name:
        val1= search.name
    if "val_2" in search.name and ".fq" in search.name:
        val2= search.name
commandAlign="STAR --runThreadN # --genomeDir /home/Genome --readFilesIn "+str(val1)+" "+str(val2)+" --readFilesCommand zcat --quantMode GeneCounts --genomeLoad LoadAndKeep --outFileNamePrefix " +str(prefix) 
print(commandAlign)
os.system(commandAlign)

The commandAlign argument is a source of many errors, thus, it is important to check this string prior to saving and executing your script. After it is compiled, the commandAlign argument gets passed to the command line with os.system() and our samples then start being aligned with the STAR aligner.

Before STAR can be used for alignment, it requires indexing of the genome of interest. This must be done before you can do this alignment script. It will be covered again in the set-up section, but you can see an example command for the human genome:

$ STAR --runThreadN 12 --runMode genomeGenerate ---genomeDir ~/Genome --genomeFastaFiles ~/Genome/GRCh38.p14_genomic.fna --sjdbGTFfile ~/Genome/GRCh38.p14_genomic.gtf

STAR is incredibly versatile and robust, and has a number of applications. Refer to the STAR manual by Dr. Alexander Dobin for more information about arguments, options, and use cases.

After the alignment is complete, this script prepares for another iteration by organizing the current directory. Since there are several output files, some of which can potentially confound the next search, we again generate another nested for loop with os.scandir(). We check each file for its file type, and using os.rename() it is transferred to the output directory if it is not one that belongs in the current directory. Recall that we only had fastq files in this directory when we first called the script, so we ideally would like there to only be fastq files when the main for loop begins again.

 #organize remaining files
for file in os.scandir():
    print(file.path)
    if file.path.endswith(".txt"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".out"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".bam"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".zip"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".fq.gz"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".html"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".tab"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith("clean1.fastq.gz"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith("clean2.fastq.gz"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".mate1"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)
    if file.path.endswith(".mate2"):
        fileName=file.path[file.path.rfind("/")+1:]
        newDir=fileDir+"/"+fileName
        os.rename(file.path, newDir)

Lastly, we do a final quality check using MultiQC. This tool provides a single aggregate report across several files, therefore it provides an overview assessment of a sample’s alignment. It must be run after the organization of the files, as MultiQC works by reading the contents of a directory and identifying any log files. Thus, commandmultiQC is simply calling the “multiqc” package to run on this sample’s output folder and asking it to output (-o) it’s own files in the QC folder (which we created earlier).

#final QC
commandmultiQC = "multiqc "+fileDir+" -o "+fileQC
os.system(commandmultiQC)

We then pass commandmultiQC to the command line with os.system(). MultiQC was developed and is maintained by Phil Ewels, learn more about it here.

After the MultiQC concludes, the for loop repeats, identifying the next paired samples and repeating the process. Typically, if the STAR aligner is reached and it does not error out, then there is a strong chance everything will work smoothly. Errors after that will typically arise as a result of the data more than the aligner. Some issues can be large library discrepancy in the R1 and R2, an issue in the file name, one odd-one-out compressed/uncompressed file, corrupted fastq, etc.