This Auto Aligner script is intended to perform bulk read mapping
alignment and quantification of multiple high-throughput RNA-seq data
samples. It is written in the Python programming language, and is
executed within the conda python environment system on Linux/Unix or any
Linux-based distribution. For help setting up the conda environment,
refer to the appropriate section of this guide. The aligner script
performs two critical tasks: adapter trimming and sequence read
alignment. It does so by calling the relevant tool for each sample and,
using a for loop, applies the same process to each sample
within the directory the script is executed in. This documentation
provides instruction for the use of the script in several use cases, as
well as a detailed explanation of the script components. Lastly, there
is guidance for setting up the conda environment, installing the
necessary packages, and set up prior to beginning any alignment.
To use the script, open python in the command line and execute the
script by simply writing python and pointing to the script
file path:
Before executing, you must confirm the following within the script:
Define OutputDir with your Output Directory
Comment/Un-comment contamination filtering section:
a. Update `commandTrim` argument with either `clean1`/`clean2` (contamination filtering) or `currentFile`/`pairedFile`(no contamination filtering)
Check your commandTrim parameters, such as quality
score, length, etc.
Then check your commandAlign parameters, add any
additional options as needed for your use case. Make sure your Genome
file path is correct
I prefer to set up a detected directory to the current alignment, with a directory housing all the fastq files and a directory for all output files, like so:
rnaseq
├── fastq
│ ├── sample1.R1.fastq
│ ├── sample1.R2.fastq
│ ├── sample2.R1.fastq
│ ├── sample2.R2.fastq
│ ├── sample3.R1.fastq
│ └── sample3.R2.fastq
├── Output
Then, navigate into the fastq directory and execute the script there. See an example workflow:
$ pwd
/home/you
$ mkdir project && ls -l
drwxr-xr-x 1 you you 0 Nov 20 2024 project
-rwxr-xr-x 1 you you 3407 Jul 10 2023 Aligner.py
$ cd project
$ mkdir fastq ##download your fastq files into this directory
$ mkdir Output && ls -l
drwxr-xr-x 1 you you 4096 Nov 20 2024 fastq
drwxr-xr-x 1 you you 0 Nov 20 2024 Output
$ cd fastq
$ python /home/you/Aligner.pyThis section will provide step-by-step insight into how the aligner works.
Below is the aligner script without annotations:
#!/home/miniconda3/envs/bin/python
import os
outputDir=""
for fastq in os.scandir():
if fastq.path.endswith(".fastq.gz") and "R2" in fastq.path:
pass
if fastq.path.endswith(".fastq.gz") and "R1" in fastq.path:
currentFile= fastq.path
pairedFile=currentFile.replace("R1","R2")
prefix=currentFile[currentFile.find("/")+1:currentFile.rfind(".")]
fileDir=outputDir+ "/" +str(prefix)
os.mkdir(fileDir)
print(fileDir)
fileQC=fileDir+ "/QC"
os.mkdir(fileQC)
commandQC = "fastqc -o "+fileQC+ " "+str(currentFile)
os.system(commandQC)
print(commandQC)
commandQC2 = "fastqc -o "+fileQC+ " "+str(pairedFile)
os.system(commandQC2)
print(commandQC2)
# bbmap filtering
commandbbduk = "bbduk.sh in1="+str(currentFile)+" in2="+str(pairedFile)+" out1=clean1.fastq.gz out2=clean2.fastq.gz ref='/home/ref/humanrRNA.fa.gz','/home/ref/polyT.fa.gz','/home/ref/rRNA.fa','/home/ref/sheep_satellite.fa'"
clean1 = "clean1.fastq.gz"
clean2 = "clean2.fastq.gz"
os.system(commandbbduk)
print(commandbbduk)
#running trim galore
commandTrim="trim_galore --paired --cores # --quality 20 --illumina --fastqc --length 15 "+ str(clean1)+ " "+str(clean2)
os.system(commandTrim)
#aligning via STAR
for search in os.scandir():
if "val_1" in search.name and ".fq" in search.name:
val1= search.name
if "val_2" in search.name and ".fq" in search.name:
val2= search.name
commandAlign="STAR --runThreadN # --genomeDir /home/Genome --readFilesIn "+str(val1)+" "+str(val2)+" --readFilesCommand zcat --quantMode GeneCounts --genomeLoad LoadAndKeep --outFileNamePrefix " +str(prefix)
print(commandAlign)
os.system(commandAlign)
#organize remaining files
for file in os.scandir():
print(file.path)
if file.path.endswith(".txt"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".out"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".bam"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".zip"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".fq.gz"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".html"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".tab"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith("clean1.fastq.gz"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith("clean2.fastq.gz"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".mate1"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".mate2"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
#final QC
commandmultiQC = "multiqc "+fileDir+" -o "+fileQC
os.system(commandmultiQC)To start, given that we are using python in the conda environment, you may encounter an error running the script if the incorrect python version is called as a default. As such, a shebang at the top of the script to point to the anacoda python distribution can be helpful to avoid or resolve this error. This is an example shebang:
You want to find the filepath for python in the conda python distribution installed, typically this form:
After that, we import our necessary package:
This package allows us to use operating system functionality, as such allowing us to execute all command line arguments we generate in the script.
Next, define your output directory. If left undefined, the script will error out.
After this set-up, the aligner begins searching the current directory
using os.scandir(). The current directory should be a
folder of several fastq files. The defined for
loop is designed to have an action for each fastq in the
directory. For a paired end read search, the parameter is a file name
based on “R1” and “R2” naming scheme. The aligner can be updated for a
single end read search, where a singular if statement
defines the currentFile and applies the alignment process
to it.
The current script assumes a paired-end read, thus, as each file is
assessed, the if statements determine if it is R1 - forward
read, or R2 - reverse read. We define these files as
currentFile and pairedFile respectively. It is
crucial that the correct two paired files are identified.
Note that this loop is searching for fastq.gz files, which are compressed fastq files that is possibly the form that will be downloaded and found in your directory. Confirm whether your files are compressed or uncompressed. If uncompressed, then the search would simply be updated to “.fastq” since none of the file names will end with “.gz”.
for fastq in os.scandir():
if fastq.path.endswith(".fastq.gz") and "R2" in fastq.path:
pass
if fastq.path.endswith(".fastq.gz") and "R1" in fastq.path:
currentFile= fastq.path
pairedFile=currentFile.replace("R1","R2")
prefix=currentFile[currentFile.find("/")+1:currentFile.rfind(".")]
fileDir=outputDir+ "/" +str(prefix)
os.mkdir(fileDir)
print(fileDir)If you want to unzip all the fastq files, in the command line, you can run the following command:
It will take some time for this to run, especially if there are a lot of files in the folder. These fastq files even when compressed are a couple gigabytes.
Once the two paired files are found and assigned, then we use the
existing sample nomenclature to name the output folder in the output
directory defined by User. We create the directory with
os.mkdir(). The progression of the script is followed both
with the on-screen outputs of the respective tools, as well as the
print() information within the loop.
To ensure pre- and post- quality comparison, a FastQC is run on both
samples prior to quality and adapter trimming. The results are organized
in a sub folder within the respective output folder named “QC”, created
again using os.mkdir().
fileQC=fileDir+ "/QC"
os.mkdir(fileQC)
commandQC = "fastqc -o "+fileQC+ " "+str(currentFile)
os.system(commandQC)
print(commandQC)
commandQC2 = "fastqc -o "+fileQC+ " "+str(pairedFile)
os.system(commandQC2)
print(commandQC2)The command line arguments are passed via os.system and
require generating the argument as a String variable. In this case, the
arguments are commandQC and commandQC2 for the
“R1” and “R2” files. Options can be added directly to the string in
commandQC if it was necessary, such as
--dup_length or -t depending on what is being
accomplished. Find out more about FastQC options by running
fastqc --help in the command line. Be wary of spaces when
making edits to your argument. This can be a common source of
errors.
For commandQC, we are running the “fastqc” package and
asking it to output the results (-o) to the specified filepath, which we
defined as fileQC, on the currentFile which is
the “R1” of this sample. commandQC2 is identical but
applied to the “R2” pairedFile.
After quality control, we offer contamination filtering using the BBMap Suite. This suite offers several tools, including alignment and quality trimming. However, we found utility using their BBDuk.sh tool for removal of potential contamination in samples. Attempting to remove potential contaminants from low-quality RNA-seq data can help improve alignment percentage, especially with small libraries.
For the purposes of this use case, the BBDuk.sh tool requires defining your input files, naming your output files and pointing to a reference source for “contaminants”. In this aligner, the contamination removed were human ribosomal RNA, polyT sequences, sheep ribosomal RNA, and sheep satellite DNA.
For more information about BBDuk and the BBMap Suite, visit their guide on the Joint Genome Institute website.
# bbmap filtering
commandbbduk = "bbduk.sh in1="+str(currentFile)+" in2="+str(pairedFile)+" out1=clean1.fastq.gz out2=clean2.fastq.gz ref='/home/ref/humanrRNA.fa.gz','/home/ref/polyT.fa.gz','/home/ref/rRNA.fa','/home/ref/sheep_satellite.fa'"
clean1 = "clean1.fastq.gz"
clean2 = "clean2.fastq.gz"
os.system(commandbbduk)
print(commandbbduk)After commandbbduk runs using os.system(),
clean1 and clean2 are the variables associated
with the “R1” and “R2” samples respectively. Without contamination
filtering, currentFile and pairedFile remain
the appropriate variables, and must be passed into the quality trimming
step. If contamination filtering is not needed, the above code chunk can
be commented out of the script, but be sure to update the variables in
the quality trimming argument.
Quality and adapter trimming is performed using the wrapper package Trim Galore. Trim Galore combines cutadapt and fastqc to provide quality and adapter trimming with simultaneous QC.
#running trim galore
commandTrim="trim_galore --paired --cores # --quality 20 --illumina --fastqc --length 15 "+ str(clean1)+ " "+str(clean2)
os.system(commandTrim) In commandTrim, we have set some important options.
Firstly, this is a --paired read, as such we are passing
two files in, clean1 and clean2 (or
currentFile/pairedFile without contamination
filtering). Passing a single file for a single end read without updating
the “–paired” option can be a source of error, and vice versa. The
--quality and --length parameters are
important for defining the stringency of what reads are included after
trimming. Quality scores are based on Phred 33, the higher the quality
score threshold the more bases get removed. Phred 64 can be used with
the --phred64 option. With length, the longer the length
the more reads get removed. With this aligner, every sample gets a
one-size fits all treatment, as such revisiting the pre-QC for certain
problem samples can help determine if this step needs to be adjusted and
a single re-alignment needs to be done. Quality trimming is performed
first, followed by adapter trimming. Trim galore will try to auto-detect
the adapter if not defined, in this case we defined the
--illumina adapter. You can also define the adapter using
-a , such as in the miRNA alignment use case, using the QIAseq
miRNA adapter. commandTrim is passed to the command
line using os.system().
For more information about the Trim Galore package and possible arguments, visit the GitHub of FelixKreuger of the Babraham Institute.
The most important part of the aligner script is the alignment
process itself. We begin by defining a for loop, which
creates a nested loop, that searches the current directory. As a
refresher, the current directory is filled with several fastq
files, but now it has several output files from Trim Galore and BBDuk.
More importantly, it now has the trimmed output files for the samples
being worked on in the current iteration. Fortunately, cutadapt names
the output in an easy to locate manner, as such using
os.scandir() again, this nested for loop
identifies both output files and attaches them to new and separate
variables to pass into the alignment command. These are “val1” and
“val2” identified by their unique identifier “val_1” and “val_2” in
their file name.
The STAR aligner is the preferred alignment tool in this workflow. The purpose of alignment is to determine where on the genome our reads originated from. The STAR aligner performs this task using a strategy that accounts for spliced alignments, hence it’s name STAR (Spliced Transcripts Alignment to a Reference). In addition, STAR doubles as a quantification tool, providing counts for the number of reads it aligns to a specific gene.
The simple argument for STAR alignment is:
STAR --genomeDir /Genome --readFilesIn File1 File2 --quantMode GeneCounts --outFileNamePrefix Sample1
The additional arguments help with some efficiency. The
--runThreadsN argument sets the number of cores available.
By default, it is 1, thus if your machine or server has several cores
available, then consider defining this option for higher processing
power. The --readFilesCommand zcat is used to decompress
.gz files if they are passed through STAR. This will occur if the
original fastq was compressed, since Trim Galore will preserve
file compression. Thus, if in the initial search, the parameters were
changed from .fastq.gz to just .fastq,
then there is no need for --readFilesCommand zcat. The
--genomeLoad LoadAndKeep argument helps with memory sharing
so that the Genome doesn’t need to be continuously loaded into memory
every time STAR is run. Since this aligner is a loop, the first
iteration allows the Genome to be loaded and each iteration will go
slightly faster.
#aligning via STAR
for search in os.scandir():
if "val_1" in search.name and ".fq" in search.name:
val1= search.name
if "val_2" in search.name and ".fq" in search.name:
val2= search.name
commandAlign="STAR --runThreadN # --genomeDir /home/Genome --readFilesIn "+str(val1)+" "+str(val2)+" --readFilesCommand zcat --quantMode GeneCounts --genomeLoad LoadAndKeep --outFileNamePrefix " +str(prefix)
print(commandAlign)
os.system(commandAlign)The commandAlign argument is a source of many errors,
thus, it is important to check this string prior to saving and executing
your script. After it is compiled, the commandAlign
argument gets passed to the command line with os.system()
and our samples then start being aligned with the STAR aligner.
Before STAR can be used for alignment, it requires indexing of the genome of interest. This must be done before you can do this alignment script. It will be covered again in the set-up section, but you can see an example command for the human genome:
$ STAR --runThreadN 12 --runMode genomeGenerate ---genomeDir ~/Genome --genomeFastaFiles ~/Genome/GRCh38.p14_genomic.fna --sjdbGTFfile ~/Genome/GRCh38.p14_genomic.gtfSTAR is incredibly versatile and robust, and has a number of applications. Refer to the STAR manual by Dr. Alexander Dobin for more information about arguments, options, and use cases.
After the alignment is complete, this script prepares for another
iteration by organizing the current directory. Since there are several
output files, some of which can potentially confound the next search, we
again generate another nested for loop with
os.scandir(). We check each file for its file type, and
using os.rename() it is transferred to the output directory
if it is not one that belongs in the current directory. Recall that we
only had fastq files in this directory when we first called the
script, so we ideally would like there to only be fastq files
when the main for loop begins again.
#organize remaining files
for file in os.scandir():
print(file.path)
if file.path.endswith(".txt"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".out"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".bam"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".zip"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".fq.gz"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".html"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".tab"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith("clean1.fastq.gz"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith("clean2.fastq.gz"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".mate1"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)
if file.path.endswith(".mate2"):
fileName=file.path[file.path.rfind("/")+1:]
newDir=fileDir+"/"+fileName
os.rename(file.path, newDir)Lastly, we do a final quality check using MultiQC. This tool provides
a single aggregate report across several files, therefore it provides an
overview assessment of a sample’s alignment. It must be run
after the organization of the files, as MultiQC works by
reading the contents of a directory and identifying any log files. Thus,
commandmultiQC is simply calling the “multiqc” package to
run on this sample’s output folder and asking it to output (-o) it’s own
files in the QC folder (which we created earlier).
We then pass commandmultiQC to the command line with
os.system(). MultiQC was developed and is maintained by
Phil Ewels, learn more about it here.
After the MultiQC concludes, the for loop repeats,
identifying the next paired samples and repeating the process.
Typically, if the STAR aligner is reached and it does not error out,
then there is a strong chance everything will work smoothly. Errors
after that will typically arise as a result of the data more than the
aligner. Some issues can be large library discrepancy in the R1 and R2,
an issue in the file name, one odd-one-out compressed/uncompressed file,
corrupted fastq, etc.