Highest scored questions
6,549 questions
191 votes, 4 answers, 124k views
Why does the SARS-Cov2 coronavirus genome end in aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (33 a's)?
The SARS-Cov2 coronavirus's genome was released, and is now available on Genbank. Looking at it...
...
52 votes, 9 answers, 7k views
What's the most efficient file format for the storage of DNA sequences?
I'd like to learn which format is most commonly used for storing the full human genome sequence (4 letters without a quality score) and why.
I assume that storing it in plain-text format would be ...
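As a rough illustration of why plain text is wasteful for a four-letter alphabet, here is a minimal sketch of 2-bit packing. The helper names `pack_2bit` and `unpack_2bit` are made up for this example, and the sketch assumes the sequence contains only uppercase A, C, G and T, which real genome assemblies (with N runs and soft-masking) do not satisfy.

```python
# Hypothetical helpers for illustration only; assumes the input is strictly A/C/G/T.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte, first base in the highest two bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking treats every byte alike.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, length: int) -> str:
    """Recover `length` bases; the length has to be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])

seq = "ACGTTGCAAC"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack_2bit(packed, len(seq)) == seq)
```

Each base takes 2 bits instead of 8, so the packed form is about a quarter of the plain-text size before any general-purpose compression.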
50 votes, 6 answers, 16k views
Feature annotation: RefSeq vs Ensembl vs Gencode, what's the difference?
What are the actual differences between different annotation databases?
My lab, for reasons still unknown to me, prefers Ensembl annotations (we're working with transcript/exon expression estimation)...
42 votes, 4 answers, 64k views
What is the difference between FASTA, FASTQ, and SAM file formats?
I'd like to learn the differences between three common formats: FASTA, FASTQ and SAM. How are they different? Are there any benefits of using one over another?
Based on Wikipedia pages, I can't ...
35 votes, 4 answers, 9k views
Why does the FASTA sequence for coronavirus look like DNA, not RNA?
I'm looking at a genome sequence for 2019-nCoV on NCBI. The FASTA sequence looks like this:
...
35 votes, 2 answers, 3k views
Why do some assemblers require an odd-length kmer for the construction of de Bruijn graphs?
Why do some assemblers like SOAPdenovo2 or Velvet require an odd-length k-mer size for the construction of de Bruijn graph, while some other assemblers like ABySS are fine with even-length k-mers?
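One property that is often cited in this context, sketched here as background rather than as a statement about what SOAPdenovo2, Velvet or ABySS do internally: a k-mer of even length can be identical to its own reverse complement, while a k-mer of odd length cannot. The function names below are invented for the example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]

def self_complementary_kmers(k: int) -> list:
    """All k-mers over ACGT that equal their own reverse complement."""
    found = []
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            found.append(kmer)
    return found

for k in (3, 4, 5, 6):
    print(k, len(self_complementary_kmers(k)))
```

For odd k the list is always empty, because the middle base would have to be its own complement. For even k there are 4^(k/2) such k-mers (ACGT is one for k = 4).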
34 votes, 3 answers, 28k views
Uppercase vs lowercase letters in reference genome
I am using a reference genome for mm10 mouse downloaded from NCBI, and would like to understand in greater detail the difference between lowercase and uppercase letters, which make up roughly equal ...
28 votes, 7 answers, 10k views
Read length distribution from FASTA file
I have a single ~10GB FASTA file generated from an Oxford Nanopore Technologies' MinION run, with >1M reads of mean length ~8Kb. How can I quickly and efficiently calculate the distribution of read ...
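A minimal sketch of one way to tabulate read lengths in a single streaming pass, so the whole file never has to sit in memory. This is a plain-Python illustration, not a claim about the quickest approach for a 10 GB file; the function name `read_lengths` is made up for the example.

```python
from collections import Counter
from typing import Iterable

def read_lengths(lines: Iterable[str]) -> Counter:
    """Count how many reads have each length in FASTA-formatted lines."""
    lengths = Counter()
    current = None  # length of the record in progress; None before the first header
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if current is not None:
                lengths[current] += 1  # close the previous record
            current = 0
        elif current is not None:
            current += len(line)  # sequences may be wrapped over several lines
    if current is not None:
        lengths[current] += 1  # close the final record
    return lengths

example = [">read1\n", "ACGT\n", "AC\n", ">read2\n", "ACGTAC\n", ">read3\n", "AC\n"]
print(sorted(read_lengths(example).items()))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly instead of the in-memory list used here.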
28 votes, 8 answers, 1k views
How to version the code and the data during the analysis?
I am currently looking for a system which will allow me to version both the code and the data in my research.
I think my way of analyzing data is not uncommon, and this will be useful for many people ...
27 votes, 4 answers, 7k views
Why sequence the human genome at 30x coverage?
A bit of a historical question on a number, 30 times coverage, that's become so familiar in the field: why do we sequence the human genome at 30x coverage?
My question has two parts:
Who came up with ...
26 votes, 5 answers, 4k views
What happens if a major bug is discovered in a bioinformatic package that has been used in published literature?
Yesterday I was debugging some things in R trying to get a popular Flow Cytometry tool to work on our data. After a few hours of digging into the package I discovered that our data was hitting an edge ...
24 votes, 4 answers, 17k views
What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)
When you look at all the genome files available from Ensembl, you are presented with a bunch of options. Which one is the best to use/download?
You have a combination of choices.
First part options:
...
24 votes, 4 answers, 971 views
Tools for simulating Oxford Nanopore reads
Are there any free open source software tools available for simulating Oxford Nanopore reads?
23 votes, 4 answers, 3k views
Are there any rolling hash functions that can hash a DNA sequence and its reverse complement to the same value?
A common bioinformatics task is to decompose a DNA sequence into its constituent k-mers and compute a hash value for each k-mer. Rolling hash functions are an appealing solution for this task, since ...
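For context, a minimal sketch of one well-known way to get a strand-independent value per k-mer in constant time per base: keep rolling 2-bit encodings of both the forward k-mer and its reverse complement, and take the smaller of the two (the "canonical" k-mer). This is a swapped-in technique for illustration, not necessarily the kind of rolling hash the question has in mind, and the function names are invented here. The resulting integer could then be fed to any integer hash.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # complement of code c is 3 - c

def canonical_kmer_values(seq: str, k: int):
    """Yield one integer per k-mer that is the same for the k-mer and its
    reverse complement, updated in O(1) per base."""
    mask = (1 << (2 * k)) - 1
    shift = 2 * (k - 1)
    fwd = rev = 0
    filled = 0  # number of valid bases currently in the window
    for base in seq:
        c = CODE.get(base)
        if c is None:  # ambiguous base such as N: restart the window
            fwd = rev = filled = 0
            continue
        fwd = ((fwd << 2) | c) & mask          # append the base on the right
        rev = (rev >> 2) | ((3 - c) << shift)  # prepend its complement on the left
        filled += 1
        if filled >= k:
            yield min(fwd, rev)

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

s = "ACGTTAG"
forward = list(canonical_kmer_values(s, 3))
backward = list(canonical_kmer_values(revcomp(s), 3))
print(forward == backward[::-1])
```

Scanning a sequence and scanning its reverse complement give the same values in opposite order, which is the property the question asks for.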
22 votes, 10 answers, 7k views
What is the fastest way to calculate the number of unknown nucleotides in FASTA / FASTQ files?
I used to work with publicly available genomic references, where basic statistics are usually available and if they are not, you have to compute them only once so there is no reason to worry about ...
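As a baseline for comparison, here is a minimal sketch of a straightforward streaming count. It makes no claim to being the fastest option, and `count_unknown` is a made-up name. It assumes unknown nucleotides are written as N or n, and that FASTQ records are exactly four lines each (no wrapped sequences).

```python
from typing import Iterable

def count_unknown(lines: Iterable[str], fastq: bool = False) -> int:
    """Count N/n characters in the sequence lines of FASTA or FASTQ input."""
    total = 0
    for i, line in enumerate(lines):
        if fastq:
            # In a four-line FASTQ record the sequence is the second line.
            is_sequence = i % 4 == 1
        else:
            # In FASTA every line that is not a header is sequence.
            is_sequence = not line.startswith(">")
        if is_sequence:
            total += line.count("N") + line.count("n")
    return total

fasta = [">chrN\n", "ACGNNT\n", "nnA\n"]
fastq = ["@r1 N\n", "ANNT\n", "+\n", "NNNN\n"]
print(count_unknown(fasta), count_unknown(fastq, fastq=True))
```

Header and quality lines are skipped on purpose, since both can legitimately contain the letter N.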