Highest scored questions

191 votes
4 answers
124k views

Why does the SARS-CoV-2 coronavirus genome end in aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (33 a's)?

The SARS-CoV-2 coronavirus's genome was released, and is now available on GenBank. Looking at it... ...
Rebecca J. Stones
52 votes
9 answers
7k views

What's the most efficient file format for the storage of DNA sequences?

I'd like to learn which format is most commonly used for storing the full human genome sequence (4 letters without a quality score) and why. I assume that storing it in plain-text format would be ...
kenorb • 1,323
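
For readers wondering what "more efficient than plain text" could look like for a 4-letter alphabet, here is a minimal sketch that packs four bases into each byte (2 bits per base). It assumes the sequence contains only uppercase A, C, G and T; the names `pack` and `unpack` are made up for illustration, and this is not the on-disk layout of any particular file format.

```python
# Hypothetical 2-bit packing: A=00, C=01, G=10, T=11, four bases per byte.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an ACGT-only string into bytes, 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads high bits first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

The sequence length has to be stored separately, because the last byte may be padded. Ambiguity codes such as N would need extra handling that this sketch leaves out.
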
50 votes
6 answers
16k views

Feature annotation: RefSeq vs Ensembl vs Gencode, what's the difference?

What are the actual differences between different annotation databases? My lab, for reasons still unknown to me, prefers Ensembl annotations (we're working with transcript/exon expression estimation)...
Plasma • 603
42 votes
4 answers
64k views

What is the difference between FASTA, FASTQ, and SAM file formats?

I'd like to learn the differences between three common formats: FASTA, FASTQ and SAM. How are they different? Are there any benefits of using one over another? Based on Wikipedia pages, I can't ...
kenorb • 1,323
35 votes
4 answers
9k views

Why does the FASTA sequence for coronavirus look like DNA, not RNA?

I'm looking at a genome sequence for 2019-nCoV on NCBI. The FASTA sequence looks like this: ...
jameshfisher
35 votes
2 answers
3k views

Why do some assemblers require an odd-length k-mer for the construction of de Bruijn graphs?

Why do some assemblers like SOAPdenovo2 or Velvet require an odd-length k-mer size for the construction of a de Bruijn graph, while some other assemblers like ABySS are fine with even-length k-mers?
Kamil S Jaron
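
One property that is often brought up in this context (stated here as background, not as the accepted answer to the question) is that an odd-length k-mer can never equal its own reverse complement, because the middle base would have to be its own complement. A small sketch that checks this by brute force; the function names are illustrative only:

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k: int) -> int:
    """Count k-mers that are identical to their own reverse complement."""
    total = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            total += 1
    return total
```

For small k this enumeration is instant; it grows as 4^k, so it is only meant as a check on toy sizes such as k = 2, 3 or 4.
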
34 votes
3 answers
28k views

Uppercase vs lowercase letters in reference genome

I am using a reference genome for mm10 mouse downloaded from NCBI, and would like to understand in greater detail the difference between lowercase and uppercase letters, which make up roughly equal ...
Scott Gigante
28 votes
7 answers
10k views

Read length distribution from FASTA file

I have a single ~10 GB FASTA file generated from an Oxford Nanopore Technologies' MinION run, with >1M reads of mean length ~8 kb. How can I quickly and efficiently calculate the distribution of read ...
Scott Gigante
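
As a baseline for the kind of computation being asked about, here is a minimal pure-Python sketch that streams a FASTA file once and tallies how many reads have each length. It is not tuned for speed and is not taken from the answers; `read_length_counts` is a made-up name, and the input is assumed to be standard FASTA where sequences may wrap over several lines.

```python
from collections import Counter
from typing import Iterable


def read_length_counts(lines: Iterable[str]) -> Counter:
    """Map each read length to the number of reads with that length."""
    counts: Counter = Counter()
    length = None  # None until the first '>' header has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                counts[length] += 1  # close the previous record
            length = 0
        elif length is not None:
            length += len(line)  # sequence may span multiple lines
    if length is not None:
        counts[length] += 1  # close the final record
    return counts
```

A file handle can be passed directly, for example `with open("reads.fasta") as fh: counts = read_length_counts(fh)`, where `reads.fasta` is a placeholder path. Only one line is held in memory at a time, so file size mainly affects run time.
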
28 votes
8 answers
1k views

How to version the code and the data during the analysis?

I am currently looking for a system which will allow me to version both the code and the data in my research. I think my way of analyzing data is not uncommon, and this will be useful for many people ...
Iakov Davydov
27 votes
4 answers
7k views

Why sequence the human genome at 30x coverage?

A bit of a historical question on a number, 30 times coverage, that's become so familiar in the field: why do we sequence the human genome at 30x coverage? My question has two parts: Who came up with ...
719016 • 2,374
26 votes
5 answers
4k views

What happens if a major bug is discovered in a bioinformatic package that has been used in published literature?

Yesterday I was debugging some things in R trying to get a popular Flow Cytometry tool to work on our data. After a few hours of digging into the package I discovered that our data was hitting an edge ...
Nic Barker
24 votes
4 answers
17k views

What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)

When you look at all the genome files available from Ensembl, you are presented with a bunch of options. Which one is the best to use/download? You have a combination of choices. First part options: ...
story • 1,613
24 votes
4 answers
971 views

Tools for simulating Oxford Nanopore reads

Are there any free open source software tools available for simulating Oxford Nanopore reads?
Daniel Standage
23 votes
4 answers
3k views

Are there any rolling hash functions that can hash a DNA sequence and its reverse complement to the same value?

A common bioinformatics task is to decompose a DNA sequence into its constituent k-mers and compute a hash value for each k-mer. Rolling hash functions are an appealing solution for this task, since ...
Daniel Standage
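
To make the task concrete, here is a minimal sketch of one simple strand-independent scheme: keep a rolling 2-bit encoding of each k-mer and of its reverse complement, and report the smaller of the two (the "canonical" code). This is a plain integer encoding rather than a randomized hash, and it is not necessarily what the answers recommend. It assumes the input contains only uppercase A, C, G and T; `canonical_kmer_codes` is a hypothetical name.

```python
from typing import Iterator

# With this encoding the complement of a base code c is simply 3 - c.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def canonical_kmer_codes(seq: str, k: int) -> Iterator[int]:
    """Yield min(forward, reverse-complement) 2-bit code for each k-mer."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    shift = 2 * (k - 1)        # position of the most significant base
    fwd = 0
    rev = 0
    for i, base in enumerate(seq):
        code = ENCODE[base]
        # Forward strand: the new base enters at the low end.
        fwd = ((fwd << 2) | code) & mask
        # Reverse strand: the complemented base enters at the high end.
        rev = (rev >> 2) | ((3 - code) << shift)
        if i >= k - 1:
            yield min(fwd, rev)
```

Each step costs a constant number of bit operations, so a sequence of length n is processed in O(n) regardless of k. A k-mer and its reverse complement always produce the same value because both strands are tracked and the minimum is taken.
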
22 votes
10 answers
7k views

What is the fastest way to calculate the number of unknown nucleotides in FASTA / FASTQ files?

I used to work with publicly available genomic references, where basic statistics are usually available and if they are not, you have to compute them only once so there is no reason to worry about ...
Kamil S Jaron
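
For reference, a straightforward (and certainly not the fastest) way to count unknown nucleotides in a FASTA file is sketched below. It treats both `N` and `n` as unknown and skips header lines; `count_unknown_bases` is a made-up name. FASTQ input would need record-aware parsing so that header and quality lines are not counted, which this sketch does not attempt.

```python
from typing import Iterable


def count_unknown_bases(lines: Iterable[str]) -> int:
    """Count N/n characters on the sequence lines of a FASTA file."""
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue  # header lines can contain 'N' in sequence names
        total += line.count("N") + line.count("n")
    return total
```

It can be fed an open file handle directly, so the whole file never has to be loaded into memory at once.
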
