Get chromosome sizes from fasta file 6 Entering edit mode 8.6 years ago rioualen ▴ 750 Hello, I'm wondering whether there is a program that could calculate chromosome sizes from any fasta file? The idea is to generate a tab file like the one expected in bedtools genomecov for example. I know there's the fetchChromSize program from UCSC, but not all genomes are available over there (I need TAIR10 for instance). I've read this topic already. I would like a tool that can deal with any genome regardless of the database. If it doesn't exist I guess it's possible to just parse fasta files, but I'd be surprised if no one else had done it before! Cheers genome ucsc • 45k views ADD COMMENT • link updated 5 days ago by Dave Carlson ★ 1.9k • written 8.6 years ago by rioualen ▴ 750 24 Entering edit mode That's what you're looking for if you want to use ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 8.6 years ago by Matt Shirley 10k Entering edit mode Looks perfect! Thank you. ADD REPLY • link updated 5.3 years ago by Ram 44k • written 8.6 years ago by rioualen ▴ 750 30 Entering edit mode 8.6 years ago rioualen ▴ 750 Just found out faidx is available with samtools so another possibility is: ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 8.6 years ago by rioualen ▴ 750 1 Entering edit mode ALSo can be in this format at once: ADD REPLY • link 3.9 years ago by Apex92 ▴ 300 Entering edit mode Tried this. Didn't work because Unless maybe you use some type of flags in the command, maybe you could use pipes to run it as a one-line command as Apex92 tried. ADD REPLY • link 3.0 years ago by Pratik ★ 1.0k Entering edit mode This is the only line that worked for me, except the line only produced a output.fa.fai file which was identical to the chromsizes file i was looking except for extra columns (ie cut -f1,2 didnt work in this line) samtools faidx input.fa | cut -f1,2 > chromsizes However after getting that .fa.fai file i was able to cut out just the columns i needed (chromosome name & size) to produce the .sizes files i needed (also correction to the above the cut command needs to be given an input and output files). cut -f1,2 Hel_final_2016_new.fa.fai > Hel_final_2016_new.fa.sizes As a result you can run this line twice to make it work (since the first time it runs it will create the .fai file that the second use of the line will require): samtools faidx Hel_final_2016_new.fa | cut -f1,2 Hel_final_2016_new.fa.fai > Hel_final_2016_new.fa.sizes ADD REPLY • link 21 months ago by jamesogilvie1 • 0 Entering edit mode the pipe in this case is not good practice here. there is no piping being used. replace with a semicolon, or "&&" ADD REPLY • link 5 days ago by cmdcolin ★ 3.9k Entering edit mode samtools faidx does not work for this purpose. ADD REPLY • link 2.2 years ago by tomas4482 ▴ 420 2 Entering edit mode 5.3 years ago gbdias ▴ 150 Another good option is to use the faSize command from UCSC tools. It works on any fasta file, not just UCSC genomes. You run it like this: Default behavior is to print to STDOUT, so you can redirect the output to a file like this: The output will be a tab delimited file with sequence name on the first column and the length on the second. You can easily install faSize and other tools from UCSC using Bioconda https://anaconda.org/bioconda/ucsc-fasize ADD COMMENT • link 5.3 years ago by gbdias ▴ 150 1 Entering edit mode 5 days ago alejandrogzi ▴ 140 Hi all, check this: https://github.com/alejandrogzi/chromsize build in rust, the simplest and fastest way to get them. Works as a binary, library and is ported to python. Here is the benchmark with a lot of tools: ADD COMMENT • link 5 days ago by alejandrogzi ▴ 140 1 Entering edit mode I guessed before checking that this tool was written in rust. I was right. :) ADD REPLY • link 5 days ago by Dave Carlson ★ 1.9k Entering edit mode 5.9 years ago aleferna ▴ 10 samtools was crashing because I included the HLA's, Finally just did: ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 5.9 years ago by aleferna ▴ 10 Entering edit mode This will be much faster if you have a lot of sequences since none of the ADD REPLY • link 5.9 years ago by Matt Shirley 10k pip install pyfaidxfaidx input.fasta -i chromsizes > sizes.genome
bedtools genomecov
, but you can transform to a BED file as well using -i bed
.samtools faidx input.facut -f1,2 input.fa.fai > sizes.genome
samtools faidx genome.fa | cut -f1,2 > chromsizes
samtools faidx genome.fa
outputs to genome.fa.fai
. Therefore, the command by rioualen should be the one used.faSize -detailed -tab file.fasta
faSize -detailed -tab file.fasta > output.txt
Usage
Binary
Usage: chromsize --fasta <FASTA> --output <OUTPUT> [-t <THREADS>]Arguments: -f, --fasta <FASTA>: FASTA file -o, --output <OUTPUT>: path to chrom.sizesOptions: -t, --threads <THREADS>: number of threads [default: your max ncpus] --help: print help --version: print version
HLA-A*01:01:38L 3374
Traceback (most recent call last): File "/usr/local/bin/faidx", line 11, in <module> load_entry_point('pyfaidx==0.5.2', 'console_scripts', 'faidx')() File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 197, in main write_sequence(args) File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 50, in write_sequence outfile.write(transform_sequence(args, fasta, name, start, end)) File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 120, in transform_sequence line_len = fasta.faidx.index[name].lencKeyError: 'HLA-A*01'
from Bio import SeqIOfor rec in SeqIO.parse("hg38-Mix.fa","fasta"): print rec.id+"\t"+str(len(rec.seq))
faidx
was failing because the cli script expects UCSC-style regions encoded as contig:start-end
, where start-end
is optional. Since the HLA alt contains contain :
characters you ran into a parsing issue. If you want to use pyfaidx
in the same way as biopython you can do:from pyfaidx import Fastafor rec in Fasta("hg38-Mix.fa"): printrec.name + str(len(rec)))
pyfaidx
operations require reading the sequence unless you start slicing strings.
Login before adding your answer.