Get chromosome sizes from fasta file (2024)

Get chromosome sizes from fasta file

6

8.6 years ago

rioualen ▴ 750

Hello,

I'm wondering whether there is a program that could calculate chromosome sizes from any fasta file? The idea is to generate a tab file like the one expected in bedtools genomecov for example.

I know there's the fetchChromSizes program from UCSC, but not all genomes are available over there (I need TAIR10, for instance). I've read this topic already.

I would like a tool that can deal with any genome regardless of the database. If it doesn't exist I guess it's possible to just parse fasta files, but I'd be surprised if no one else had done it before!
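For reference, the "just parse the fasta file" fallback mentioned above could look roughly like the sketch below. It uses only the Python standard library; the function names `fasta_lengths` and `write_chrom_sizes` are made up for illustration, and it assumes the sequence name is the first whitespace-delimited token of each header line.

```python
import io


def fasta_lengths(handle):
    """Yield (name, length) for each record in a FASTA text stream."""
    name, length = None, 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the name is the first whitespace-delimited token
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = 0
        elif name is not None:
            # Sequence lines may be wrapped; add up every line of the record
            length += len(line.strip())
    if name is not None:
        yield name, length


def write_chrom_sizes(fasta_path, out_path):
    """Write a two-column, tab-separated name/length file."""
    with open(fasta_path) as fin, open(out_path, "w") as fout:
        for name, length in fasta_lengths(fin):
            fout.write(f"{name}\t{length}\n")


# Tiny in-memory example (hypothetical sequences)
example = io.StringIO(">chr1 test\nACGT\nAC\n>chr2\nGGG\n")
sizes = list(fasta_lengths(example))
print(sizes)
```

On a real genome this reads every base, so an index-based tool will usually be faster for large files.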

Cheers

genome ucsc • 45k views

updated 5 days ago by Dave Carlson ★ 1.9k • written 8.6 years ago by rioualen ▴ 750

1

A quick Google search would help you. For example:

http://www.danielecook.com/generate-fasta-sequence-lengths/

updated 5.3 years ago by Ram 44k • written 8.6 years ago by GouthamAtla 12k

I did look it up on google, but thank you anyway.

8.6 years ago by rioualen ▴ 750

24

8.6 years ago

Matt Shirley 10k

pip install pyfaidx
faidx input.fasta -i chromsizes > sizes.genome

That's what you're looking for if you want to use bedtools genomecov, but you can transform to a BED file as well using -i bed.
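To make the relationship between the two formats concrete, here is a minimal sketch that turns chromsizes lines into whole-chromosome BED intervals. It is not pyfaidx's implementation; `sizes_to_bed` is a made-up helper, and it assumes each interval should span the full sequence using BED's 0-based, half-open coordinates.

```python
def sizes_to_bed(lines):
    """Convert 'name<TAB>length' lines into 'name<TAB>0<TAB>length' BED lines."""
    bed = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        # BED is 0-based and half-open, so a whole sequence is [0, length)
        bed.append(f"{name}\t0\t{int(length)}")
    return bed


# Hypothetical sizes for illustration
bed_lines = sizes_to_bed(["chr1\t1000\n", "chr2\t500\n"])
print("\n".join(bed_lines))
```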

updated 5.3 years ago by Ram 44k • written 8.6 years ago by Matt Shirley 10k

Looks perfect! Thank you.

updated 5.3 years ago by Ram 44k • written 8.6 years ago by rioualen ▴ 750

30

8.6 years ago

rioualen ▴ 750

Just found out faidx is available with samtools so another possibility is:

samtools faidx input.fa
cut -f1,2 input.fa.fai > sizes.genome
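The cut step works because the first two tab-separated columns of a .fai index are the sequence name and its length (the remaining columns describe byte offsets and line layout). A minimal Python equivalent of `cut -f1,2`, assuming the index already exists, might look like this; `sizes_from_fai` is a made-up name and the index lines are hypothetical.

```python
def sizes_from_fai(fai_lines):
    """Yield (name, length) from the lines of a .fai index."""
    for line in fai_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield fields[0], int(fields[1])


# Hypothetical index lines: name, length, offset, bases per line, bytes per line
example_fai = [
    "chr1\t120\t6\t60\t61\n",
    "chr2\t75\t134\t60\t61\n",
]
for name, length in sizes_from_fai(example_fai):
    print(f"{name}\t{length}")
```

With a real index you would pass an open file handle, for example `sizes_from_fai(open("input.fa.fai"))`.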

updated 5.3 years ago by Ram 44k • written 8.6 years ago by rioualen ▴ 750

1

Also, it can be written as a single command:

samtools faidx genome.fa | cut -f1,2 > chromsizes

3.9 years ago by Apex92 ▴ 300

I tried this and it didn't work, because samtools faidx genome.fa writes its output to genome.fa.fai rather than to stdout. The command by rioualen is therefore the one to use.

Perhaps there is some flag that would let you pipe it as a one-line command, as Apex92 tried.

3.0 years ago by Pratik ★ 1.0k

This is the only line that worked for me, although it only produced a .fa.fai file, which was identical to the chromsizes file I was looking for except for the extra columns (i.e. cut -f1,2 had no effect in this line):

samtools faidx input.fa | cut -f1,2 > chromsizes

However, once I had that .fa.fai file I was able to cut out just the columns I needed (chromosome name and size) to produce the .sizes file. (As a correction to the above, the cut command needs to be given an input file and an output file.)

cut -f1,2 Hel_final_2016_new.fa.fai > Hel_final_2016_new.fa.sizes

As a result, you can run this line twice to make it work, since the first run creates the .fai file that the second run needs:

samtools faidx Hel_final_2016_new.fa | cut -f1,2 Hel_final_2016_new.fa.fai > Hel_final_2016_new.fa.sizes

21 months ago by jamesogilvie1 • 0

The pipe is not good practice here: nothing is actually being piped, because cut reads from the .fai file rather than from stdin. Replace it with a semicolon or "&&".
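The same "finish step one, then run step two" ordering can be sketched in Python. This is only an illustration: `index_then_read` is a made-up helper, and the demo uses a small stand-in command that writes a fake index so the example runs without samtools installed.

```python
import os
import subprocess
import sys
import tempfile


def index_then_read(index_cmd, fai_path):
    """Run the indexing command to completion, then read name/length pairs."""
    # check=True raises CalledProcessError on a non-zero exit status, so the
    # index is only read after the first step succeeded (like `&&` in a shell)
    subprocess.run(index_cmd, check=True)
    with open(fai_path) as fai:
        return [tuple(line.split("\t")[:2]) for line in fai if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    fai_path = os.path.join(tmp, "genome.fa.fai")
    # Stand-in for the real indexer: writes one hypothetical .fai line
    script = "import sys; open(sys.argv[1], 'w').write('chr1\\t120\\t6\\t60\\t61\\n')"
    rows = index_then_read([sys.executable, "-c", script, fai_path], fai_path)

print(rows)
```

With samtools on the PATH, the call would presumably be `index_then_read(["samtools", "faidx", "genome.fa"], "genome.fa.fai")`.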

5 days ago by cmdcolin ★ 3.9k

samtools faidx does not work for this purpose.

2.2 years ago by tomas4482 ▴ 420

2

5.3 years ago

gbdias ▴ 150

  • Another good option is to use the faSize command from UCSC tools. It works on any fasta file, not just UCSC genomes.

  • You run it like this: faSize -detailed -tab file.fasta

  • Default behavior is to print to STDOUT, so you can redirect the output to a file like this: faSize -detailed -tab file.fasta > output.txt

  • The output will be a tab delimited file with sequence name on the first column and the length on the second.

  • You can easily install faSize and other tools from UCSC using Bioconda https://anaconda.org/bioconda/ucsc-fasize
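Once you have a two-column file like the one described above, a quick sanity check before handing it to downstream tools can be useful. The sketch below is a minimal example, not part of faSize: `parse_chrom_sizes` is a made-up helper and the sequence names and lengths are hypothetical.

```python
def parse_chrom_sizes(lines):
    """Parse 'name<TAB>length' lines into a dict, rejecting malformed input."""
    sizes = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated columns, got {len(fields)}"
            )
        name, length = fields
        if name in sizes:
            raise ValueError(f"line {lineno}: duplicate sequence name {name!r}")
        sizes[name] = int(length)  # raises ValueError if not an integer
    return sizes


# Hypothetical contents of output.txt
example_output = ["Chr1\t300\n", "Chr2\t200\n", "ChrM\t50\n"]
sizes = parse_chrom_sizes(example_output)
print(sizes["Chr2"])
print(sum(sizes.values()))
```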

5.3 years ago by gbdias ▴ 150

1

5 days ago

alejandrogzi ▴ 140

Hi all,

check this: https://github.com/alejandrogzi/chromsize

Usage

Binary

Usage: chromsize --fasta <FASTA> --output <OUTPUT> [-t <THREADS>]

Arguments:
    -f, --fasta <FASTA>: FASTA file
    -o, --output <OUTPUT>: path to chrom.sizes

Options:
    -t, --threads <THREADS>: number of threads [default: your max ncpus]
    --help: print help
    --version: print version

Built in Rust, it is the simplest and fastest way to get them. It works as a binary and as a library, and is ported to Python. Here is the benchmark against a number of other tools:

[Benchmark figure not included]

5 days ago by alejandrogzi ▴ 140

1

I guessed before checking that this tool was written in Rust. I was right. :)

5 days ago by Dave Carlson ★ 1.9k

5.9 years ago

aleferna ▴ 10

samtools was crashing because I included the HLA's,

HLA-A*01:01:38L 3374
Traceback (most recent call last):
  File "/usr/local/bin/faidx", line 11, in <module>
    load_entry_point('pyfaidx==0.5.2', 'console_scripts', 'faidx')()
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 197, in main
    write_sequence(args)
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 50, in write_sequence
    outfile.write(transform_sequence(args, fasta, name, start, end))
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 120, in transform_sequence
    line_len = fasta.faidx.index[name].lenc
KeyError: 'HLA-A*01'

Finally just did:

from Bio import SeqIO

# Print each record's ID and sequence length, tab-separated
for rec in SeqIO.parse("hg38-Mix.fa", "fasta"):
    print(rec.id + "\t" + str(len(rec.seq)))

updated 5.3 years ago by Ram 44k • written 5.9 years ago by aleferna ▴ 10

faidx was failing because the CLI script expects UCSC-style regions encoded as contig:start-end, where start-end is optional. Since the HLA alt contig names contain : characters, you ran into a parsing issue. If you want to use pyfaidx in the same way as Biopython, you can do:

from pyfaidx import Fasta

# Print each record's name and length, tab-separated
for rec in Fasta("hg38-Mix.fa"):
    print(rec.name + "\t" + str(len(rec)))

This will be much faster if you have a lot of sequences since none of the pyfaidx operations require reading the sequence unless you start slicing strings.
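To illustrate the parsing issue described above, here is a small sketch of why splitting a region string on the first ":" goes wrong for HLA names, together with one possible workaround. This is not pyfaidx's actual code; `naive_name`, `careful_parse` and the regular expression are made up for the example.

```python
import re

# A trailing ":start-end" suffix, with everything before it taken as the name
REGION_SUFFIX = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)$")


def naive_name(region):
    """Take everything before the first ':' as the contig name."""
    return region.split(":")[0]


def careful_parse(region, known_names):
    """Prefer an exact contig-name match; only then try a ':start-end' suffix."""
    if region in known_names:
        return region, None, None
    match = REGION_SUFFIX.match(region)
    if match and match.group("name") in known_names:
        return match.group("name"), int(match.group("start")), int(match.group("end"))
    raise KeyError(region)


names = {"chr1", "HLA-A*01:01:38L"}

# The naive split truncates the HLA name, giving a key that is not in the index
print(naive_name("HLA-A*01:01:38L"))
print(careful_parse("HLA-A*01:01:38L", names))
print(careful_parse("HLA-A*01:01:38L:1-100", names))
print(careful_parse("chr1:5-10", names))
```

The truncated name from the naive split is the same `HLA-A*01` key that appears in the KeyError above.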

5.9 years ago by Matt Shirley 10k
