Aug 03, 2014 Here is an easy-to-use Mac GUI application that converts FASTQ data to FASTA data. We believe this is the only Mac GUI FASTQ to FASTA converter available as of now. You can convert 10GB.
This page describes Bio.SeqIO, the standard Sequence Input/Output interface for BioPython 1.43 and later. For implementation details, see the SeqIO development page.
Python novices might find Peter’s introductory Biopython Workshop useful; it starts by working with sequence files using SeqIO.
There is a whole chapter in the Tutorial (PDF) on Bio.SeqIO, and although there is some overlap it is well worth reading in addition to this wiki page. There is also the API documentation (which you can read online, or from within Python with the help command).
Aims
Bio.SeqIO provides a simple uniform interface to input and output assorted sequence file formats (including multiple sequence alignments), but will only deal with sequences as SeqRecord objects. There is a sister interface, Bio.AlignIO, for working directly with sequence alignment files as Alignment objects.
The design was partly inspired by the simplicity of BioPerl’s SeqIO. In the long term we hope to match BioPerl’s impressive list of supported sequence file formats and multiple alignment formats.
Note that the inclusion of Bio.SeqIO (and Bio.AlignIO) in Biopython does lead to some duplication or choice in how to deal with some file formats. For example, Bio.Nexus will also read sequences from Nexus files - but Bio.Nexus can also do much more, for example reading any phylogenetic trees in a Nexus file.
My vision is that for manipulating sequence data you should try Bio.SeqIO as your first choice. Unless you have some very specific requirements, I hope this should suffice.
File Formats
This table lists the file formats that Bio.SeqIO can read, write and index, with the Biopython version where this was first supported (or git to indicate this is supported in our latest in-development code). The format name is a simple lowercase string. Where possible we use the same name as BioPerl’s SeqIO and EMBOSS.
With Bio.SeqIO you can treat sequence alignment file formats just like any other sequence file, but the new Bio.AlignIO module is designed to work with such alignment files directly. You can also convert a set of SeqRecord objects from any file format into an alignment - provided they are all the same length. Note that when using Bio.SeqIO to write sequences to an alignment file format, all the (gapped) sequences should be the same length.
Sequence Input
The main function is Bio.SeqIO.parse(), which takes a file handle (or filename) and format name, and returns a SeqRecord iterator. This lets you do things like:
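A minimal sketch of such a loop. The file name `example.fasta` and its two records are hypothetical; the file is written to a temporary directory first so that the snippet runs on its own:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical two-record FASTA file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1 first example\nACGTACGTAC\n>seq2 second example\nTTGGCCAA\n")

# parse() yields one SeqRecord per record, in file order.
summary = []
for record in SeqIO.parse(path, "fasta"):
    print(record.id, len(record))
    summary.append((record.id, len(record)))
```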
or using a handle:
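A sketch of the handle-based form, again with a hypothetical `example.fasta` written on the fly. It assumes a current Python 3, so the file is opened in the default text mode rather than with the older `'rU'` flag:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical input file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1 first example\nACGTACGTAC\n>seq2 second example\nTTGGCCAA\n")

ids = []
# The with-statement closes the handle once the loop has finished.
with open(path) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)
        ids.append(record.id)
```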
In the above example, we opened the file using the built-in Python function open. The argument 'rU' means open for reading using universal newlines mode, so you don’t have to worry whether the file uses Unix, Mac or DOS/Windows style newline characters. (On Python 3 a plain 'r' already behaves this way, and the 'U' flag was removed in Python 3.11.) The with statement makes sure that the file is properly closed after reading it. That should all happen automatically if you just use the filename instead.
Note that you must specify the file format explicitly, unlike BioPerl’s SeqIO which can try to guess using the file name extension and/or the file contents. See Explicit is better than implicit (The Zen of Python).
If you had a different type of file, for example a Clustalw alignment file such as opuntia.aln which contains seven sequences, the only difference is you specify 'clustal' instead of 'fasta':
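A sketch of the same loop with the 'clustal' format name. Since opuntia.aln is not reproduced here, a small hypothetical two-sequence alignment is first written with Bio.SeqIO.write() and then read back:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical gapped sequences of equal length, standing in for opuntia.aln.
aligned = [
    SeqRecord(Seq("ACGT-ACGTA"), id="seqA", description=""),
    SeqRecord(Seq("ACGTTACG-A"), id="seqB", description=""),
]
aln_path = os.path.join(tempfile.mkdtemp(), "example.aln")
SeqIO.write(aligned, aln_path, "clustal")

# Reading works exactly as for FASTA; only the format name changes.
clustal_ids = []
for record in SeqIO.parse(aln_path, "clustal"):
    print(record.id, record.seq)
    clustal_ids.append(record.id)
```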
Iterators are great for when you only need the records one by one, in the order found in the file. For some tasks you may need to have random access to the records in any order. In this situation, use the built-in Python list() function to turn the iterator into a list:
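For instance, assuming a small hypothetical FASTA file (created below so the snippet is runnable), the list gives access by position:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical input file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n>seq2\nTTGGCCAA\n>seq3\nGGGCCC\n")

# Load every record into memory; fine for small files.
records = list(SeqIO.parse(path, "fasta"))

print(len(records))    # number of records
print(records[0].id)   # first record
print(records[-1].id)  # last record
```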
Another common task is to index your records by some identifier. For small files we have a function Bio.SeqIO.to_dict() to turn a SeqRecord iterator (or list) into a dictionary (in memory):
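A short sketch, using a hypothetical FASTA file written on the fly:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical input file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n>seq2\nTTGGCCAA\n")

# Build an in-memory dictionary keyed on each record's id.
record_dict = SeqIO.to_dict(SeqIO.parse(path, "fasta"))

print(sorted(record_dict))
print(record_dict["seq2"].seq)
```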
The function Bio.SeqIO.to_dict() will use the record ID as the dictionary key by default, but you can specify any mapping you like with its optional argument, key_function.
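As an illustration of key_function, the sketch below keys each record on the part of its ID after the last "|". The "lab|..." identifiers and the file itself are hypothetical:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical input file with pipe-delimited identifiers.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">lab|seqA first\nACGT\n>lab|seqB second\nTTGG\n")

# key_function receives each SeqRecord and returns the dictionary key to use.
by_short_name = SeqIO.to_dict(
    SeqIO.parse(path, "fasta"),
    key_function=lambda rec: rec.id.split("|")[-1],
)

print(sorted(by_short_name))
```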
For larger files, it isn’t possible to hold everything in memory, so Bio.SeqIO.to_dict is not suitable. Biopython 1.52 onwards includes the Bio.SeqIO.index function for this situation, but you might also consider BioSQL.
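A minimal sketch of Bio.SeqIO.index(), shown on a tiny hypothetical file for brevity. The returned object behaves like a read-only dictionary and parses records on demand rather than loading them all up front:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical input file; a real use case would be a much larger file.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n>seq2\nTTGGCCAA\n")

# index() needs a filename (not a handle) and the format name.
record_index = SeqIO.index(path, "fasta")
indexed_count = len(record_index)
second = record_index["seq2"]  # the record is parsed only when requested
record_index.close()

print(indexed_count, second.id, second.seq)
```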
Biopython 1.45 introduced another function, Bio.SeqIO.read(), which like Bio.SeqIO.parse() will expect a handle and format. It is for use when the handle contains one and only one record, which is returned as a single SeqRecord object. If there are no records, or more than one, then an exception is raised:
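A sketch of both outcomes, using two hypothetical files (one with a single record, one with two). The exception raised in the second case is a ValueError:

```python
import os
import tempfile

from Bio import SeqIO

tmp_dir = tempfile.mkdtemp()

# Hypothetical single-record and two-record files.
one_path = os.path.join(tmp_dir, "single.fasta")
with open(one_path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n")
two_path = os.path.join(tmp_dir, "double.fasta")
with open(two_path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n>seq2\nTTGGCCAA\n")

# Exactly one record: read() returns it directly.
single = SeqIO.read(one_path, "fasta")
print(single.id)

# More than one record: read() raises ValueError.
error_message = None
try:
    SeqIO.read(two_path, "fasta")
except ValueError as err:
    error_message = str(err)
    print("Could not read a single record:", error_message)
```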
For the related situation where you just want the first record (and are happy to ignore any subsequent records), you can use the built-in Python function next:
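For example, with a hypothetical two-record file:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical input file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n>seq2\nTTGGCCAA\n")

# next() pulls only the first record from the iterator; the rest are ignored.
with open(path) as handle:
    first_record = next(SeqIO.parse(handle, "fasta"))

print(first_record.id)
```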
Sequence Output
For writing records to a file use the function Bio.SeqIO.write(), which takes a SeqRecord iterator (or list), output handle (or filename) and format string:
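A sketch of the filename form. The two records and the output name `out.fasta` are made up for illustration:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records built in memory.
records = [
    SeqRecord(Seq("ACGTACGTAC"), id="seq1", description="first example"),
    SeqRecord(Seq("TTGGCCAA"), id="seq2", description="second example"),
]

out_path = os.path.join(tempfile.mkdtemp(), "out.fasta")

# write() returns the number of records written.
count = SeqIO.write(records, out_path, "fasta")
print("Wrote %i records" % count)
```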
or:
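A sketch of the handle form, with the same kind of hypothetical records:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records built in memory.
records = [
    SeqRecord(Seq("ACGTACGTAC"), id="seq1", description="first example"),
    SeqRecord(Seq("TTGGCCAA"), id="seq2", description="second example"),
]

out_path = os.path.join(tempfile.mkdtemp(), "out.fasta")

# Open the output handle yourself; the with-statement closes it afterwards.
with open(out_path, "w") as handle:
    count = SeqIO.write(records, handle, "fasta")

print("Wrote %i records" % count)
```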
There are more examples in the following section on converting between file formats.
Note that if you are writing to an alignment file format, all your sequences must be the same length.
If you supply the sequences as a SeqRecord iterator, then for sequential file formats like Fasta or GenBank, the records can be written one by one. Because only one record is created at a time, very little memory is required. See the example below filtering a set of records.
On the other hand, for interlaced or non-sequential file formats like Clustal, the Bio.SeqIO.write() function will be forced to automatically convert an iterator into a list. This will destroy any potential memory saving from using a generator/iterator approach.
File Format Conversion
Suppose you have a GenBank file which you want to turn into a Fasta file. For example, let’s consider the file cor6_6.gb, which is included in the Biopython unit tests under the GenBank directory.
You could read the file like this, using the Bio.SeqIO.parse() function:
Notice that this file contains six records. Now instead of printing the records, let’s pass the SeqRecord iterator to the Bio.SeqIO.write() function, to turn this GenBank file into a Fasta file:
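A sketch of that parse-then-write pattern. Because cor6_6.gb is not reproduced here, a hypothetical two-record GenBank file is generated first; it assumes a recent Biopython, which needs the `molecule_type` annotation when writing GenBank:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

tmp_dir = tempfile.mkdtemp()
gb_path = os.path.join(tmp_dir, "demo.gb")
fasta_path = os.path.join(tmp_dir, "demo.fasta")

# Hypothetical stand-in for cor6_6.gb (names and sequences are made up).
demo = [
    SeqRecord(Seq("ATGGCCATTGTAATGGGCCGC"), id="DEMO0001.1", name="DEMO0001",
              description="hypothetical record one",
              annotations={"molecule_type": "DNA"}),
    SeqRecord(Seq("ATGAAACGCATTAGCACCACC"), id="DEMO0002.1", name="DEMO0002",
              description="hypothetical record two",
              annotations={"molecule_type": "DNA"}),
]
SeqIO.write(demo, gb_path, "genbank")

# Pass the GenBank record iterator straight to write().
records = SeqIO.parse(gb_path, "genbank")
count = SeqIO.write(records, fasta_path, "fasta")
print("Converted %i records" % count)
```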
Or more concisely using the Bio.SeqIO.convert() function (in Biopython 1.52 or later), just:
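A sketch of the one-call version, again on a hypothetical generated GenBank file rather than cor6_6.gb (and again assuming a recent Biopython for the `molecule_type` annotation):

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

tmp_dir = tempfile.mkdtemp()
gb_path = os.path.join(tmp_dir, "demo.gb")
fasta_path = os.path.join(tmp_dir, "demo.fasta")

# Hypothetical stand-in input file.
demo = [
    SeqRecord(Seq("ATGGCCATTGTAATGGGCCGC"), id="DEMO0001.1", name="DEMO0001",
              description="hypothetical record one",
              annotations={"molecule_type": "DNA"}),
]
SeqIO.write(demo, gb_path, "genbank")

# convert(in_file, in_format, out_file, out_format) returns the record count.
count = SeqIO.convert(gb_path, "genbank", fasta_path, "fasta")
print("Converted %i records" % count)
```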
In this example the GenBank file started like this:
The resulting Fasta file looks like this:
Note that all the Fasta file can store is the identifier, description and sequence.
By changing the format strings, that code could be used to convert between any supported file formats.
Examples
Input/Output Example - Filtering by sequence length
While you may simply want to convert a file (as shown above), a more realistic example is to manipulate or filter the data in some way.
For example, let’s save all the “short” sequences of less than 300 nucleotides to a Fasta file:
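A sketch of the explicit-loop version. The input file and its three records (250, 300 and 120 nucleotides long) are hypothetical and created on the fly:

```python
import os
import tempfile

from Bio import SeqIO

tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "input.fasta")
out_path = os.path.join(tmp_dir, "short_seqs.fasta")

# Hypothetical input: record name -> sequence length.
lengths = {"short1": 250, "long1": 300, "short2": 120}
with open(in_path, "w") as out:
    for name, n in lengths.items():
        out.write(">%s\n%s\n" % (name, "A" * n))

# Collect the records shorter than 300 nucleotides, then write them out.
short_sequences = []
for record in SeqIO.parse(in_path, "fasta"):
    if len(record.seq) < 300:
        short_sequences.append(record)

count = SeqIO.write(short_sequences, out_path, "fasta")
print("Saved %i short sequences" % count)
```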
If you know about list comprehensions then you could have written the above example like this instead:
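A sketch of the list-comprehension form, on the same kind of hypothetical input file:

```python
import os
import tempfile

from Bio import SeqIO

tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "input.fasta")
out_path = os.path.join(tmp_dir, "short_seqs.fasta")

# Hypothetical input: record name -> sequence length.
lengths = {"short1": 250, "long1": 300, "short2": 120}
with open(in_path, "w") as out:
    for name, n in lengths.items():
        out.write(">%s\n%s\n" % (name, "A" * n))

# The whole filtered list is built in memory before writing.
short_sequences = [
    record for record in SeqIO.parse(in_path, "fasta") if len(record.seq) < 300
]
count = SeqIO.write(short_sequences, out_path, "fasta")
print("Saved %i short sequences" % count)
```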
I’m not convinced this is actually any easier to understand, but it is shorter.
However, if you are dealing with very large files with thousands of records, you could benefit from using a generator expression instead. This avoids creating the entire list of desired records in memory:
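A sketch of the generator-expression form; only the brackets change compared with a list comprehension, but records are now produced lazily as Bio.SeqIO.write() consumes them. The input file is hypothetical:

```python
import os
import tempfile

from Bio import SeqIO

tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "input.fasta")
out_path = os.path.join(tmp_dir, "short_seqs.fasta")

# Hypothetical input: record name -> sequence length.
lengths = {"short1": 250, "long1": 300, "short2": 120}
with open(in_path, "w") as out:
    for name, n in lengths.items():
        out.write(">%s\n%s\n" % (name, "A" * n))

# Round brackets make this a generator: nothing is parsed until write() iterates.
short_sequences = (
    record for record in SeqIO.parse(in_path, "fasta") if len(record.seq) < 300
)
count = SeqIO.write(short_sequences, out_path, "fasta")
print("Saved %i short sequences" % count)
```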
Remember that for sequential file formats like Fasta or GenBank, Bio.SeqIO.write() will accept a SeqRecord iterator. The advantage of the code above is that only one record will be in memory at any one time.
However, as explained in the output section, for non-sequential file formats like Clustal, Bio.SeqIO.write() is forced to automatically turn the iterator into a list, so this advantage is lost.
If this is all confusing, don’t panic and just ignore the fancy stuff. For moderately sized datasets, having too many records in memory at once (e.g. in lists) is probably not going to be a problem.
Using the SEGUID checksum
In this example, we’ll use Bio.SeqIO with the Bio.SeqUtils.CheckSum module (in Biopython 1.44 or later). First of all, we’ll just print out the checksum for each sequence in the GenBank file ls_orchid.gbk:
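A sketch of the checksum loop. As ls_orchid.gbk is not included here, it runs on a small hypothetical FASTA file instead, so the printed checksums will not match the orchid data:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

# Hypothetical stand-in input (the original example uses ls_orchid.gbk).
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n>seq2\nTTGGCCAA\n")

checksums = []
for record in SeqIO.parse(path, "fasta"):
    # seguid() works on the Seq object (or a string), not on the SeqRecord.
    checksum = seguid(record.seq)
    print(record.id, checksum)
    checksums.append(checksum)
```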
You should get this output:
Now let’s use the checksum function and Bio.SeqIO.to_dict() to build a SeqRecord dictionary using the SEGUID as the keys. The trick here is to use the Python lambda syntax to create a temporary function to get the SEGUID for each SeqRecord - we can’t use the seguid() function directly as it only works on Seq objects or strings.
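A sketch of the lambda trick, again on a hypothetical FASTA file rather than ls_orchid.gbk:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

# Hypothetical stand-in input file.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as out:
    out.write(">seq1\nACGTACGTAC\n>seq2\nTTGGCCAA\n")

# The lambda adapts seguid() so it can be applied to a SeqRecord.
seguid_dict = SeqIO.to_dict(
    SeqIO.parse(path, "fasta"),
    key_function=lambda rec: seguid(rec.seq),
)

for key, record in seguid_dict.items():
    print(key, record.id)
```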
Giving this output:
Random subsequences
This script will read a GenBank file with a whole mitochondrial genome (e.g. the tobacco mitochondrion, Nicotiana tabacum mitochondrion NC_006581), create 500 records containing random fragments of this genome, and save them as a Fasta file. These subsequences are created using random starting points and a fixed length of 200.
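A sketch of the fragment-sampling step. To stay self-contained it does not read NC_006581; instead it assumes a hypothetical random 5,000 bp sequence as the "genome", and the fragment IDs and descriptions are illustrative choices. With the real data you would obtain `genome` via Bio.SeqIO.read() on the GenBank file:

```python
import os
import random
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in genome (not NC_006581).
genome = SeqRecord(
    Seq("".join(random.choice("ACGT") for _ in range(5000))), id="demo_genome"
)

fragment_length = 200
fragments = []
for i in range(500):
    # Pick a start so that the whole fragment fits inside the genome.
    start = random.randint(0, len(genome) - fragment_length)
    fragments.append(
        SeqRecord(
            genome.seq[start:start + fragment_length],
            id="fragment_%i" % (i + 1),
            description="start=%i" % start,
        )
    )

out_path = os.path.join(tempfile.mkdtemp(), "fragments.fasta")
written = SeqIO.write(fragments, out_path, "fasta")
print("Wrote %i fragments" % written)
```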
That should give something like this as the output file,
Writing to a string
Sometimes you won’t want to write your SeqRecord object(s) to a file, but to a string. For example, you might be preparing output for display as part of a webpage. If you want to write multiple records to a single string, use StringIO to create a string-based handle. The Tutorial (PDF) has an example of this in the SeqIO chapter.
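A minimal sketch of the StringIO approach, assuming Python 3 (where StringIO lives in the io module) and two hypothetical records:

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records built in memory.
records = [
    SeqRecord(Seq("ACGTACGTAC"), id="seq1", description="first example"),
    SeqRecord(Seq("TTGGCCAA"), id="seq2", description="second example"),
]

# StringIO acts as an in-memory text handle.
handle = StringIO()
SeqIO.write(records, handle, "fasta")
fasta_string = handle.getvalue()

print(fasta_string)
```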
For the special case where you want a single record as a string in a given file format, Biopython 1.48 added a new format method:
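For example, with a hypothetical record built in memory:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record.
record = SeqRecord(Seq("ACGTACGTAC"), id="seq1", description="first example")

# format() returns the record as a string in the requested file format.
fasta_text = record.format("fasta")
print(fasta_text)
```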
The format method will take any output format supported by Bio.SeqIO where the file format can be used for a single record (e.g. 'fasta', 'tab' or 'genbank').
Note that we don’t recommend you use this for file output - using Bio.SeqIO.write() is faster and more general.
Help!
If you are having problems with Bio.SeqIO, please join the discussion mailing list (see mailing lists).
If you think you’ve found a bug, please report it on the project’s GitHub page.