Fastq To Fasta Gui Converter For Mac

4/13/2020



This page describes Bio.SeqIO, the standard Sequence Input/Output interface for Biopython 1.43 and later. For implementation details, see the SeqIO development page.

Python novices might find Peter’s introductory Biopython Workshop useful; it starts with working with sequence files using SeqIO.

There is a whole chapter in the Tutorial (PDF) on Bio.SeqIO, and although there is some overlap it is well worth reading in addition to this wiki page. There is also the API documentation (which you can read online, or from within Python with the help command).

Aims

Bio.SeqIO provides a simple uniform interface to input and output assorted sequence file formats (including multiple sequence alignments), but will only deal with sequences as SeqRecord objects. There is a sister interface Bio.AlignIO for working directly with sequence alignment files as Alignment objects.

The design was partly inspired by the simplicity of BioPerl’s SeqIO. In the long term we hope to match BioPerl’s impressive list of supported sequence file formats and multiple alignment formats.

Note that the inclusion of Bio.SeqIO (and Bio.AlignIO) in Biopython does lead to some duplication or choice in how to deal with some file formats. For example, Bio.Nexus will also read sequences from Nexus files - but Bio.Nexus can also do much more, for example reading any phylogenetic trees in a Nexus file.

My vision is that for manipulating sequence data you should try Bio.SeqIO as your first choice. Unless you have some very specific requirements, I hope this should suffice.

File Formats

This table lists the file formats that Bio.SeqIO can read, write and index, with the Biopython version where this was first supported (or git to indicate this is supported in our latest in development code). The format name is a simple lowercase string. Where possible we use the same name as BioPerl’s SeqIO and EMBOSS.

Format name	Read	Write	Index	Notes
abi	1.58	No	N/A	Reads the ABI “Sanger” capillary sequence trace files, including the PHRED quality scores for the base calls. This allows ABI to FASTQ conversion. Note each ABI file contains one and only one sequence (so there is no point in indexing the file).
abi-trim	1.71	No	N/A	Same as “abi” but with quality trimming with Mott’s algorithm.
ace	1.47	No	1.52	Reads the contig sequences from an ACE assembly file. Uses Bio.Sequencing.Ace internally
cif-atom	1.73	No	No	Uses Bio.PDB.MMCIFParser to determine the (partial) protein sequence as it appears in the structure based on the atomic coordinates.
cif-seqres	1.73	No	No	Reads a macromolecular Crystallographic Information File (mmCIF) file to determine the complete protein sequence as defined by the `_pdbx_poly_seq_scheme` records.
clustal	1.43	1.43	No	The alignment format of Clustal X and Clustal W.
embl	1.43	1.54	1.52	The EMBL flat file format. Uses Bio.GenBank internally.
fasta	1.43	1.43	1.52	This refers to the input FASTA file format introduced for Bill Pearson’s FASTA tool, where each record starts with a “>” line. Resulting sequences have a generic alphabet by default.
fasta-2line	1.71	1.71	No	FASTA format variant with no line wrapping and exactly two lines per record.
fastq-sanger or fastq	1.50	1.50	1.52	FASTQ files are a bit like FASTA files but also include sequencing qualities. In Biopython, “fastq” (or the alias “fastq-sanger”) refers to Sanger style FASTQ files which encode PHRED qualities using an ASCII offset of 33. See also the incompatible “fastq-solexa” and “fastq-illumina” variants used in early Solexa/Illumina pipelines; Illumina pipeline 1.8 produces Sanger FASTQ.
fastq-solexa	1.50	1.50	1.52	In Biopython, “fastq-solexa” refers to the original Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64. See also what we call the “fastq-illumina” format.
fastq-illumina	1.51	1.51	1.52	In Biopython, “fastq-illumina” refers to early Solexa/Illumina style FASTQ files (from pipeline version 1.3 to 1.7) which encode PHRED qualities using an ASCII offset of 64. For good quality reads, PHRED and Solexa scores are approximately equal, so the “fastq-solexa” and “fastq-illumina” variants are almost equivalent.
gck	1.75	No	No	The native format used by Gene Construction Kit.
genbank or gb	1.43	1.48 / 1.51	1.52	The GenBank or GenPept flat file format. Uses `Bio.GenBank` internally for parsing. Biopython 1.48 to 1.50 wrote basic GenBank files with only minimal annotation, while 1.51 onwards will also write the features table.
ig	1.47	No	1.52	This refers to the IntelliGenetics file format, apparently the same as the MASE alignment format.
imgt	1.56	1.56	1.56	This refers to the IMGT variant of the EMBL plain text file format.
nexus	1.43	1.48	No	The NEXUS multiple alignment format, also known as PAUP format. Uses `Bio.Nexus` internally.
pdb-seqres	1.61	No	No	Reads a Protein Data Bank (PDB) file to determine the complete protein sequence as it appears in the header (no dependency on `Bio.PDB` and NumPy).
pdb-atom	1.61	No	No	Uses `Bio.PDB` to determine the (partial) protein sequence as it appears in the structure based on the atom coordinate section of the file (requires NumPy).
phd	1.46	1.52	1.52	PHD files are output from PHRED, used by PHRAP and CONSED for input. Uses `Bio.Sequencing.Phd` internally.
phylip	1.43	1.43	No	PHYLIP files. Truncates names at 10 characters.
pir	1.48	1.71	1.52	A “FASTA like” format introduced by the National Biomedical Research Foundation (NBRF) for the Protein Information Resource (PIR) database, now part of UniProt.
seqxml	1.58	1.58	No	Simple sequence XML file format.
sff	1.54	1.54	1.54	Standard Flowgram Format (SFF) binary files produced by Roche 454 and IonTorrent/IonProton sequencing machines.
sff-trim	1.54	No	1.54	Standard Flowgram Format applying the trimming listed in the file.
snapgene	1.75	No	No	The native format used by SnapGene.
stockholm	1.43	1.43	No	The Stockholm alignment format is also known as PFAM format.
swiss	1.43	No	1.52	Swiss-Prot aka UniProt format. Uses `Bio.SwissProt` internally. See also the UniProt XML format.
tab	1.48	1.48	1.52	Simple two column tab separated sequence files, where each line holds a record’s identifier and sequence. For example, this is used by Agilent’s eArray software when saving microarray probes in a minimal tab delimited text file.
qual	1.50	1.50	1.52	Qual files are a bit like FASTA files but instead of the sequence, record space separated integer sequencing values as PHRED quality scores. A matched pair of FASTA and QUAL files are often used as an alternative to a single FASTQ file.
uniprot-xml	1.56	No	1.56	UniProt XML format, successor to the plain text Swiss-Prot format.
xdna	1.75	1.75	No	The native format used by Christian Marck’s DNA Strider and Serial Cloner.

With Bio.SeqIO you can treat sequence alignment file formats just like any other sequence file, but the new Bio.AlignIO module is designed to work with such alignment files directly. You can also convert a set of SeqRecord objects from any file format into an alignment - provided they are all the same length. Note that when using Bio.SeqIO to write sequences to an alignment file format, all the (gapped) sequences should be the same length.

Sequence Input

The main function is Bio.SeqIO.parse() which takes a file handle (or filename) and format name, and returns a SeqRecord iterator. This lets you do things like:
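A minimal sketch of the filename form, assuming Biopython is installed. The file name example.fasta and its two records are made up so that the snippet runs on its own:

```python
import os
import tempfile

from Bio import SeqIO

summaries = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical input file, written here only for illustration.
    fasta_path = os.path.join(tmp_dir, "example.fasta")
    with open(fasta_path, "w") as out_handle:
        out_handle.write(">seq1 first record\nACGTACGT\n>seq2 second record\nGGCC\n")

    # Given a filename, Bio.SeqIO opens and closes the file itself.
    for record in SeqIO.parse(fasta_path, "fasta"):
        print(record.id, len(record))
        summaries.append((record.id, len(record)))
```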

or using a handle:
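A sketch of the handle form under the same assumptions (made-up file name and records). It opens the file in plain text mode, which on Python 3 already copes with the different newline conventions:

```python
import os
import tempfile

from Bio import SeqIO

ids_seen = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical input file, written here only for illustration.
    fasta_path = os.path.join(tmp_dir, "example.fasta")
    with open(fasta_path, "w") as out_handle:
        out_handle.write(">seq1\nACGTACGT\n>seq2\nGGCC\n")

    # The with-statement closes the handle once parsing is finished.
    # Plain "r" is used instead of the older 'rU' mode.
    with open(fasta_path, "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            ids_seen.append(record.id)

print(ids_seen)
```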

In the above example, we opened the file using the built-in Python function open. The argument 'rU' means open for reading using universal newline mode - this means you don’t have to worry if the file uses Unix, Mac or DOS/Windows style newline characters. (On Python 3, text mode handles all of these by default, and the 'U' flag was removed in Python 3.11, so plain 'r' is enough there.) The with-statement makes sure that the file is properly closed after reading it. That should all happen automatically if you just use the filename instead.

Note that you must specify the file format explicitly, unlike BioPerl’s SeqIO which can try to guess using the file name extension and/or the file contents. See Explicit is better than implicit (The Zen of Python).

If you had a different type of file, for example a Clustalw alignment file such as opuntia.aln which contains seven sequences, the only difference is you specify 'clustal' instead of 'fasta':
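A rough sketch of that change. Since opuntia.aln is not reproduced here, the snippet first builds a small Clustal-format text in memory from three made-up, equal-length records (an assumption for illustration only) and then parses it back:

```python
import io

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up gapped sequences of equal length, standing in for a real alignment.
aligned = [
    SeqRecord(Seq("ACGT-ACGTA"), id="alpha", description=""),
    SeqRecord(Seq("ACGTTACGTA"), id="beta", description=""),
    SeqRecord(Seq("AC-TTACG-A"), id="gamma", description=""),
]
buffer = io.StringIO()
SeqIO.write(aligned, buffer, "clustal")
buffer.seek(0)

# Only the format name differs from the FASTA case.
clustal_ids = [record.id for record in SeqIO.parse(buffer, "clustal")]
print(clustal_ids)
```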

Iterators are great for when you only need the records one by one, in the order found in the file. For some tasks you may need to have random access to the records in any order. In this situation, use the built-in Python list() function to turn the iterator into a list:
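A small sketch, using an in-memory handle with made-up records in place of a real file:

```python
import io

from Bio import SeqIO

# io.StringIO stands in for an open file handle; the records are invented.
fasta_text = ">seq1\nACGT\n>seq2\nGGCCAA\n>seq3\nTT\n"
records = list(SeqIO.parse(io.StringIO(fasta_text), "fasta"))

print(len(records))    # number of records
print(records[0].id)   # first record
print(records[-1].id)  # last record
```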

Another common task is to index your records by some identifier. For small files we have a function Bio.SeqIO.to_dict() to turn a SeqRecord iterator (or list) into a dictionary (in memory):

The function Bio.SeqIO.to_dict() will use the record ID as the dictionary key by default, but you can specify any mapping you like with its optional argument, key_function.
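A sketch of both uses. The identifiers below are made up, and the key function that keeps only the middle field of a pipe-separated ID is just one possible mapping:

```python
import io

from Bio import SeqIO

# Invented records whose IDs have the form db|accession|name.
fasta_text = ">db|ACC001|toy_one\nACGT\n>db|ACC002|toy_two\nGGCC\n"

# Default: the record ID is the dictionary key.
by_id = SeqIO.to_dict(SeqIO.parse(io.StringIO(fasta_text), "fasta"))

# Custom mapping: use only the accession part of the ID as the key.
by_accession = SeqIO.to_dict(
    SeqIO.parse(io.StringIO(fasta_text), "fasta"),
    key_function=lambda rec: rec.id.split("|")[1],
)

print(sorted(by_id))
print(sorted(by_accession))
```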

For larger files, it isn’t possible to hold everything in memory, so Bio.SeqIO.to_dict is not suitable. Biopython 1.52 onwards includes the Bio.SeqIO.index function for this situation, but you might also consider BioSQL.
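A minimal sketch of Bio.SeqIO.index on a deliberately tiny, made-up file (a real use case would involve a much larger one). The index needs a file on disk, so a temporary file is written first:

```python
import os
import tempfile

from Bio import SeqIO

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical input file, written here only for illustration.
    fasta_path = os.path.join(tmp_dir, "example.fasta")
    with open(fasta_path, "w") as out_handle:
        out_handle.write(">seq1\nACGTACGT\n>seq2\nGGCC\n")

    # Records are looked up by ID and parsed on demand.
    record_index = SeqIO.index(fasta_path, "fasta")
    indexed_count = len(record_index)
    seq2_text = str(record_index["seq2"].seq)
    record_index.close()  # release the underlying file handle

print(indexed_count, seq2_text)
```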

Biopython 1.45 introduced another function, Bio.SeqIO.read(), which like Bio.SeqIO.parse() will expect a handle and format. It is for use when the handle contains one and only one record, which is returned as a single SeqRecord object. If there are no records, or more than one, then an exception is raised:
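A sketch of both outcomes with made-up in-memory data. It assumes the exception is a ValueError, which is what current Biopython releases raise:

```python
import io

from Bio import SeqIO

# Exactly one record: read() returns it directly.
single = SeqIO.read(io.StringIO(">only\nACGT\n"), "fasta")
print(single.id, single.seq)

# Two records: read() refuses rather than silently picking one.
error_message = None
try:
    SeqIO.read(io.StringIO(">a\nAC\n>b\nGT\n"), "fasta")
except ValueError as err:
    error_message = str(err)
print(error_message)
```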

For the related situation where you just want the first record (and are happy to ignore any subsequent records), you can use the built-in Python function next:
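A short sketch with made-up records in an in-memory handle:

```python
import io

from Bio import SeqIO

fasta_text = ">seq1\nACGT\n>seq2\nGGCC\n"

# next() pulls just the first record from the iterator; the rest are never parsed.
first_record = next(SeqIO.parse(io.StringIO(fasta_text), "fasta"))
print(first_record.id)
```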

Sequence Output

For writing records to a file use the function Bio.SeqIO.write(), which takes a SeqRecord iterator (or list), output handle (or filename) and format string:
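A sketch of the filename form. The records and the output name example.fasta are invented for illustration:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up records to write.
records = [
    SeqRecord(Seq("ACGTACGT"), id="seq1", description="first made-up record"),
    SeqRecord(Seq("GGCC"), id="seq2", description="second made-up record"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example.fasta")
    # write() returns the number of records written.
    written_count = SeqIO.write(records, out_path, "fasta")
    with open(out_path) as handle:
        written_text = handle.read()

print(written_count)
```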

or:
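A sketch of the handle form, again with invented records and a hypothetical output name:

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

records = [
    SeqRecord(Seq("ACGTACGT"), id="seq1", description=""),
    SeqRecord(Seq("GGCC"), id="seq2", description=""),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example.fasta")
    # Here the caller opens the handle and is responsible for closing it.
    with open(out_path, "w") as out_handle:
        handle_count = SeqIO.write(records, out_handle, "fasta")
    with open(out_path) as handle:
        handle_text = handle.read()

print(handle_count)
```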

There are more examples in the following section on converting between file formats.

Note that if you are writing to an alignment file format, all your sequences must be the same length.

If you supply the sequences as a SeqRecord iterator, then for sequential file formats like Fasta or GenBank, the records can be written one by one. Because only one record is created at a time, very little memory is required. See the example below filtering a set of records.

On the other hand, for interlaced or non-sequential file formats like Clustal, the Bio.SeqIO.write() function will be forced to automatically convert an iterator into a list. This will destroy any potential memory saving from using a generator/iterator approach.

File Format Conversion

Suppose you have a GenBank file which you want to turn into a Fasta file. For example, let’s consider the file cor6_6.gb which is included in the Biopython unit tests under the GenBank directory.

You could read the file like this, using the Bio.SeqIO.parse() function:

Notice that this file contains six records. Now instead of printing the records, let’s pass the SeqRecord iterator to the Bio.SeqIO.write() function, to turn this GenBank file into a Fasta file:
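The same parse-then-write pattern can be sketched on its own. Because cor6_6.gb is not reproduced here, this version assumes a tiny made-up FASTQ input instead of a GenBank file; only the format strings differ:

```python
import io

from Bio import SeqIO

# Two invented reads in Sanger FASTQ format, standing in for the input file.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"

fasta_out = io.StringIO()  # stand-in for the output file
records = SeqIO.parse(io.StringIO(fastq_text), "fastq")
parsed_count = SeqIO.write(records, fasta_out, "fasta")

print(parsed_count)
print(fasta_out.getvalue())
```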

Or more concisely using the Bio.SeqIO.convert() function (in Biopython 1.52 or later), just:
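A sketch of the one-call form, again assuming a made-up FASTQ input rather than cor6_6.gb so that it runs without external files:

```python
import io

from Bio import SeqIO

# Two invented reads in Sanger FASTQ format.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"

converted_out = io.StringIO()
# Arguments: input handle or filename, input format, output handle or filename, output format.
converted_count = SeqIO.convert(io.StringIO(fastq_text), "fastq", converted_out, "fasta")

print(converted_count)
```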

In this example the GenBank file started like this:

The resulting Fasta file looks like this:

Note that all the Fasta file can store is the identifier, description and sequence.

By changing the format strings, that code could be used to convert between any supported file formats.

Examples

Input/Output Example - Filtering by sequence length

While you may simply want to convert a file (as shown above), a more realistic example is to manipulate or filter the data in some way.

For example, let’s save all the “short” sequences of less than 300 nucleotides to a Fasta file:
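A sketch of that filter written as an explicit loop. The three input records are generated on the spot, and in-memory handles stand in for the input and output files:

```python
import io

from Bio import SeqIO

# Invented input: two records under 300 nucleotides and one over.
fasta_text = (
    ">short1\n" + "A" * 120 + "\n"
    ">long1\n" + "C" * 450 + "\n"
    ">short2\n" + "G" * 299 + "\n"
)

short_sequences = []  # build up a list of the records to keep
for record in SeqIO.parse(io.StringIO(fasta_text), "fasta"):
    if len(record.seq) < 300:
        short_sequences.append(record)

short_out = io.StringIO()  # stand-in for the output Fasta file
loop_count = SeqIO.write(short_sequences, short_out, "fasta")
print("Found %i short sequences" % loop_count)
```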

If you know about list comprehensions then you could have written the above example like this instead:
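A sketch of the list comprehension version, using the same kind of made-up input:

```python
import io

from Bio import SeqIO

# Invented input: two records under 300 nucleotides and one over.
fasta_text = (
    ">short1\n" + "A" * 120 + "\n"
    ">long1\n" + "C" * 450 + "\n"
    ">short2\n" + "G" * 299 + "\n"
)

# The whole filtered list is built in memory before writing.
short_list = [
    record
    for record in SeqIO.parse(io.StringIO(fasta_text), "fasta")
    if len(record.seq) < 300
]

list_out = io.StringIO()  # stand-in for the output Fasta file
list_count = SeqIO.write(short_list, list_out, "fasta")
print(list_count)
```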

I’m not convinced this is actually any easier to understand, but it is shorter.

However, if you are dealing with very large files with thousands of records, you could benefit from using a generator expression instead. This avoids creating the entire list of desired records in memory:
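A sketch of the generator expression version on the same kind of made-up input. The only syntactic change from a list comprehension is round brackets instead of square ones:

```python
import io

from Bio import SeqIO

# Invented input: two records under 300 nucleotides and one over.
fasta_text = (
    ">short1\n" + "A" * 120 + "\n"
    ">long1\n" + "C" * 450 + "\n"
    ">short2\n" + "G" * 299 + "\n"
)

# Nothing is parsed yet; records are produced one at a time as write() asks for them.
short_generator = (
    record
    for record in SeqIO.parse(io.StringIO(fasta_text), "fasta")
    if len(record.seq) < 300
)

generator_out = io.StringIO()  # stand-in for the output Fasta file
generator_count = SeqIO.write(short_generator, generator_out, "fasta")
print(generator_count)
```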

Remember that for sequential file formats like Fasta or GenBank, Bio.SeqIO.write() will accept a SeqRecord iterator. The advantage of the code above is that only one record will be in memory at any one time.

However, as explained in the output section, for non-sequential file formats like Clustal, Bio.SeqIO.write() is forced to automatically turn the iterator into a list, so this advantage is lost.

If this is all confusing, don’t panic and just ignore the fancy stuff. For moderately sized datasets having too many records in memory at once (e.g. in lists) is probably not going to be a problem.

Using the SEGUID checksum

In this example, we’ll use Bio.SeqIO with the Bio.SeqUtils.CheckSum module (in Biopython 1.44 or later). First of all, we’ll just print out the checksum for each sequence in the GenBank file ls_orchid.gbk:
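A sketch of the checksum loop. Since ls_orchid.gbk is not reproduced here, it assumes two made-up FASTA records in memory, so the checksums it prints are not the ones for the orchid file:

```python
import io

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

# Invented records standing in for the GenBank file.
fasta_text = ">toy1\nACGTACGTAC\n>toy2\nGGCCGGCCAA\n"

checksums = {}
for record in SeqIO.parse(io.StringIO(fasta_text), "fasta"):
    # seguid() takes the sequence (a Seq object or string), not the SeqRecord.
    checksums[record.id] = seguid(record.seq)
    print(record.id, checksums[record.id])
```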

You should get this output:

Now let’s use the checksum function and Bio.SeqIO.to_dict() to build a SeqRecord dictionary using the SEGUID as the keys. The trick here is to use the Python lambda syntax to create a temporary function to get the SEGUID for each SeqRecord - we can’t use the seguid() function directly as it only works on Seq objects or strings.
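A sketch of the lambda trick, again assuming two made-up records rather than ls_orchid.gbk. The sequences must differ, because duplicate keys make to_dict() raise an error:

```python
import io

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

# Invented records with distinct sequences.
fasta_text = ">toy1\nACGTACGTAC\n>toy2\nGGCCGGCCAA\n"

# The lambda adapts seguid() so that it can be applied to a SeqRecord.
seguid_dict = SeqIO.to_dict(
    SeqIO.parse(io.StringIO(fasta_text), "fasta"),
    key_function=lambda rec: seguid(rec.seq),
)

# Look a record up by the checksum of its sequence.
lookup_key = seguid("GGCCGGCCAA")
print(seguid_dict[lookup_key].id)
```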

Giving this output:

Random subsequences

This script will read a GenBank file with a whole mitochondrial genome (e.g. the tobacco mitochondrion, Nicotiana tabacum mitochondrion NC_006581), create 500 records containing random fragments of this genome, and save them as a Fasta file. These subsequences are created using random starting points and a fixed length of 200.
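A sketch of the sampling step only. It assumes a short randomly generated sequence in place of the real genome record (NC_006581 is not downloaded here), a fixed random seed for repeatability, and invented naming for the fragment records:

```python
import io
import random

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

rng = random.Random(0)  # fixed seed, an assumption made for repeatability

# Made-up 2000 bp "genome"; a real script would use SeqIO.read() on the GenBank file.
genome = SeqRecord(
    Seq("".join(rng.choice("ACGT") for _ in range(2000))),
    id="toy_genome",
    description="made-up sequence",
)

fragment_length = 200
fragments = []
for i in range(500):
    # randint is inclusive, so the last valid start leaves exactly one full fragment.
    start = rng.randint(0, len(genome) - fragment_length)
    fragment_seq = genome.seq[start:start + fragment_length]
    fragments.append(
        SeqRecord(
            fragment_seq,
            id="fragment_%i" % (i + 1),
            description="%s start=%i" % (genome.id, start),
        )
    )

fragments_out = io.StringIO()  # stand-in for the output Fasta file
fragment_count = SeqIO.write(fragments, fragments_out, "fasta")
print(fragment_count)
```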

That should give something like this as the output file,

Writing to a string

Sometimes you won’t want to write your SeqRecord object(s) to a file, but to a string. For example, you might be preparing output for display as part of a webpage. If you want to write multiple records to a single string, use StringIO to create a string-based handle. The Tutorial (PDF) has an example of this in the SeqIO chapter.
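A minimal sketch of the StringIO approach with two made-up records (this is not the Tutorial's own example):

```python
import io

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

records = [
    SeqRecord(Seq("ACGTACGT"), id="seq1", description=""),
    SeqRecord(Seq("GGCC"), id="seq2", description=""),
]

# The StringIO object behaves like an output file handle.
string_handle = io.StringIO()
SeqIO.write(records, string_handle, "fasta")
fasta_string = string_handle.getvalue()

print(fasta_string)
```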

For the special case where you want a single record as a string in a given file format, Biopython 1.48 added a new format method:
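A short sketch of the format method on a made-up record:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("ACGTACGT"), id="seq1", description="made-up record")

# Returns the record as a string in the requested format.
fasta_text = record.format("fasta")
tab_text = record.format("tab")

print(fasta_text)
print(tab_text)
```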

The format method will take any output format supported by Bio.SeqIO where the file format can be used for a single record (e.g. 'fasta', 'tab' or 'genbank').

Note that we don’t recommend you use this for file output - using Bio.SeqIO.write() is faster and more general.

Help!

If you are having problems with Bio.SeqIO, please join the discussion mailing list (see mailing lists).

If you think you’ve found a bug, please report it on the project’s GitHub page.
