File name:
BioPython_Tutorial.pdf
Development tool:
File size: 851 KB
Downloads: 0
Upload date: 2019-08-18
Description:
4.3.1 Specifying the dictionary keys
4.3.2 Indexing a dictionary using the SEGUID checksum
4.4 Writing Sequence Files
4.4.1 Converting between sequence file formats
4.4.2 Converting a file of sequences to their reverse complements
4.4.3 Getting your SeqRecord objects as formatted strings
5 Sequence Alignment Input/Output
5.1 Parsing or Reading Sequence Alignments
5.1.1 Single alignments
5.1.2 Multiple alignments
5.1.3 Ambiguous Alignments
5.2 Writing Alignments
5.2.1 Converting between sequence alignment file formats
5.2.2 Getting your Alignment objects as formatted strings
6 BLAST
6.1 Running Blast locally
6.2 Running Blast over the Internet
6.3 Saving BLAST output
6.4 Parsing BLAST output
6.5 The BLAST record class
6.6 Deprecated BLAST parsers
6.6.1 Parsing plain-text BLAST output
6.6.2 Parsing a file full of BLAST runs
6.6.3 Finding a bad record somewhere in a huge file
6.7 Dealing with PSI-BLAST
6.8 Dealing with RPS-BLAST
7 Accessing NCBI's Entrez databases
7.1 Entrez Guidelines
7.2 EInfo: Obtaining information about the Entrez databases
7.3 ESearch: Searching the Entrez databases
7.4 EPost: Uploading a list of identifiers
7.5 ESummary: Retrieving summaries from primary IDs
7.6 EFetch: Downloading full records from Entrez
7.7 ELink
7.8 EGQuery: Obtaining counts for search terms
7.9 ESpell: Obtaining spelling suggestions
7.10 Specialized parsers
7.10.1 Parsing Medline records
7.11 Examples
7.11.1 PubMed and Medline
7.11.2 Searching, downloading, and parsing Entrez Nucleotide records
7.11.3 Searching, downloading, and parsing GenBank records
7.11.4 Finding the lineage of an organism
7.12 Using the history and WebEnv
7.12.1 Searching for and downloading sequences using the history
7.12.2 Searching for and downloading abstracts using the history
8 Swiss-Prot, Prosite, Prodoc, and ExPASy
8.1 Bio.SwissProt: Parsing Swiss-Prot files
8.1.1 Parsing Swiss-Prot records
8.1.2 Parsing the Swiss-Prot keyword and category list
8.2 Bio.Prosite: Parsing Prosite records
8.3 Bio.Prosite.Prodoc: Parsing Prodoc records
8.4 Bio.ExPASy: Accessing the ExPASy server
8.4.1 Retrieving a Swiss-Prot record
8.4.2 Searching Swiss-Prot
8.4.3 Retrieving Prosite and Prodoc records
9 Going 3D: The PDB module
9.1 Structure representation
9.1.1 Structure
9.1.2 Model
9.1.3 Chain
9.1.4 Residue
9.1.5 Atom
9.2 Disorder
9.2.1 General approach
9.2.2 Disordered atoms
9.2.3 Disordered residues
9.3 Hetero residues
9.3.1 Associated problems
9.3.2 Water residues
9.3.3 Other hetero residues
9.4 Some random usage examples
9.5 Common problems in PDB files
9.5.1 Examples
9.5.2 Automatic correction
9.5.3 Fatal errors
9.6 Other features
10 Bio.PopGen: Population genetics
10.1 GenePop
10.2 Coalescent simulation
10.2.1 Creating scenarios
10.2.2 Running SIMCOAL2
10.3 Other applications
10.3.1 FDist: Detecting selection and molecular adaptation
10.4 Future Developments
11 Supervised learning methods
11.1 The Logistic Regression Model
11.1.1 Background and Purpose
11.1.2 Training the logistic regression model
11.1.3 Using the logistic regression model for classification
11.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
11.2 k-Nearest Neighbors
11.2.1 Background and purpose
11.2.2 Initializing a k-nearest neighbors model
11.2.3 Using a k-nearest neighbors model for classification
11.3 Naive Bayes
11.4 Maximum Entropy
11.5 Markov models
12 Cookbook - Cool things to do with it
12.1 Sequence parsing plus simple plots
12.1.1 Histogram of sequence lengths
12.1.2 Plot of sequence GC%
12.1.3 Nucleotide dot plots
12.2 Dealing with alignments
12.2.1 Clustalw
12.2.2 Calculating summary information
12.2.3 Calculating a quick consensus sequence
12.2.4 Position Specific Score Matrices
12.2.5 Information Content
12.2.6 Translating between Alignment formats
12.3 Substitution matrices
12.3.1 Using common substitution matrices
12.3.2 Creating your own substitution matrix from an alignment
12.4 BioSQL - storing sequences in a relational database
12.5 InterPro
13 Advanced
13.1 The SeqRecord and SeqFeature classes
13.1.1 Sequence ids and Descriptions - dealing with SeqRecords
13.1.2 Features and Annotations - SeqFeatures
13.2 Regression Testing Framework
13.2.1 Writing a Regression Test
13.3 Parser Design
13.4 Substitution Matrices
13.4.1 SubsMat
13.4.2 FreqTable
14 Where to go from here - contributing to Biopython
14.1 Maintaining a distribution for a platform
14.2 Bug Reports and Feature Requests
14.3 Contributing Code
15 Appendix: Useful stuff about Python
15.1 What the heck is a handle?
15.1.1 Creating a handle from a string
Chapter 1
Introduction
1.1 What is Biopython?
The Biopython Project is an international association of developers of freely available Python
(http://www.python.org) tools for computational molecular biology. The web site http://www.biopython.org
provides an online resource for modules, scripts, and web links for developers of Python-based software
for life science research.
Basically, we just like to program in Python and want to make it as easy as possible to use Python for
bioinformatics by creating high-quality, reusable modules and scripts.
1.1.1 What can I find in the Biopython package
The main Biopython releases have lots of functionality, including:
• The ability to parse bioinformatics files into Python utilizable data structures, including support for
the following formats:
– Blast output, both from standalone and WWW Blast
– Clustalw
– FASTA
– GenBank
– PubMed and Medline
– ExPASy files, like Enzyme, Prodoc and Prosite
– SCOP, including 'dom' and 'lin' files
– UniGene
– SwissProt
• Files in the supported formats can be iterated over record by record or indexed and accessed via a
Dictionary interface.
• Code to deal with popular on-line bioinformatics destinations such as:
– NCBI: Blast, Entrez and PubMed services
– ExPASy: Prodoc and Prosite entries
• Interfaces to common bioinformatics programs such as:
– Standalone Blast from NCBI
– Clustalw alignment program
• A standard sequence class that deals with sequences, ids on sequences, and sequence features.
• Tools for performing common operations on sequences, such as translation, transcription and weight
calculations.
• Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector
Machines.
• Code for dealing with alignments, including a standard way to create and deal with substitution
matrices.
• Code making it easy to split up parallelizable tasks into separate processes.
• GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
• Extensive documentation and help with using the modules, including this file, on-line wiki
documentation, the web site, and the mailing list.
• Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava
projects.
We hope this gives you plenty of reasons to download and start using Biopython!
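The difference between iterating over records and reaching them through a dictionary interface can be sketched in plain Python. This is not Biopython's parser: `read_fasta`, `_make_record` and the two example records are hypothetical, made up for illustration, and the sketch assumes Python 3 and well-formed FASTA text.

```python
import io

# Hypothetical two-record FASTA text, invented for this sketch.
FASTA_TEXT = """>seq1 first example record
AGTACACTGGT
>seq2 second example record
TTGACA
CCGT
"""


def _make_record(header, chunks):
    """Split a header into identifier and description, and join sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def read_fasta(handle):
    """Yield (identifier, description, sequence) tuples one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Record-by-record iteration: only one record is held at a time.
for identifier, description, sequence in read_fasta(io.StringIO(FASTA_TEXT)):
    print(identifier, len(sequence))

# Dictionary-style access: load everything once, then look up by identifier.
records_by_id = {rec[0]: rec for rec in read_fasta(io.StringIO(FASTA_TEXT))}
print(records_by_id["seq2"][2])
```

Iteration suits large files that are processed once from start to finish, while the dictionary form trades memory for random access by identifier.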
1.2 Installing Biopython
All of the installation information for Biopython was separated from this document to make it easier to keep
updated.
The short version is: go to our downloads page (http://biopython.org/wiki/download), download and
install the listed dependencies, then download and install Biopython. For Windows we provide pre-compiled
click-and-run installers, while for Unix and other operating systems you must install from source as described
in the included README file. This is usually as simple as the standard command:
sudo python setup.py install
The longer version of our installation instructions covers installation of Python, Biopython dependencies
and Biopython itself. It is available in PDF (http://biopython.org/dist/docs/install/Installation.pdf)
and HTML formats (http://biopython.org/dist/docs/install/Installation.html).
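After installing, a quick check from Python can confirm that the package is importable. This is a minimal sketch, assuming Python 3. `biopython_version` is a hypothetical helper name, and the code reads the `Bio.__version__` attribute, falling back to "unknown" where a release does not define it.

```python
import importlib
import importlib.util


def biopython_version():
    """Return the installed Biopython version string, or None if it is not installed."""
    if importlib.util.find_spec("Bio") is None:
        return None
    bio = importlib.import_module("Bio")
    # Fall back to "unknown" in case a release does not define __version__.
    return getattr(bio, "__version__", "unknown")


version = biopython_version()
if version is None:
    print("Biopython is not installed (the 'Bio' package was not found).")
else:
    print("Biopython version:", version)
```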
1.3 FAQ
1. Which "Numerical Python" do I need?
For Biopython 1.48 or earlier, you need the old Numeric module. For Biopython 1.49 onwards, you
need the newer NumPy instead. Both Numeric and NumPy can be installed on the same machine fine.
See also http://numpy.scipy.org/
2. Why is the Seq object missing the (back) transcription and translation methods described in this Tutorial?
You need Biopython 1.49 or later. Alternatively, use the Bio.Seq module functions described in
Section 3.11.
3. Why doesn't Bio.SeqIO work? It imports fine but there is no parse function etc.
You need Biopython 1.43 or later. Older versions did contain some related code under the Bio.SeqIO
name which has since been deprecated, and this is why the import "works".
4. Why doesn't Bio.SeqIO.read() work? The module imports fine but there is no read function!
You need Biopython 1.45 or later.
5. Why isn't Bio.AlignIO present? The module import fails!
You need Biopython 1.46 or later.
6. What file formats do Bio.SeqIO and Bio.AlignIO read and write?
See http://biopython.org/wiki/seqio and http://biopython.org/wiki/alignio on the wiki for
the latest listing.
7. Why don't the Bio.SeqIO and Bio.AlignIO input functions let me provide a sequence alphabet?
You need Biopython 1.49 or later.
8. Why doesn't str(...) give me the full sequence of a Seq object?
You need Biopython 1.45 or later. Alternatively, rather than str(my_seq), use my_seq.tostring()
(which will also work on recent versions of Biopython).
9. Why doesn't Bio.Blast work with the latest plain text NCBI blast output?
The NCBI keep tweaking the plain text output from the BLAST tools, and keeping our parser up to
date is an ongoing struggle. We recommend you use the XML output instead, which is designed to be
read by a computer program.
10. Why doesn't Bio.Entrez.read() work? The module imports fine but there is no read function!
You need Biopython 1.46 or later.
11. Why doesn't Bio.PDB.MMCIFParser work? I see an import error about MMCIFlex.
Since Biopython 1.42, the underlying Bio.PDB.mmCIF.MMCIFlex module has not been installed by
default. It requires a third party tool called flex (fast lexical analyzer generator). At the time of
writing, you'll have to install flex, then tweak your Biopython setup.py file and reinstall from source.
12. I looked in a directory for code, but I couldn't seem to find the code that does something. Where's it
hidden?
One thing to know is that we put code in __init__.py files. If you are not used to looking for code
in this file this can be confusing. The reason we do this is to make the imports easier for users. For
instance, instead of having to do a "repetitive" import like from Bio.GenBank import GenBank, you
can just use from Bio import GenBank.
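The version requirements scattered through the FAQ can be collected into a small lookup. This is a sketch only: the table copies the minimum versions stated in the answers above, while `MIN_VERSION`, `parse_version` and `is_supported` are hypothetical names, and the parser assumes plain dotted numeric versions such as 1.49.

```python
# Minimum Biopython versions taken from the FAQ answers above.
MIN_VERSION = {
    "NumPy instead of Numeric": "1.49",
    "Seq (back) transcription and translation methods": "1.49",
    "Bio.SeqIO.parse": "1.43",
    "Bio.SeqIO.read": "1.45",
    "Bio.AlignIO": "1.46",
    "alphabet argument for Bio.SeqIO / Bio.AlignIO input": "1.49",
    "str() giving the full Seq sequence": "1.45",
    "Bio.Entrez.read": "1.46",
}


def parse_version(text):
    """Turn '1.49' into (1, 49) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def is_supported(feature, installed):
    """Return True if the installed version meets the FAQ's stated minimum."""
    return parse_version(installed) >= parse_version(MIN_VERSION[feature])


# Example: what works on a (hypothetical) Biopython 1.45 installation?
for feature in sorted(MIN_VERSION):
    status = "ok" if is_supported(feature, "1.45") else "needs upgrade"
    print(feature, "->", status)
```

Comparing tuples of integers avoids the trap of string comparison, where "1.9" would sort after "1.49".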
Chapter 2
Quick Start - What can you do with Biopython?
This section is designed to get you started quickly with Biopython, and to give a general overview of what is
available and how to use it. All of the examples in this section assume that you have some general working
knowledge of Python, and that you have successfully installed Biopython on your system. If you think you
need to brush up on your Python, the main Python web site provides quite a bit of free documentation to
get started with (http://www.python.org/doc/).
Since much biological work on the computer involves connecting with databases on the internet, some of
the examples will also require a working internet connection in order to run.
Now that that is all out of the way, let's get into what we can do with Biopython.
2.1 General overview of what Biopython provides
As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with "things
of interest" to biologists working on the computer. In general this means that you will need to have at
least some programming experience (in Python, of course!) or at least an interest in learning to program.
Biopython's job is to make your job easier as a programmer by supplying reusable libraries so that you
can focus on answering your specific question of interest, instead of focusing on the internals of parsing a
particular file format (of course, if you want to help by writing a parser that doesn't exist and contributing
it to Biopython, please go ahead!). So Biopython's job is to make you happy!
One thing to note about Biopython is that it often provides multiple ways of "doing the same thing."
Things have improved in recent releases, but this can still be frustrating as in Python there should ideally
be one right way to do something. However, this can also be a real benefit because it gives you lots of
flexibility and control over the libraries. The tutorial helps to show you the common or easy ways to do
things so that you can just make things work. To learn more about the alternative possibilities, look in the
Cookbook (Chapter 12, this has some cool tricks and tips), the Advanced section (Chapter 13), the built-in
"docstrings" (via the Python help command, or the API documentation) or ultimately the code itself.
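Docstrings can also be read programmatically, which is handy inside scripts. This is a minimal sketch using the standard library's `inspect.getdoc`. A built-in string method stands in for a Biopython object so that the example runs without Biopython installed, and `first_doc_line` is a hypothetical helper.

```python
import inspect


def first_doc_line(obj):
    """Return the first line of an object's docstring, or '' if it has none."""
    doc = inspect.getdoc(obj)
    return doc.splitlines()[0] if doc else ""


# str.upper is only a stand-in; with Biopython installed one could pass
# an object such as Bio.Seq.Seq instead.
print(first_doc_line(str.upper))

# The interactive equivalent is the built-in help() function, e.g. help(str.upper).
```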
2.2 Working with sequences
Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we'll start with a quick
introduction to the Biopython mechanisms for dealing with sequences, the Seq object, which we'll discuss in
more detail in Chapter 3.
Most of the time when we think about sequences we have in mind a string of letters like 'AGTACACTGGT'.
You can create such a Seq object with this sequence as follows (the >>> represents the Python prompt,
followed by what you would type in):
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print my_seq
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()
What we have here is a sequence object with a generic alphabet, reflecting the fact we have not
specified if this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and
Threonines!). We'll talk more about alphabets in Chapter 3.
In addition to having an alphabet, the Seq object differs from the Python string in the methods it
supports. You can't do this with a plain string:
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
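What these two methods compute can be spelled out on a plain string. This is a minimal sketch, not Biopython's implementation: `complement` and `reverse_complement` here are stand-alone functions written for illustration in Python 3, and they assume an upper-case, unambiguous DNA sequence.

```python
# Watson-Crick pairing for unambiguous upper-case DNA letters.
PAIRS = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(dna):
    """Swap each base for its pairing partner, keeping the original order."""
    return "".join(PAIRS[base] for base in dna)


def reverse_complement(dna):
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

The printed strings match the sequences shown in the interpreter session above.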
The next most important class is the SeqRecord or Sequence Record. This holds a sequence (as a Seq
object) with additional annotation including an identifier, name and description. The Bio.SeqIO module
for reading and writing sequence file formats works with SeqRecord objects, which will be introduced below
and covered in more detail by Chapter 4.
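The idea of a record that bundles a sequence with its annotation can be sketched with a small data class. This is an illustration rather than the SeqRecord class itself: `SimpleRecord`, its `to_fasta` method and the example values are hypothetical, the sequence is held as a plain string instead of a Seq object, and Python 3.7 or later is assumed.

```python
from dataclasses import dataclass


@dataclass
class SimpleRecord:
    """A sequence plus the annotation named in the text: identifier, name, description."""
    seq: str
    id: str = "<unknown id>"
    name: str = "<unknown name>"
    description: str = "<unknown description>"

    def to_fasta(self, width=60):
        """Format the record as FASTA text, wrapping the sequence at `width` letters."""
        lines = [">%s %s" % (self.id, self.description)]
        for start in range(0, len(self.seq), width):
            lines.append(self.seq[start:start + width])
        return "\n".join(lines) + "\n"


# Example values invented for this sketch.
record = SimpleRecord("AGTACACTGGT", id="example1", name="example",
                      description="made-up record")
print(record.to_fasta(), end="")
```

Keeping the annotation next to the sequence is what lets one object be written out in different file formats without losing its identifier or description.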
This covers the basic features and uses of the Biopython sequence class. Now that you've got some idea
of what it is like to interact with the Biopython libraries, it's time to delve into the fun, fun world of dealing
with biological file formats!
2.3 A usage example
Before we jump right into parsers and everything else to do with Biopython, let's set up an example to
motivate everything we do and make life more interesting. After all, if there wasn't any biology in this
tutorial, why would you want to read it?
Since I love plants, I think we're just going to have to have a plant based example (sorry to all the fans
of other organisms out there!). Having just completed a recent trip to our local greenhouse, we've suddenly
developed an incredible obsession with Lady Slipper Orchids (if you wonder why, have a look at some Lady
Slipper Orchids photos on Flickr, or try a Google Image Search).
Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying
evolution and systematics. So let's suppose we're thinking about writing a funding proposal to do a molecular
study of Lady Slipper evolution, and would like to see what kind of research has already been done and how
we can add to that.
After a little bit of reading up we discover that the Lady Slipper Orchids are in the Orchidaceae family and
the Cypripedioideae sub-family and are made up of 5 genera: Cypripedium, Paphiopedilum, Phragmipedium,
Selenipedium and Mexipedium.
That gives us enough to get started delving for more information. So, let's look at how the Biopython
tools can help us. We'll start with sequence parsing in Section 2.4, but the orchids will be back later on as
well: for example we'll search PubMed for papers about orchids and extract sequence data from GenBank in
Chapter 7, extract data from Swiss-Prot from certain orchid proteins in Chapter 8, and work with Clustalw
multiple sequence alignments of orchid proteins in Section 12.2.1.