Hands-on genome annotation (Chinese version)

translated by Dou Kai with additions by Chen PeiJie

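Genome annotation, gene structure and function

In this exercise you will be introduced to genome annotation tools and test in practice your understanding of the structure and function of a gene and its role in the central dogma of biology. The material for this exercise is the Escherichia coli K-12 standard genome.

The outcome of the exercise is the identification and functional prediction of three genes (the lac operon) in E. coli. Before beginning, it is helpful to work through the following learning goals:

refresh your knowledge of the basic structure of a gene
apply the fundamentals of the central dogma to a gene sequence
compare and contrast the structure of eukaryotic and prokaryotic genes

The table below lists the key components of a bacterial gene for reference: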

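Component	Definition
Promoter	Two conserved sequences located about 10 and 35 bp upstream of the transcription start site (the -10 and -35 regions); the recognition and binding site for RNA polymerase and its sigma factor.
Transcription start site	The nucleotide (+1) at which RNA polymerase begins transcription.
Transcription termination site	The nucleotide position at which transcription ends and RNA polymerase dissociates from the DNA.
Shine-Dalgarno sequence	The binding site for the small ribosomal subunit.
Translation start site	The nucleotide position corresponding to the first aminoacyl-tRNA, from which elongation of the peptide chain begins.
Translation stop site	The stop codon that marks the end of translation.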

The English version can be seen here

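Part 1: Gene function

What is a gene? How are genes identified?

Usually, one of the first steps after determining the nucleotide sequence of a genome is to identify all potential genes in it. This is typically done by a computer program that runs a set of mathematical algorithms. To understand that process better, it is important to be able to recognize the critical components of a gene. In this exercise we will therefore use a small portion of a sequenced genome to identify a gene.

We will be working with a bacterial chromosome. Bacterial chromosomes are commonly a single circular double-stranded DNA molecule, ranging in size from just under 1 million to about 6 million base pairs.

Before starting, refresh your understanding of how a gene is defined (gene structure and function in a bioinformatics context) and formulate an answer to a question that is both high-level and simple: what role does a gene play in the central dogma (the process of transcription and translation)?

Imagine that a newly identified bacterium has just had its genome sequenced. How would you identify a gene in this genome (i.e., what features would you look for in the sequence to establish that a gene is present)?

The components conserved across all protein-coding genes are the start and stop codons for translation. The first step in predicting the boundaries of a protein-coding sequence is therefore to identify start and stop codons. Identifying a promoter and a Shine-Dalgarno sequence then helps to confirm that the predicted gene exists.

Please note that promoters and Shine-Dalgarno sequences, although fairly conserved within one bacterium, vary between organisms. In addition, not every gene in an operon has a promoter directly in front of it. So although these sequences matter for gene function, they are not usually used in the first step of gene identification.

How might gene structure be similar or different when searching for genes in bacteria and in eukaryotes?

Differences: Promoters differ between bacteria and eukaryotes. Many eukaryotic promoters have a single core element (the TATA box, roughly 25 to 35 bp upstream of the transcription start site) rather than the two-part -35/-10 bacterial promoter. This is because the RNA polymerases and the transcription factors that bind the promoter are structurally different. Eukaryotes have no Shine-Dalgarno sequence; instead, the ribosome is recruited through the cap structure added to the 5' end of the transcript.

Similarities: The start and stop sites for translation are similar. The relative arrangement of gene components (promoter, transcription start and stop, translation start and stop) is similar.

Below is a single strand of DNA taken from a bacterial chromosome. Locate the items listed below in the sequence and answer the questions that follow; this will help you identify the gene it contains. The strand shown is the coding strand, meaning its sequence matches that of the transcribed RNA (with T in place of U); RNA polymerase reads the complementary template strand.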

ctcattaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcaca

cacaaggaaacagctatgaccatcattacggattcactggccgtcgacggcaggccacgttcggcaatttaacg

agcgttattgaaataggcgggggcacgccccctctagtactcataaaaaaagtgatcat

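The items below are all components of a gene. Please define each of them. For example, when defining the transcription start site, state what happens at that position. In addition, identify and label each component on the sequence above.

Promoter (the region is rich in T and A)
Pribnow box or -10 region: tatgtt (consensus tataat)
-35 region: tttaca (consensus ttgaca)
Transcription start site (+1; infer it from the position of the promoter)
Translation start site (atg)
Translation stop site (taa, tag or tga)
Shine-Dalgarno sequence (agga)
Transcription termination site (an approximate region; it need not be located exactly)

Feel free to present your results in whatever way is most informative. My own approach would be to import the sequence into GeneDoc, enlarge the font (25-30) and remove the consensus line (use Ctrl+G for the settings). I would then use one of the many web tools (or MEGA) to translate the nucleotide sequence into amino acids (ExPASy, for example, is very handy). This shows immediately whether the sequence contains a meaningful reading frame. Next, I would use the search tool in GeneDoc to look for the important motifs, then take a screenshot of the GeneDoc view with a snipping tool and paste it into PowerPoint for annotation. You are, of course, free to present the sequence in whatever way you are comfortable with.

What is the primary structure of the protein encoded by this gene? Remember that this strand carries the same sequence information as the RNA produced during transcription. Ideally, I would like you to translate the sequence into amino acids by hand using the genetic code; you should practice this manually at least once. Replace T with U when using the triplet codon table. You may of course also use a web-based translation tool, or do both and compare the results.

You should now have every component mapped onto the sequence. The order of these components is essential for successful transcription and translation of a gene. To appreciate why, please answer the following questions about the gene's structure:

Where are the translation start and stop sites relative to the transcription start and stop sites? Why is this arrangement important?
Where is the Shine-Dalgarno sequence relative to the transcription start site? Why is this arrangement important?
Where is the promoter relative to the transcription and translation start sites? Why is this arrangement important?

Please note that the sequence above was constructed for the exercise (the polypeptide has been truncated). If you are curious about the original, you can use a sequence similarity search to find the accession number of the full-length sequence and the corresponding protein.

(Optional) What protein does the sequence encode? Is the protein secreted?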

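Part 2: Annotation of an unknown sequence

The sequence for this exercise comes from the NCBI database: Escherichia coli K12 substr. W3110 (ref: EF136884.1). When working with the sequence in any annotation tool, make sure it contains no spaces. Paste the sequence into Notepad or a Word document and save it as a .txt file.

Learning goals:

Verify the presence of a gene by identifying its key components.
Identify the key components of an operon.
Predict the function of a protein from its nucleotide sequence.
Describe the basic steps of genome annotation.

This exercise guides you through the first steps of manual genome annotation. Given a sequence (part of a genome), you will work out how many potential genes it contains and what their putative functions are. Follow the steps below to complete the exercise.

Obtain the nucleotide sequence for the exercise. It is approximately 5500 nucleotides long and represents one strand of a double-stranded DNA chromosome. Make sure you have saved the sequence in a .txt file and that it contains no spaces; otherwise the gene-finding tool may produce incorrect results.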

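What follows is an open-ended evaluation process!

One option is to analyse the sequence by similarity. This is possible only because the organism is a well-known model organism with several sequenced genomes deposited in public databases. A search with the BLAST algorithm is a useful step here. Even with the help of tools, however, you will need to spend time manually locating your roughly 5500-nucleotide fragment on the reference sequence returned by BLAST, which is tedious work. If you go this way, retrieve the nucleotide sequences corresponding to all detected proteins and align them to your sequence (if you prefer to align at the amino-acid level, translate your nucleotide sequence in all six reading frames). Always record the accession numbers of the sequences you retrieve.

Another option is to use a computer program with annotation functions, for example Artemis: Genome Browser and Annotation Tool (Rutherford et al., 2000, Sanger Institute). Artemis is a free genome browser and annotation tool. It visualizes sequence features, next-generation sequencing data, and analysis results in the context of the sequence and its six-frame translation.

Artemis is written in Java and is available for UNIX, Macintosh and Windows. It can read EMBL and GENBANK database entries, as well as sequences in FASTA, indexed FASTA or raw format; additional sequence features can be read in EMBL, GENBANK or GFF format. The program uses a mathematical algorithm to identify potential genes in the genome and helps you locate their translation start and stop sites.

Search for “Artemis: Genome Browser and Annotation Tool” to find the download site and click the “Download” tab. You will be taken to a page where you can download the installation package or launch the program directly. If you do not want to download the software, click the “Launch Artemis” button.

A small window with the program title (Artemis) will open. Click File and then Open. Find and open your sequence file (the program looks for sequence files by default, so you will need to change “sequence files” to “all files” in the file dialog). If this works, you will see your sequence in the program.

A brief description of the Artemis window: at the top are three rows containing black vertical lines, then two rows with gray bars, then another three rows containing black vertical lines. Each of the six rows represents one reading frame of the six-frame translation: the top three are the frames of the displayed strand and the bottom three are the frames of the complementary strand. The black vertical lines mark stop codons in that reading frame. Use the scroll bar on the right to zoom in and out. The lower part of the window shows the same information in more detail: the sequence itself, the six reading frames and their amino acids. Use the bottom scroll bar to move along the sequence and the right-hand scroll bar to zoom in on or out of a region.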

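Identify putative genes in your unknown sequence:

We will now use Artemis to find putative genes in your sequence.

To start the gene-finding algorithm, open the “create” menu and select “mark open reading frames…” (an open reading frame, or ORF, is the term for a putative gene).

Artemis will ask for the minimum size of an open reading frame (in amino acids). To avoid predicting very short genes (which are usually not genes at all, just a start and a stop codon that happen to lie close together), we will search for open reading frames longer than 200 amino acids, so enter 200.

All putative ORFs or genes will be highlighted in blue. How many putative genes do you see?

Let us first verify the start codons. As the black lines show, the program does an excellent job of identifying stop codons. It is just as important to check which codon is used as the start codon, because some bacteria prefer particular start codons. For this sequence we will look for the conventional ATG start codon.

In the magnified bottom window, use the bottom scroll bar to move to the beginning of the first gene, or double-click the first gene in the top window. The gene will be highlighted in the bottom window. Does the putative gene start with an ATG start codon? If not, you can trim it to the start codon you want.

If the putative gene does not start with ATG, click it to highlight it. Open the “edit” menu at the top of the program, select “trim selected feature”, then “trim to met”.

Repeat this for the other genes.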

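How can we verify that the putative genes are correct?

Artemis has found the translation start and stop sites of the putative genes. What other structures or sequences could you look for to help confirm that the predictions are correct? List the structures you would search for below:

If you have decided to look for Shine-Dalgarno sequences and promoters, here is a guide to finding these important gene features.

Find the Shine-Dalgarno region:

Where on a gene would the Shine-Dalgarno sequence be located? Move to that area in the bottom window and look for the Shine-Dalgarno sequences AGGA or AGCA.

Should every gene have a Shine-Dalgarno sequence? Why or why not?
Did you find a Shine-Dalgarno sequence for every putative gene? What does this tell you?

Find the promoter region:

Where on a gene would the promoter be located? Move to that area of the putative gene in the bottom window.

Although this genome may have different promoter consensus sequences, we will search with a set of Escherichia coli sequences.

(-35: TTTACA, consensus TTGACA; -10: TATGTT, consensus TATAAT). Note that both regions are rich in T and A.

Must you find a promoter for every putative gene? Why or why not?

Did you find a promoter for every putative gene? If not, what does this suggest?

Are these real genes?

Based on the information you have collected so far, judge whether the putative genes are real, and support your view with the evidence you have gathered.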

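What is the putative function of these genes?

Now that you have finished a basic analysis of the gene sequences, you can examine the function of the proteins they encode. What do these proteins do in the cell? Do you trust your prediction, or does it still need to be verified experimentally?

Collect the protein products of the putative genes for comparison.

Since you have the putative gene sequences, you could use a codon table to predict the encoded protein sequences. Artemis can do this for you. Click the first gene (it will be highlighted), open the “View” menu and click “amino acids of selection”. A separate window opens showing the primary protein sequence of the selected gene. Highlight and copy this amino acid sequence.

Compare your protein sequence against a database of proteins whose functions have been verified.

What is the name of the database protein that is similar to yours?

Which organism does the similar protein come from (this information is shown at the top of the alignment)?

Reflect on what the alignment tells you. Is your protein likely to have a function similar to that of the database protein? Please provide evidence to support your view.

Function information:

Use the information about the matching protein to begin your research. What is the function of your protein? EcoCyc (http://ecocyc.org/) is an excellent database for predicting the function of E. coli proteins. Its curators compile metabolic and functional annotation of the E. coli genome from scientific literature based on experimental data, including protein structures, enzymatic functions, regulation of gene products, and the reconstruction of metabolic pathways. If your predicted protein matches an entry in this database, you can learn much more about it and predict more about its function.

Go to the EcoCyc web page and type the name of your putative protein (or the four-letter gene name) into the search box on the right, then click the quick search button. If a similar protein exists in the E. coli K-12 genome, it will be listed under “protein”. Click the name of your protein.

What do the results tell you? You should see information on the regulation of the gene and on the location and function of the protein. Research your putative proteins as thoroughly as you can, and describe their function based on the information you find in EcoCyc.

Complete the functional analysis (BLAST and EcoCyc) for all of your putative proteins. Please address: do these proteins have anything in common in terms of function? Are their functions related?

Discuss possible mechanisms of regulation for these proteins.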

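Software installation:

Artemis is a free genome browser and annotation tool that visualizes sequence features and next-generation data analysis results in the context of the sequence and its six-frame translation. It is developed and maintained by the Sanger Institute in the UK. Artemis is written in Java and is available for UNIX, Macintosh and Windows. It can read EMBL and GENBANK database entries or sequences in FASTA, indexed FASTA or raw format; other sequence features can be in EMBL, GENBANK or GFF format. You can view and download the version for your operating system from its download page, http://sanger-pathogens.github.io/Artemis/Artemis/.

Because Artemis is written in Java, it needs a Java environment to run, so you need to download and install Java on your computer. Installing the Java software for your operating system provides the Java Runtime Environment (JRE); see the Java home page, https://www.java.com/zh-CN/download/. You also need to download and install, from the Java Development Kit (JDK) page, https://www.oracle.com/java/technologies/downloads/#jdk18-windows, the JDK of the same version as the JRE you installed, to ensure that everything works.

Once both the JRE and the JDK are installed, extract the Artemis archive into a folder of your choice and double-click the artemis.jar icon to open the program.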

April 25, 2022

Hands-on genome annotation

Part 1: Gene function

How are genes identified? What is a gene?

Usually, one of the first steps after determining the nucleotide sequence of a genome is to identify all potential genes in it. Typically, this is done with a computer program that runs a variety of mathematical algorithms. To better understand this process, however, it is important to be able to identify the critical components of a gene. In this exercise, we will therefore use a small portion of a sequenced genome to identify a gene.

We will be working with a bacterial chromosome. Bacterial chromosomes are commonly a single circular piece of double-stranded DNA. Their size can range from just under 1 million to about 6 million base pairs.

Before starting this exercise, refresh your understanding of how a gene is defined (gene structure and gene function in a bioinformatics context) and formulate an answer to a question that is both high-level and simple: what role does a gene play in the central dogma (the process of transcription and translation)?

Imagine that a newly identified bacterium has just had its genome sequenced. Propose how you might identify a gene in its genome (i.e., what will you look for in the sequence to characterize a gene?).

The conserved components of all genes are the start and stop codons for translation. Therefore, the first step should be identifying the start and stop codons to predict where protein borders are. Then, it would be essential to identify promoters and Shine-Dalgarno sites to verify the existence of the predicted gene.

Please note that promoters and Shine-Dalgarno sequences, although relatively conserved within an individual bacterium, vary from organism to organism. In addition, not every gene in an operon has a promoter directly in front of it. Therefore, although these sequences are important, they are not commonly used for the initial identification of genes.

How might gene structure be similar or different when looking for genes in bacteria and eukaryotes?

Differences: Promoters are different in bacteria and eukaryotes. Many eukaryotic promoters have a single core element (the TATA box, roughly 25 to 35 bp upstream of the transcription start site) rather than the two-part -35/-10 bacterial promoter. This is because the two groups use structurally different RNA polymerases and transcription factors to bind the promoter. Eukaryotes do not have a Shine-Dalgarno sequence for ribosome binding. Instead, the ribosome is recruited through the 5’ cap of the mRNA.

Similarities: Start and stop sites for translation are similar. The structural location of gene components (i.e., the promoter, site for transcription, and translation starts and stops) is similar.

Below is a single strand of DNA taken from a bacterial chromosome. Identify the pieces listed below and answer the following questions to help you find the gene in this sequence. This strand is the coding strand, meaning its sequence matches that of the transcribed RNA (with T in place of U); RNA polymerase reads the complementary template strand.

ctcattaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcaca

cacaaggaaacagctatgaccatcattacggattcactggccgtcgacggcaggccacgttcggcaatttaacg

agcgttattgaaataggcgggggcacgccccctctagtactcataaaaaaagtgatcat

The items below are all components of a gene. Please define each of these items. For example, when asked to define the start site for transcription, please identify what occurs at this location. In addition, please identify and label each region on the sequence above:

Promoter (promoter regions are rich in Ts and As)

Pribnow or -10 region: tatgtt

-35 region: tttaca

start site for transcription (+1; figure out from promoter)

start site for translation (atg)

stop site for translation (taa, tag or tga)

Shine-Dalgarno site (agga)

Stop site for transcription termination (this will be an approximation)
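As a cross-check on labeling by eye, the motif search can be sketched in a few lines of Python. This is only an illustration: the helper name `find_all` is made up for this example, and the search uses the exact strings from the list above, so it will miss any variant of a motif.

```python
import re

# The three lines of the exercise sequence, joined into one string.
SEQ = (
    "ctcattaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcaca"
    "cacaaggaaacagctatgaccatcattacggattcactggccgtcgacggcaggccacgttcggcaatttaacg"
    "agcgttattgaaataggcgggggcacgccccctctagtactcataaaaaaagtgatcat"
)

# Motifs exactly as given in the list above.
MOTIFS = {
    "-35 region": "tttaca",
    "-10 region": "tatgtt",
    "Shine-Dalgarno": "agga",
    "start codon": "atg",
}


def find_all(seq, motif):
    """Return the 1-based start position of every match, including overlapping ones."""
    return [m.start() + 1 for m in re.finditer(f"(?={re.escape(motif)})", seq)]


for name, motif in MOTIFS.items():
    print(f"{name:<16}{motif:<8}{find_all(SEQ, motif)}")
```

A short motif such as atg can occur more than once, so a hit alone does not make a start site; its position relative to the promoter and the Shine-Dalgarno site decides which hit is plausible.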

Please feel free to present the results in whatever way you find most informative. I would first import the sequence into GeneDoc, make the font large (25-30), and remove the consensus line (use Ctrl+G for the setup options). Then I would use one of the many web tools (or MEGA) to translate the nucleotide sequence into amino acids (ExPASy, for example, is very handy). This immediately shows whether the sequence contains a meaningful reading frame. Next, I would use the search tool in GeneDoc to look for the essential motifs. Finally, I would use a snipping tool to copy the view from GeneDoc and paste it into, for example, PowerPoint for annotation. That said, you are free to present these relatively simple data in whatever way you are comfortable with.

What is the primary structure of the protein encoded by this gene? Remember, this strand has the same sequence as the RNA produced during transcription. Ideally, I would expect you to use the genetic code and translate the codons into amino acids by hand; this should be practiced manually at least once. Replace the T’s with U’s to use the triplet codon table. Of course, you may also do the translation with one of the numerous web-based translation tools, or do both and compare the outcomes.
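To compare against a manual translation, the codon lookup can be sketched in Python. The sketch builds the standard genetic code table and, as an assumption made for this example, takes the first atg downstream of the first agga as the translation start; `translate` is an illustrative helper, not part of any tool named in this exercise.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The three lines of the exercise sequence, joined into one string.
SEQ = (
    "ctcattaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcaca"
    "cacaaggaaacagctatgaccatcattacggattcactggccgtcgacggcaggccacgttcggcaatttaacg"
    "agcgttattgaaataggcgggggcacgccccctctagtactcataaaaaaagtgatcat"
)


def translate(dna):
    """Translate codon by codon from the first base, stopping before the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with unknown bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Assumption for this sketch: the start codon is the first atg after the first agga.
sd = SEQ.find("agga")
start = SEQ.find("atg", sd)
print(f"start codon at position {start + 1}")
print(translate(SEQ[start:]))
```

If the printed peptide differs from your hand translation, check the reading frame first: shifting the start by one base changes every codon that follows.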

You should have all of the terms mapped on the sequence. The order of these terms is essential for the successful transcription and translation of a gene. To address its importance, please answer the following questions regarding the gene’s structure:

Where are the start and stop sites for translation relative to the start and stop sites for transcription? Why is this important?

Where is the Shine-Dalgarno site relative to the start sites for transcription and translation? Why is this important?

Where is the promoter relative to the start sites for transcription and translation? Why is this important?

Please note that this sequence has been composed for the exercise (the polypeptide length has been truncated). However, if you are curious enough, you may use the sequence similarity search and identify the accession number of the entire gene and the protein used for this exercise.

(optional) What does it encode? Is the protein secreted?

Part 2: Annotation of an unknown sequence

This sequence is obtained from the NCBI database: Escherichia coli K12 substr. W3110 (ref: EF136884.1). The sequence must contain no gaps or spaces when you work with it in any annotation tool. Paste the sequence into Notepad or a Word file and save it as a .txt file for use.

Learning goals:

Verify the presence of a gene by identifying its key components.

Identify critical components of an operon.

Predict the function of a protein based on its gene sequence.

Describe the basic steps for annotating a genome.

This exercise will guide you through these first steps of manual genome annotation. You have been given an unknown sequence (part of a genome) to decipher the number and putative function of any potential genes in this sequence. Follow the steps below to complete the activity.

Obtain your nucleotide sequence. It is approximately 5500 nucleotides long and represents one strand of a double-stranded DNA chromosome. Make sure you have the sequence in a .txt file and that there are no spaces or gaps between the nucleotides; these can cause inaccurate analysis by the gene-finding program we will use.
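Sequences copied from a database page often carry a FASTA header, position numbers, spaces and line breaks. A minimal Python sketch for stripping these before saving the .txt file is shown here; `clean_sequence` is a hypothetical helper, and the accepted alphabet (A, C, G, T, N) is an assumption of this example.

```python
def clean_sequence(text):
    """Drop FASTA header lines, whitespace and digits; reject unexpected symbols."""
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    seq = "".join(ch for ch in "".join(lines) if ch.isalpha()).upper()
    unexpected = set(seq) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected symbols in sequence: {sorted(unexpected)}")
    return seq


# Example with a made-up header and GenBank-style numbering.
raw = ">example header\n   1 acgtacgtac gtacgtacgt\n  21 ttgaca\n"
print(clean_sequence(raw))
```

Raising an error on unexpected letters, rather than silently dropping them, makes it harder to analyse a file that was damaged while copying.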

What follows is an open-ended evaluation process!

One option is to analyse sequence similarity. This is possible only because we are working with a well-known model organism for which multiple genomes have been sequenced and deposited. A search with the BLAST algorithm is therefore a helpful step. Be prepared, however, to spend some time locating your small 5500-nt fragment in the reference sequence that BLAST returns; this is tedious manual work. If you go this way, retrieve the nucleotide sequences for all detected proteins and align them to the target sequence (if you prefer to work at the amino-acid level, translate your sequence and consider all six reading frames). Always record accession numbers!

Another option is to use a computer program developed for this purpose. For example, you may use the program Artemis: Genome Browser and Annotation Tool (Rutherford et al., 2000 Sanger Institute). Artemis is a free genome browser and annotation tool that allows the visualization of sequence features, next-generation data, and the results of analyses within the sequence context and its six-frame translation.

Artemis is written in Java and is available for UNIX, Macintosh, and Windows systems. It can read EMBL and GENBANK database entries or sequences in FASTA, indexed FASTA, or raw format. Other sequence features can be in EMBL, GENBANK, or GFF format. This free program uses a mathematical algorithm to identify potential genes in the genome sequence. In addition, it will identify the translation start and stop sites for you.

Find Artemis: Genome Browser and Annotation Tool (locate it yourself with a web search) and click on the “Download” tab. You will be directed to another page where you can download the software onto your computer or launch it directly. To avoid downloading the program, click on the button that says “Launch Artemis.”

A small screen with the program title (Artemis) will open. Click on File and then on Open. Find and open your unknown sequence (The program automatically looks for sequence files. You will have to change the files it looks for by changing the search to “all files” instead of “sequence files.”) If successful, you should see your sequence in the program.

To briefly describe what is seen in Artemis: at the top of the screen are three rows containing black lines, then two gray bars, followed by another three rows containing black lines. Each of the six rows represents one reading frame for translation of the sequence. The top three represent the frames for reading codons on the top strand of DNA, while the bottom three represent the frames on the bottom strand. The black lines mark stop codons within that reading frame. Use the right scroll bar to zoom in and out. The bottom of the program shows the same information in more detail: the sequence itself, the six reading frames, and the amino acids for all six translational frames. Use the bottom scroll bar to move along the entire sequence and the right scroll bar to zoom in on an area.
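The six rows correspond to a six-frame translation, which can be sketched in Python to see where the stop codons (the black lines) come from. This is an independent illustration, not Artemis code; the function names and the frame numbering (+1 to +3, -1 to -3) are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the bottom strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return {frame: protein}; frames +1..+3 read the top strand, -1..-3 the bottom strand."""
    frames = {}
    for sign, strand in ((1, dna.upper()), (-1, reverse_complement(dna))):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames[sign * (offset + 1)] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


for frame, protein in six_frames("ATGAAATAG").items():
    print(f"{frame:+d}  {protein}")
```

Each `*` in a frame's output marks a stop codon in that frame; a long stretch without `*` is what shows up as an open reading frame.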

Identify putative genes in your unknown sequence:

Let us use the program to find putative genes in your sequence.

To activate the algorithm to identify putative genes, find the “create” tab. Then, scroll down and select “mark open reading frames…” (an open reading frame or ORF is another term for a putative gene).

Artemis will ask for the minimum size of an open reading frame (how many amino acids long?). To prevent the identification of very small genes (which are likely not genes but merely a start and a stop codon in close proximity), we are going to search for open reading frames over 200 amino acids long. Type in 200.

Any identified ORFs or putative genes should be highlighted in blue. How many putative genes do you see?
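The idea behind an ORF search with a minimum length can be sketched in Python. This is a simplified illustration and not the algorithm Artemis uses: it assumes an ORF is any stop-free stretch of codons in one of the six frames, and `find_orfs` and its output format are invented for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the bottom strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_aa=200):
    """Return (strand, frame, start, end, codons) for stop-free stretches of >= min_aa codons.

    start and end are 0-based, end-exclusive offsets on the strand that was read.
    """
    orfs = []
    for strand_name, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            run_start = frame
            last = frame + 3 * ((len(strand) - frame) // 3)  # end of the last full codon
            for i in range(frame, last + 1, 3):
                # Close the current stretch at a stop codon or at the end of the strand.
                if i == last or strand[i:i + 3] in STOPS:
                    n_codons = (i - run_start) // 3
                    if n_codons >= min_aa:
                        orfs.append((strand_name, frame + 1, run_start, i, n_codons))
                    run_start = i + 3
    return orfs


# Toy example: a stop codon, then 11 stop-free codons, then another stop codon.
toy = "TAA" + "ATG" + "GCT" * 10 + "TGA" + "CC"
for orf in find_orfs(toy, min_aa=5):
    print(orf)
```

Lowering `min_aa` quickly increases the number of hits, most of them spurious, which is the reason for the 200-amino-acid threshold used above.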

Let us verify the start codon. As shown by the many black lines, the program does an excellent job of identifying stop codons. It is just as important, however, to look at which start codon is being used, because some bacteria have a preference for certain start codons. In this scenario, we will look for the traditional ATG start codon.

In the bottom (more magnified) window, scroll (using the bottom scroll bar) to the front of the first gene or double click on the first gene in the top window. This will highlight the gene sequence in the bottom window. Does it start with an ATG start codon? If not, you can “trim” the gene to our desired start codon.

If the gene does not start with ATG, highlight the gene (by clicking on it). Locate the “edit” tab at the top of the program, select “trim selected feature,” and then select “trim to met.”
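Conceptually, trimming to Met moves the start of the feature to the first in-frame ATG. A minimal Python sketch of that idea follows; `trim_to_met` is a hypothetical helper written for this illustration and says nothing about how Artemis implements the command.

```python
def trim_to_met(strand, start, end):
    """Return the 0-based position of the first in-frame ATG in strand[start:end], or None."""
    for i in range(start, end - 2, 3):
        if strand[i:i + 3].upper() == "ATG":
            return i
    return None


# Toy ORF: two codons precede the first in-frame ATG; the stop codon starts at 12.
orf = "GCTGCTATGGCTTAA"
print(trim_to_met(orf, 0, 12))
```

Stepping in threes matters: an ATG that is out of frame with the ORF is ignored, because translation from it would produce a different protein.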

Do this for your other genes.

How can we verify that these might be genes?

The Artemis program has found the start and stop sites for the translation of each gene. What other structures/sequences might you look for to help verify that these are genes? List any structures you might look for here:

You might have decided to look for the Shine Dalgarno and promoter regions. Here is a guide to examining the genes for these important gene features.

Find the Shine-Dalgarno region:

Where would the Shine Dalgarno be present on a gene? Go to this area of the gene in the bottom window. For example, look for the following Shine Dalgarno sequences: AGGA or AGCA.

Should a Shine Dalgarno be present for each gene? Why or why not?

Did you find one for each gene? What does this information provide?
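The Shine-Dalgarno search can be sketched in Python by scanning a short window upstream of a start codon. The 20-base window is an assumption chosen for this example, `find_sd` is an illustrative helper, and the test region is an excerpt of the Part 1 sequence around its start codon.

```python
import re

# The two motifs named in this exercise; the lookahead also finds overlapping hits.
SD_PATTERN = re.compile(r"(?=(AGGA|AGCA))")


def find_sd(seq, start_codon_pos, window=20):
    """Look for AGGA/AGCA in the `window` bases upstream of a start codon.

    Returns (motif, 0-based position, spacer length to the start codon) tuples.
    """
    seq = seq.upper()
    lo = max(0, start_codon_pos - window)
    upstream = seq[lo:start_codon_pos]
    hits = []
    for m in SD_PATTERN.finditer(upstream):
        motif = m.group(1)
        pos = lo + m.start()
        hits.append((motif, pos, start_codon_pos - (pos + len(motif))))
    return hits


# Excerpt of the Part 1 sequence around its start codon.
region = "tcacacacaaggaaacagctatgacc"
print(find_sd(region, region.find("atg")))
```

The spacer length is worth reporting alongside the hit: a four-letter motif will occur by chance fairly often, and a match far from the start codon is weak evidence.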

Find the promoter region:

Where would the promoter be present on a gene? Go to this area of the gene in the bottom window.

Although this genome may have different promoter consensus sequences, we will use a set of Escherichia coli consensus sequences.

(-35 sequence: TTTACA, consensus TTGACA; -10 sequence: TATGTT, consensus TATAAT). Note that they are rich in T’s and A’s.

Must you find a promoter for each gene? Why or why not?

Did you find a promoter for each gene? If not, what does this suggest?
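A promoter search needs both hexamers and a plausible distance between them. The Python sketch below looks for a -35 box followed by a -10 box with a spacer of 15 to 19 bases; that spacer range is an assumption of this example, `find_promoters` is an illustrative helper, and the hexamers passed in are the ones given for the Part 1 sequence.

```python
import re


def find_promoters(seq, box35, box10, min_gap=15, max_gap=19):
    """Find box35 followed by box10 with min_gap..max_gap bases in between.

    Returns (position of -35 box, position of -10 box, spacer length), 0-based.
    """
    seq = seq.upper()
    pattern = re.compile(
        f"(?=({box35.upper()})([ACGT]{{{min_gap},{max_gap}}}?)({box10.upper()}))"
    )
    return [(m.start(1), m.start(3), len(m.group(2))) for m in pattern.finditer(seq)]


# Promoter excerpt from the Part 1 sequence.
excerpt = "ggctttacactttatgcttccggctcgtatgttgtgtgg"
print(find_promoters(excerpt, "tttaca", "tatgtt"))
```

Exact matching is strict; real promoters often differ from the expected hexamers at one or two positions, so an empty result does not prove that no promoter is present.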

Are these genes?

Based on the evidence you have collected so far, reflect on whether you believe the identified genes are indeed genes. Then, provide evidence to support your claim.

What is the putative function of these genes?

Now that you have finished a basic analysis of your gene nucleotide sequence, it is time to examine the putative function of the encoded product of these genes. What do these proteins do for the cell? Will you trust your prediction, or will laboratory research still be required to verify the result?

Collect the protein sequence for your genes to conduct your comparison.

Because you have the gene sequence, you could use the codon tables to read the predicted amino acid sequence for this gene. However, in Artemis, the program will do the translation for you. Click on the first gene (so it is highlighted). On the menu, select the “View” tab and click on “amino acids of selection.” This will open a separate page with the primary amino acid sequence for the highlighted gene. Highlight and copy this amino acid sequence.

Compare your protein sequence with a database of proteins whose functions have been verified.

What is the protein name that it is similar to?

What organism does this protein come from (you can find this information at the top of the alignment)?

Reflect on what this information tells you. Is it likely that your protein has a function similar to the aligned protein from the database? Please provide evidence to support your claim.

Function information

Use the protein information that you have received from your alignment to begin your research. What is the function of your protein(s)? One excellent database to help predict an E. coli protein’s function is EcoCyc. Curators of the database use the scientific literature based on experimental data to compile a metabolic and functional description of the Escherichia coli genome. Experimental data include protein structure data, enzymatic function, regulation of these gene products, and construction of metabolic pathways within the organism. Should your putative proteins match this database, much can be learned and predicted regarding its function.

Go to the EcoCyc webpage. Type the name of each putative protein you have identified (or the four-letter gene name) into the box on the right, then click Quick Search. If a similar protein has been identified in the E. coli K-12 genome, it should be listed after the search under “proteins.” Next, click on the name of your protein.

What will the results tell you? First, you should see information regarding the regulation of this gene and information regarding the protein and its location and function. Research as much as you can regarding the putative proteins you have identified. Describe their putative functions here, based on the information you can find on the EcoCyc database.

Complete a functional analysis (BLAST and EcoCyc analysis) for all of your putative proteins. Please address: do these proteins have anything in common regarding their function? Might they be related?

Discuss any mechanisms of regulation that might exist for these proteins.

Exercise instructions in Chinese (Chinese version)

April 19, 2022; August 16, 2022

Exercise on Genome annotation, gene structure and function

In this exercise you will get an introduction to genome annotation tools and verify in practice your understanding of the structure and function of a gene and its role in the central dogma of biology. The material for this exercise is the Escherichia coli K-12 standard genome.

The outcome of the exercise is the identification and prediction of the function of three genes (the lac operon) in E. coli. Before beginning, it is helpful to address the learning goals as follows:

refresh your knowledge on the basic structure of the gene
apply the fundamentals of the central dogma to a gene sequence
compare and contrast the structure of a eukaryotic and prokaryotic gene

For your help, the figure shows key components of a bacterial gene
Part 1:

See the practical part in English
See the practical part in Chinese