Exercise on Genome annotation, gene structure and function


In this exercise you will get an introduction to genome annotation tools and verify in practice
your understanding of the structure and function of a gene and its role in the central
dogma of biology. The material for this exercise is the Escherichia coli K-12 standard genome.


The outcome of the exercise is the identification of three genes (the lac operon) in E. coli
and the prediction of their function. Before beginning, it is helpful to review the learning
goals:


  • refresh your knowledge on the basic structure of the gene
  • apply the fundamentals of the central dogma to a gene sequence
  • compare and contrast the structure of a eukaryotic and prokaryotic gene

To help you, the figure shows the key components of a bacterial gene.
Part 1:


See the practical part in English
See the practical part in Chinese 中文练习说明

Cai et al. 2021 The pleiotropic functions of intracellular hydrophobins in aerial hyphae and fungal spores

HFB4::mRFP on the surface of Trichoderma guizhouense NJAU 4742

Cai F, Zhao Z, Gao R, Chen P, Ding M, Jiang S, et al. (2021) The pleiotropic functions of intracellular hydrophobins in aerial hyphae and fungal spores. PLoS Genet 17(11): e1009924. https://doi.org/10.1371/journal.pgen.1009924

Animated 3D reconstructions of extracellular HFB-enriched matrices coating sporulating Trichoderma colonies.

Higher fungi can rapidly produce large numbers of spores suitable for aerial dispersal. The efficiency of the dispersal and spore resilience to abiotic stresses correlate with their hydrophobicity provided by the unique amphiphilic and superior surface-active proteins – hydrophobins (HFBs) – that self-assemble at hydrophobic/hydrophilic interfaces and thus modulate surface properties. Using the HFB-enriched mold Trichoderma (Hypocreales, Ascomycota) and the HFB-free yeast Pichia pastoris (Saccharomycetales, Ascomycota), we revealed that the rapid release of HFBs by aerial hyphae shortly prior to conidiation is associated with their intracellular accumulation in vacuoles and/or lipid-enriched organelles. The occasional internalization of the latter organelles in vacuoles can provide the hydrophobic/hydrophilic interface for the assembly of HFB layers and thus result in the formation of HFB-enriched vesicles and vacuolar multicisternal structures (VMSs) putatively lined up by HFBs. These HFB-enriched vesicles and VMSs can become fused in large tonoplast-like organelles or move to the periplasm for secretion. The tonoplast-like structures can contribute to the maintenance of turgor pressure in aerial hyphae supporting the erection of sporogenic structures (e.g., conidiophores) and provide intracellular force to squeeze out HFB-enriched vesicles and VMSs from the periplasm through the cell wall. We also show that the secretion of HFBs occurs prior to the conidiation and reveal that the even spore coating of HFBs deposited in the extracellular matrix requires microscopic water droplets that can be either guttated by the hyphae or obtained from the environment. Furthermore, we demonstrate that at least one HFB, HFB4 in T. guizhouense, is produced and secreted by wetted spores. We show that this protein possibly controls spore dormancy and contributes to the water sensing mechanism required for the detection of germination conditions. 
Thus, intracellular HFBs have a range of pleiotropic functions in aerial hyphae and spores and are essential for fungal development and fitness.


FungiG Course on Fungal Genes

Winter semester 2021: Trichoderma genes!

Chinese Version: 中文

  • personal tutoring
  • individual schedule
  • real scientific material
  • useful outcome
  • personal tasks
  • up-to-date science
  • no lectures
  • no written reports
  • no shared deadlines
  • individual work
  • online and remote
  • optional group work

What will I learn?

  • The broad scope of fungal functional genetics
  • Fungal genes
  • Genomics
  • Gene nomenclature
  • MycoCosm, KEGG, and other web resources on fungal genetics
  • Recent literature on fungal genetics (incl. fungal diversity and applications)
  • Research community
  • Terminology
  • Evolution, diversity, and speciation

How will it work?

IrinaDruzhinina WeChat
  • Register by sending a personal message to FungiG or contact “IrinaDruzhinina” on WeChat. Below please find the QR code.
  • Get the first task (1/10 of the entire exercise).
  • Submit your results and questions (email or WeChat).
  • Get feedback and answers to your questions.
  • Repeat until the course is completed.

What is the content of the course?

The WS2021 course will be based on the dynamic list of Trichoderma genes and Trichoderma-associated genes (mainly plant genes studied along with Trichoderma). The next courses may be based on other model fungi.

The tasks for this course are divided into small sets. Each time, a student will get his or her own random set of 2–3 fungal genes. The task will be to search for the genome IDs, protein IDs, function(s), evolutionary history, mutant(s), phenotype(s), GO term(s), KEGG group(s), genomic/chromosome location(s), cluster organization(s), functionally associated gene(s), published gene name(s), host genome(s), orthologue(s), paralogue(s), other homologue(s), patent(s), applied value(s), product(s), and reference(s).

The student is expected to systematically collect associated counts, such as the total number of publications, number of patents, number of Trichoderma spp. studied, etc. Some genes are intensively studied (e.g., cbh1, lae1), and the task will take more effort for them than for genes that have been published only once.

What should I know before the course?

  • Basic eukaryotic microbiology and basic mycology
  • Basic biochemistry and cell biology
  • English reading skills
  • Advanced skills in retrieval of scientific literature (FungiG will provide help)

What is the main challenge of the course?

The concept of a fungal gene, gene definition, gene nomenclature in fungi, inconsistency in research approaches, the diversity of genes, the unequal quality of genome annotations, fungal diversity, and taxonomy.

What exactly should I do to complete the course?

A student is expected to deliver a table (as will be specified in the task) describing the functions and properties of several Trichoderma genes or genes associated with Trichoderma research.

The minimum set of genes is 25 (10 sets of 2–3 genes); the upper limit is 300*.

The advanced version of the course includes a joint (online/offline) seminar with students’ presentations and discussions. The aim of the seminar is to nominate the top ten most studied, the top ten most useful, and the top ten most controversial genes in Trichoderma.

* The total number of Trichoderma genes is ~12 000, but the number of genes studied for their function(s) is still meager.

Can my tasks be redundant to the tasks of my colleagues?

Sometimes, yes. The majority of the tasks will be unique. However, the genes used for the tasks that failed or were superficially performed remain in the pool of genes for the course.

Can I add genes to the list?

Yes. Students are welcome to do so. These can also be genes from other fungi that are not yet studied in Trichoderma. Please send your proposals to ISD.

What is the course language?


What is the schedule of the course?

The schedule of the course is flexible. The results can be sent at any time. The feedback will be returned within 72 hours, or the exact time will be specified.

How long will it take? How deep can I go?

The course is designed such that an advanced Ph.D. student working on fungal genetics is expected to spend one day per week for 10 weeks or to complete it as a block (2 weeks, full time). The minimal workload corresponds to 80–90 working hours or 3 ECTS (European Credit Transfer and Accumulation System, Bologna Process).

As you progress, it should become faster. After you get in shape, you may either spend less time per week or learn more genes.

Can I do the entire course remotely?


Who can attend?

The course will present new material to all FungiG members, ranging from master's students to Ph.D. candidates, postdocs, and alumni professors working at Nanjing Agricultural University, Shanghai Jiao Tong University, Sun Yat-Sen University, Jiangsu Academy of Agricultural Sciences, and other universities. Students from the TU Wien master's program “Biotechnology and Bioanalytics” are welcome.

Students from universities or academic institutions that are not listed above should contact Irina Druzhinina.

Is the course free?

The WS2021 course is free of charge, but the number of places is limited.

Hands-on genome annotation (Chinese version)

translated by Dou Kai with additions by Chen PeiJie


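In this exercise you will get an introduction to genome annotation tools and verify in practice whether your understanding of gene structure, gene function, and the role of a gene in the central dogma of biology is correct. The material for this exercise is the standard genome of Escherichia coli K-12.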


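  • Refresh your knowledge of the basic structure of a gene
  • Apply the fundamentals of the central dogma to a gene sequence
  • Compare and contrast the similarities and differences between eukaryotic and prokaryotic genes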



The English version can be seen here

















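  • Promoter region (rich in T and A bases)
  • Pribnow box or -10 region: sequence tatgtt (or tataat)
  • -35 region: sequence tttaca (or ttgaca)
  • Transcription start site (+1; located from the position of the promoter)
  • Translation start site (atg)
  • Translation stop site (taa, tag or tga)
  • Shine-Dalgarno sequence (agga)
  • Transcription termination site (an approximate region; it need not be determined precisely)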




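  1. Where are the translation start and stop sites relative to the transcription start and stop sites? Why is this arrangement important?
  2. Where is the Shine-Dalgarno sequence relative to the transcription start site? Why is this arrangement important?
  3. Where is the promoter region relative to the transcription and translation start sites? Why is this arrangement important?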




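The sequence used in this exercise comes from the NCBI database: Escherichia coli K12 substr. W3110 (ref: EF136884.1). When working with the sequence in any annotation tool, make sure that it contains no spaces. Paste the sequence into a Notebook or Word document and save it as a .txt file for use.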


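  1. Verify the presence of a gene by identifying its key components.
  2. Identify the key components of an operon.
  3. Predict the function of the corresponding protein based on the nucleotide sequence.
  4. Describe the basic steps of genome annotation.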





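Another option is to use a computer program with annotation functions, for example Artemis: Genome Browser and Annotation Tool (Rutherford et al., 2000; Sanger Institute). Artemis is a free and open genome browser and annotation tool. It visualizes sequence features, next-generation sequencing data, and the results of analyses in the context of the sequence, and it also supports six-frame translation.

Artemis is written in Java and is available for UNIX, Macintosh, and Windows systems. It can read EMBL and GENBANK database entries and sequences in FASTA format (indexed FASTA or raw format), as well as additional sequence features saved in EMBL, GENBANK, or GFF format. This free and open software uses a mathematical algorithm to identify potential genes in the genome and helps you identify the translation start and stop sites of each gene.

Search for “Artemis: Genome Browser and Annotation Tool” to find the download site and click the “Download” tab. You will be directed to the download page, where you can download the installation package or install the software directly. If you do not want to download the installation package, click the “Launch Artemis” button.

A small window with the program title (Artemis) will open. Click the File menu and then the Open option. Find and open the sequence file you want to analyze (the program automatically looks for local sequence files, but you need to change “sequence files” to “all files” in the file dialog). If successful, you will see your sequence in the program.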




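To activate the algorithm that identifies putative genes, first find the “create” tab. Scroll down and select “mark open reading frames…” (an open reading frame, or ORF, is another term for a putative gene).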





If the putative gene does not start with ATG, click on it to highlight it. Find the “edit” option at the top of the program, select “trim selected feature,” and then select “trim to met.”







  1. Should every gene have a Shine-Dalgarno sequence? Why or why not?
  2. Did you find a Shine-Dalgarno sequence for each putative gene? What does this tell you?



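  a. Although this genome may have different promoter consensus sequences, we will search using the Escherichia coli consensus sequences (-35: TACACT; -10: TATGTT). Note that both regions are rich in T and A.

  b. Must you find a promoter for each putative gene? Why or why not?

  c. Did you find a promoter for each putative gene? If not, what does this suggest?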






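Now that you have the putative gene sequences, you could use the codon table to predict the protein sequence encoded by each gene. This can be done in Artemis. Click on the first gene (it will be highlighted), select the “View” menu, and click “amino acids of selection.” The program opens a separate page showing the primary protein sequence of the selected gene. Highlight and copy this amino acid sequence.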






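Use the information on the aligned protein to begin your research. What is the function of your protein? EcoCyc (http://ecocyc.org/) is an excellent database for predicting the function of E. coli proteins. Its curators compile a metabolic and functional annotation of the E. coli genome from the scientific literature based on experimental data. These data include protein structure, enzymatic function, regulation of gene products, and the construction of metabolic pathways in E. coli. If your predicted protein matches an entry in this database, you can learn more about it and predict further functions.

Go to the EcoCyc webpage: type the name of your putative protein (or the four-letter gene name) into the box on the right. Click the quick search button. If a similar protein exists in the E. coli K-12 genome, it will be listed under “protein.” Click on the name of your protein.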





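Artemis is a free genome browser and annotation tool that allows the visualization of sequence features and the results of next-generation data analyses within the context of the sequence and its six-frame translation. It is developed and maintained by the Sanger Institute in the UK. Artemis is written in Java and is available for UNIX, Macintosh, and Windows systems. It can read EMBL and GENBANK database entries or sequences in FASTA, indexed FASTA, or raw format. Other sequence features can be in EMBL, GENBANK, or GFF format. On its download page, http://sanger-pathogens.github.io/Artemis/Artemis/, you can view and download the version suitable for your operating system.

Because Artemis is written in Java, it needs a Java environment to run, so you need to download and install Java on your computer. Installing the Java software suitable for your operating system provides the Java Runtime Environment (JRE); for downloads, see the Java home page https://www.java.com/zh-CN/download/. In addition, from the Java Development Kit (JDK) page https://www.oracle.com/java/technologies/downloads/#jdk18-windows, download and install the JDK of the same version as the JRE you downloaded, to ensure that the program works properly.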



Hands-on genome annotation

Part 1: Gene function

How are genes identified? What is a gene?

Usually, one of the first steps after determining the nucleotide sequence of a genome is to identify all potential genes that it may contain. Typically, this is done with a computer program that runs a variety of mathematical algorithms to identify potential genes. However, to better understand this process, it is important to identify the critical components of a gene. Therefore, in this exercise, we will use a small portion of a sequenced genome to identify a gene.

We will be working with a bacterial chromosome. Bacterial chromosomes are commonly organized as a single circular double-stranded piece of DNA. Their size typically ranges from just under 1 million to about 6 million nucleotides.

Before starting this exercise, refresh your understanding of the definition of a gene (in the bioinformatics context: gene structure and gene function) and formulate an answer to a deceptively simple question: what role does a gene play in the central dogma (the process of transcription and translation)?

Imagine that a newly identified bacterium has just had its genome sequenced. Propose how you might identify a gene in its genome (i.e., what will you look for in the sequence to characterize a gene?).

The conserved components of all genes are the start and stop codons for translation. Therefore, the first step should be identifying the start and stop codons to predict where protein borders are. Then, it would be essential to identify promoters and Shine-Dalgarno sites to verify the existence of the predicted gene.

Please note that promoters and Shine-Dalgarno sequences, although relatively conserved within an individual bacterium, vary from organism to organism. In addition, not all genes will have a promoter directly in front of the gene in an operon. Therefore, although these sequences are necessary, they are not commonly used for the initial identification of genes.

How might gene structure be similar or different when looking for genes in bacteria and eukaryotes?

Differences: Promoters are different in bacteria and eukaryotes. Bacterial promoters carry two conserved elements (the -10 and -35 regions), whereas eukaryotic core promoters are usually described by a single conserved element, often a TATA box upstream of the transcription start site. This is because the two groups use structurally different RNA polymerases and transcription factors to bind the promoter. Eukaryotes do not have a Shine-Dalgarno sequence for ribosome binding. Instead, eukaryotes use the 5’ cap of the mRNA for ribosome binding.

Similarities: Start and stop sites for translation are similar. The structural location of gene components (i.e., the promoter, site for transcription, and translation starts and stops) is similar.

Below is a single strand of DNA taken from a bacterial chromosome. Identify the pieces listed below and answer the following questions to help you find the gene in this sequence. This strand is the coding strand, meaning that its sequence matches the mRNA (with T in place of U) that is read during translation.




The items below are all components of a gene. Please define each of these items. For example, when asked to define the start site for transcription, please identify what occurs at this location. In addition, please identify and label each region on the sequence above:

  • Promoter (promoter regions are rich in Ts and As)
  • Pribnow or -10 region: tatgtt
  • -35 region: tttaca
  • Start site for transcription (+1; figure out from promoter)
  • Start site for translation (atg)
  • Stop site for translation (taa, tag or tga)
  • Shine-Dalgarno site (agga)
  • Stop site for transcription termination (this will be an approximation)

Please feel free to find your own way to present the results as informatively as possible. First, I would import the sequence into GeneDoc, make the font large (25-30), and remove the consensus line (use Ctrl+G for the setup options). Then I would use one of the many web tools (or MEGA) to translate the nt sequence to aa (for example, ExPASy is very handy). This immediately gives you an idea of whether there is a meaningful reading frame in this sequence. Then I would use the search tool in GeneDoc to search for the essential motifs. Finally, I would use a snipping tool to copy the view from GeneDoc and paste it into, for example, PowerPoint for annotation. However, once again, you are free to decide how best to present these relatively simple data.
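If you prefer a script to a GUI search, the motif search can be sketched in a few lines of Python. This is only an illustration: the function name find_motifs and the toy sequence are hypothetical (the toy sequence is not the exercise sequence), and the motifs are the ones listed above.

```python
import re

def find_motifs(sequence, motifs):
    """Return {name: [0-based start positions]} for each motif in the sequence."""
    seq = sequence.lower()
    hits = {}
    for name, motif in motifs.items():
        # A lookahead pattern also reports overlapping occurrences.
        pattern = re.compile("(?=" + re.escape(motif.lower()) + ")")
        hits[name] = [m.start() for m in pattern.finditer(seq)]
    return hits

# Hypothetical toy sequence for illustration only (NOT the exercise sequence).
toy = "cctttacagcggcgcgtcattgatatgttgcatcaggaaacagctatgaccatgtaa"
motifs = {
    "-35": "tttaca",
    "-10": "tatgtt",
    "Shine-Dalgarno": "agga",
    "start": "atg",
}
result = find_motifs(toy, motifs)
for name, positions in result.items():
    print(name, positions)
```

Note that a short motif such as atg will usually match at several positions; only the occurrence in the right context (downstream of the promoter and the Shine-Dalgarno site) is a candidate start codon.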

What is the primary protein structure encoded by this gene? Remember, this strand represents the same sequence as the RNA copied during transcription. Ideally, I would expect you to use the genetic code and translate the codons to amino acids by hand; this should be practiced manually at least once. Replace the T’s with U’s to use the codon table. Of course, you may also do the translation with one of the numerous web-based translation tools. Maybe you would wish to do both and compare the outcome.
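To check a manual translation, the following minimal Python sketch builds the standard genetic code and translates a coding-strand sequence codon by codon until the first stop codon. The function names (to_rna, translate) and the short example sequence are hypothetical and chosen only for illustration.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def to_rna(dna):
    """Write the coding strand as mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")

def translate(dna):
    """Translate a coding-strand DNA sequence from its first base to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical example: atg acc atg taa
protein = translate("atgaccatgtaa")
print(to_rna("atgaccatgtaa"), protein)
```

The sequence passed to translate must begin at the start codon; the function does not search for it.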

You should have all of the terms mapped on the sequence. The order of these terms is essential for the successful transcription and translation of a gene. To address its importance, please answer the following questions regarding the gene’s structure:

  • Where are the start and stop sites for translation relative to the start and stop sites for transcription? Why is this important?
  • Where is the Shine-Dalgarno site relative to the start sites for transcription and translation? Why is this important?
  • Where is the promoter relative to the start sites for transcription and translation? Why is this important?

Please note that this sequence has been composed for the exercise (the polypeptide length has been truncated). However, if you are curious enough, you may use a sequence similarity search to identify the accession number of the entire gene and the protein used for this exercise.

(optional) What does it encode? Is the protein secreted?

Part 2: Annotation of an unknown sequence

This sequence is obtained from the NCBI database: Escherichia coli K12 substr. W3110 (ref: EF136884.1). The sequence must not contain any gaps or spaces when you work with it in an annotation tool. Paste this sequence into a Notebook or Word file and save it as a .txt file for use.

Learning goals:

  • Verify the presence of a gene by identifying its key components.
  • Identify critical components of an operon.
  • Predict the function of a protein based on its gene sequence.
  • Describe the basic steps for annotating a genome.

This exercise will guide you through these first steps of manual genome annotation. You have been given an unknown sequence (part of a genome) to decipher the number and putative function of any potential genes in this sequence. Follow the steps below to complete the activity.

Obtain your nucleotide sequence. It is approximately 5500 nucleotides long. This is a single-stranded representation of a double-stranded DNA chromosome. Make sure you have the sequence in a .txt file. In addition, make sure there are no spaces or gaps between the nucleotides, because they can result in inaccurate analysis by the gene-finding program we will use.
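Removing spaces, line breaks, and position numbers by hand is error-prone, so a small script can help. The sketch below is one possible way to do it in Python; the function name clean_sequence and the file names in the usage note are hypothetical.

```python
def clean_sequence(raw_text):
    """Return the sequence in upper case without header lines, whitespace, or digits."""
    # Drop FASTA header lines (they start with ">"), if there are any.
    body = [line for line in raw_text.splitlines() if not line.startswith(">")]
    # Remove all whitespace and any position numbers copied along with the sequence.
    seq = "".join("".join(body).split()).upper()
    seq = "".join(ch for ch in seq if not ch.isdigit())
    # Anything other than A, C, G, T, or N is reported rather than silently removed.
    unexpected = sorted(set(seq) - set("ACGTN"))
    if unexpected:
        raise ValueError("unexpected characters in sequence: " + "".join(unexpected))
    return seq

cleaned = clean_sequence(">example\nacg t\n 11 ggta\n")
print(cleaned, len(cleaned))
```

To use it on a file, read the text with open("my_sequence.txt").read(), pass it to clean_sequence, and write the result to a new .txt file.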

Here you have an open-ended evaluation process!

One option is based on the analysis of sequence similarity. This is possible only because we work with a famous model organism for which multiple genomes have been sequenced and deposited. Consequently, a search using the BLAST algorithm may be helpful. However, be prepared to spend some time locating your tiny 5500 nt-long fragment in the reference sequence returned by BLAST. It would be tedious manual work. If you go this way, please retrieve all nt sequences for your detected proteins and align them to the target sequence (you may also wish to work on the aa level; in that case, translate the sequence and consider all six reading frames). Always record accession numbers!

Another option is to use a computer program developed for this purpose. For example, you may use the program Artemis: Genome Browser and Annotation Tool (Rutherford et al., 2000; Sanger Institute). Artemis is a free genome browser and annotation tool that allows the visualization of sequence features, next-generation data, and the results of analyses within the sequence context and its six-frame translation.

Artemis is written in Java and is available for UNIX, Macintosh, and Windows systems. It can read EMBL and GENBANK database entries or sequences in FASTA, indexed FASTA, or raw format. Other sequence features can be in EMBL, GENBANK, or GFF format. This free program uses a mathematical algorithm to identify potential genes in the genome sequence. In addition, it will identify the translation start and stop sites for you.

Find Artemis: Genome Browser and Annotation Tool (locate it yourself online) and click on the “Download” tab. You will be directed to another page where you can download the software onto your computer or launch it directly. To avoid downloading the program, click on the button that says “Launch Artemis.”

A small screen with the program title (Artemis) will open. Click on File and then on Open. Find and open your unknown sequence (The program automatically looks for sequence files. You will have to change the files it looks for by changing the search to “all files” instead of “sequence files.”) If successful, you should see your sequence in the program.

To briefly describe what is seen in Artemis: there are three rows containing black lines at the top of the screen, then two gray bars, followed by another three rows containing black lines. Each row represents one of the six frames for translation of the sequence. The top three represent the frames for reading the codons on the top strand of DNA, while the bottom three represent the frames for reading the codons on the bottom strand of DNA. The black lines represent stop codons within that reading frame. At the bottom of the program, the same information is shown in more detail: the six reading frames are represented together with the amino acid code for all six translational frames, and the nucleotide sequence is shown here as well. You can use the bottom scroll bar to move along the entire sequence and the right scroll bar to zoom in on an area and zoom back out.

Identify putative genes in your unknown sequence:

Let us use the program to find putative genes in your sequence.

To activate the algorithm to identify putative genes, find the “create” tab. Then, scroll down and select “mark open reading frames…” (an open reading frame or ORF is another term for a putative gene).

Artemis will ask for the minimum size of the open reading frame (how many amino acids long?). In order to prevent the identification of very small genes (that are likely not genes but merely a start and a stop codon in close proximity), we are going to search for open reading frames over 200 amino acids long. Type in 200.
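To see what such a search involves, here is a simplified Python sketch that scans all six reading frames for stop-free stretches of a minimum length that are closed by a stop codon. It is an illustration of the idea, not Artemis's actual implementation; the function names and the toy sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the bottom strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def open_frames(seq, min_codons):
    """List (strand, frame, start, end) for stop-free stretches of at least
    min_codons codons that end at a stop codon. start and end are 0-based
    offsets on the scanned strand; end is the position of the stop codon."""
    found = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - run_start) // 3 >= min_codons:
                        found.append((strand, frame, run_start, i))
                    run_start = i + 3  # the next stretch begins after the stop
    return found

# Hypothetical toy sequence; for the exercise sequence use min_codons=200.
toy = "CC" + "TAA" + "ATGGCTGCTGCTGCTGCT" + "TGA" + "CC"
toy_orfs = open_frames(toy, min_codons=5)
print(toy_orfs)
```

In this sketch a stretch may begin right after a stop codon or at the beginning of the sequence, so the reported start is not necessarily an ATG; this is why the start codon still has to be checked.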

Any identified ORFs or putative genes should be highlighted in blue. How many putative genes do you see?

Let us verify the start codon. As shown by the many black lines, the program does an excellent job identifying stop codons. However, it is also essential to check which start codon is being used. For example, some bacteria have a preference for certain start codons. In this scenario, we will look for the traditional ATG start codon.

In the bottom (more magnified) window, scroll (using the bottom scroll bar) to the front of the first gene or double click on the first gene in the top window. This will highlight the gene sequence in the bottom window. Does it start with an ATG start codon? If not, you can “trim” the gene to our desired start codon.

If the gene does not start with ATG, highlight the gene (by clicking on it). Now locate the “edit” tab at the top of the program. Select “trim selected feature” and then “trim to met.”
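The idea behind trimming can be sketched as follows: move the start of the open reading frame forward, one codon at a time, until an ATG is reached. This Python function is a hypothetical illustration and assumes that trimming goes to the first in-frame ATG; it does not reproduce the Artemis code.

```python
def trim_to_met(seq, start, end):
    """Return the position of the first in-frame ATG between start and end,
    or None if the frame contains no ATG. Positions are 0-based."""
    s = seq.upper()
    for i in range(start, end, 3):
        if s[i:i + 3] == "ATG":
            return i
    return None

# Hypothetical example: codons CCG GCA ATG AAA followed by the stop codon TGA at position 12.
new_start = trim_to_met("CCGGCAATGAAATGA", 0, 12)
print(new_start)
```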

Do this for your other genes.

How can we verify that these might be genes?

The Artemis program has found the start and stop sites for the translation of each gene. What other structures/sequences might you look for to help verify that these are genes? List any structures you might look for here:

You might have decided to look for the Shine-Dalgarno and promoter regions. Here is a guide to examining the genes for these important features.

Find the Shine-Dalgarno region:

Where would the Shine-Dalgarno sequence be located relative to a gene? Go to this area of the gene in the bottom window. Look, for example, for the following Shine-Dalgarno sequences: AGGA or AGCA.

  • Should a Shine-Dalgarno sequence be present for each gene? Why or why not?
  • Did you find one for each gene? What does this information tell you?
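The search for a Shine-Dalgarno sequence can also be scripted. The minimal Python sketch below looks for the two motifs named above in a window just upstream of a start codon and reports how many bases separate each hit from the ATG. The function name find_sd, the 20-nt window, and the toy sequence are assumptions made for illustration only.

```python
def find_sd(seq, start_codon_pos, motifs=("AGGA", "AGCA"), window=20):
    """Return (motif, spacing) pairs, where spacing is the number of bases
    between the end of the motif and the start codon at start_codon_pos."""
    s = seq.upper()
    lo = max(0, start_codon_pos - window)
    upstream = s[lo:start_codon_pos]  # only the region just before the ATG
    hits = []
    for motif in motifs:
        pos = upstream.find(motif)
        while pos != -1:
            spacing = start_codon_pos - (lo + pos) - len(motif)
            hits.append((motif, spacing))
            pos = upstream.find(motif, pos + 1)
    return hits

# Hypothetical toy sequence (NOT the exercise sequence); its start codon is at position 45.
toy = "cctttacagcggcgcgtcattgatatgttgcatcaggaaacagctatgaccatgtaa"
sd_hits = find_sd(toy, 45)
print(sd_hits)
```

A motif found far from the start codon, or downstream of it, is unlikely to act as a ribosome binding site, which is why the search is restricted to a short upstream window.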

Find the promoter region:

Where would the promoter be located relative to a gene? Go to this area of the gene in the bottom window.

  • Although this genome may have different promoter consensus sequences, we will use a set of Escherichia coli consensus sequences (-35 sequence: TACACT; -10 sequence: TATGTT). Note that they are rich in T’s and A’s.

  • Must you find a promoter for each gene? Why or why not?
  • Did you find a promoter for each gene? If not, what does this suggest?
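A promoter candidate needs both elements in the right order and at a plausible distance from each other. The Python sketch below searches for the -35 and -10 sequences given above separated by a spacer; the spacer range of 15-19 nucleotides is an assumption made for this illustration, as are the function name find_promoters and the toy sequence.

```python
import re

def find_promoters(seq, minus35="TACACT", minus10="TATGTT", spacer=(15, 19)):
    """Return (pos_minus35, pos_minus10) pairs, 0-based, for each -35 element
    that is followed by a -10 element within the allowed spacer length."""
    s = seq.upper()
    pattern = re.compile(
        "(?=(%s)[ACGT]{%d,%d}(%s))" % (minus35, spacer[0], spacer[1], minus10)
    )
    return [(m.start(1), m.start(2)) for m in pattern.finditer(s)]

# Hypothetical toy sequence: -35 element, 17-nt spacer, -10 element.
toy = "GG" + "TACACT" + "GCGGCGCGTCATTGAGC" + "TATGTT" + "GG"
promoters = find_promoters(toy)
print(promoters)
```

An exact-match search like this is strict: real promoters often deviate from the consensus at one or more positions, so an empty result does not prove that a promoter is absent.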

Are these genes?

Based on the evidence you have collected so far, reflect on whether you believe the identified genes are indeed genes. Then, provide evidence to support your claim.

What is the putative function of these genes?

Now that you have finished a basic analysis of your gene nucleotide sequence, it is time to examine the putative function of the encoded product of these genes. What do these proteins do for the cell? Will you trust your prediction, or will laboratory research still be required to verify the result?

Collect the protein sequence for your genes to conduct your comparison.

Because you have the gene sequence, you could use the codon tables to read the predicted amino acid sequence for this gene. However, in Artemis, the program will do the translation for you. Click on the first gene (so it is highlighted). On the menu, select the “View” tab and click on “amino acids of selection.” This will open a separate page with the primary amino acid sequence for the highlighted gene. Highlight and copy this amino acid sequence.

Compare your protein sequence with a database of proteins whose functions have been verified.

What is the protein name that it is similar to?

What organism does this protein come from (you can find this information at the top of the alignment)?

Reflect on what this information tells you. Is it likely that your protein has a function similar to the aligned protein from the database? Please provide evidence to support your claim.

Function information

Use the protein information that you have received from your alignment to begin your research. What is the function of your protein(s)? One excellent database to help predict an E. coli protein’s function is EcoCyc. Curators of the database use the scientific literature based on experimental data to compile a metabolic and functional description of the Escherichia coli genome. Experimental data include protein structure data, enzymatic function, regulation of these gene products, and construction of metabolic pathways within the organism. Should your putative proteins match this database, much can be learned and predicted regarding its function.

Go to the EcoCyc webpage: Type the name of each putative protein you have identified (or the four-letter gene name) into the box on the right. Then, click on the quick search button. If a similar protein has been identified in the E. coli K-12 genome, it should be listed after the search under “proteins.” Next, click on the name of your protein.

What will the results tell you? First, you should see information regarding the regulation of this gene and information regarding the protein and its location and function. Research as much as you can regarding the putative proteins you have identified. Describe their putative functions here, based on the information you can find on the EcoCyc database.

Complete a functional analysis (BLAST and EcoCyc analysis) for all of your putative proteins. Please address: do these proteins have anything in common regarding their function? Might they be related?

Discuss any mechanisms of regulation that might exist for these proteins.

中文练习说明 Chinese version

Daly et al. 2021 From lignocellulose to plastics: Knowledge transfer on the degradation approaches by fungi


Daly P, Cai F, Kubicek CP, Jiang S, Grujic M, Rahimi MJ, Sheteiwy MS, Giles R, Riaz A, de Vries RP, Bayram Akcapinar G, Wei L, Druzhinina IS (2021) From lignocellulose to plastics: Knowledge transfer on the degradation approaches by fungi. Biotechnology Advances 50: 107770. https://doi.org/10.1016/j.biotechadv.2021.107770

In this review, we argue that there is much to be learned by transferring knowledge from research on lignocellulose degradation to that on plastic. Plastic waste accumulates in the environment to hazardous levels, because it is inherently recalcitrant to biological degradation. Plants evolved lignocellulose to be resistant to degradation, but with time, fungi became capable of utilising it for their nutrition. Examples of how fungal strategies to degrade lignocellulose could be insightful for plastic degradation include how fungi overcome the hydrophobicity of lignin (e.g. production of hydrophobins) and crystallinity of cellulose (e.g. oxidative approaches). In parallel, knowledge of the methods for understanding lignocellulose degradation could be insightful such as advanced microscopy, genomic and post-genomic approaches (e.g. gene expression analysis). The known limitations of biological lignocellulose degradation, such as the necessity for physiochemical pretreatments for biofuel production, can be predictive of potential restrictions of biological plastic degradation. Taking lessons from lignocellulose degradation for plastic degradation is also important for biosafety as engineered plastic-degrading fungi could also have increased plant biomass degrading capabilities. Even though plastics are significantly different from lignocellulose because they lack hydrolysable C-C or C-O bonds and therefore have higher recalcitrance, there are apparent similarities, e.g. both types of compounds are mixtures of hydrophobic polymers with amorphous and crystalline regions, and both require hydrolases and oxidoreductases for their degradation. Thus, many lessons could be learned from fungal lignocellulose degradation.

Zhao et al. 2021 At least three families of hyphosphere small secreted cysteine-rich proteins can optimize surface properties to a moderately hydrophilic state suitable for fungal attachment

Zhao et al., 2021 Hyphosphere concept

Zhao, Z., Cai, F., Gao, R., Ding, M., Jiang, S., Chen, P., Pang, G., Chenthamara, K., Shen, Q., Bayram Akcapinar, G. and Druzhinina, I.S. (2021), At least three families of hyphosphere small secreted cysteine-rich proteins can optimize surface properties to a moderately hydrophilic state suitable for fungal attachment. Environ Microbiol. https://doi.org/10.1111/1462-2920.15413

The secretomes of filamentous fungi contain a diversity of small secreted cysteine-rich proteins (SSCPs) that have a variety of properties ranging from toxicity to surface activity. Some SSCPs are recognized by other organisms as indicators of fungal presence, but their function in fungi is not fully understood. We detected a new family of fungal surface-active SSCPs (saSSCPs), here named hyphosphere proteins (HFSs). An evolutionary analysis of the HFSs in Pezizomycotina revealed a unique pattern of eight single cysteine residues (C-CXXXC-C-C-C-C-C) and a long evolutionary history of multiple gene duplications and ancient interfungal lateral gene transfers, suggesting their functional significance for fungi with different lifestyles. Interestingly, recombinantly produced saSSCPs from three families (HFSs, hydrophobins and cerato-platanins) showed convergent surface-modulating activity on glass and on poly(ethylene-terephthalate), transforming their surfaces to a moderately hydrophilic state, which significantly favoured subsequent hyphal attachment. The addition of purified saSSCPs to the tomato rhizosphere had mixed effects on hyphal attachment to roots, while all tested saSSCPs had an adverse effect on plant growth in vitro. We propose that the exceptionally high diversity of saSSCPs in Trichoderma and other fungi evolved to efficiently condition various surfaces in the hyphosphere to a fungal-beneficial state.