The subset X-0 = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC} of 20 trinucleotides occurs preferentially in frame 0 (the reading frame established by the ATG start trinucleotide) of protein (coding) genes of both prokaryotes and eukaryotes. This subset X-0 is a complementary maximal circular code with two permuted maximal circular codes X-1 and X-2 in frames 1 and 2, respectively (frame 0 shifted by one and two nucleotides, respectively, in the 5'-3' direction). X-0 is called a C-3 code (Arquès and Michel, 1997, J. Biosyst 44, 107-134). A quantitative study of these three subsets X-0, X-1 and X-2 in the three frames 0, 1 and 2 of eukaryotic protein genes shows that their occurrence frequencies are constant functions of the trinucleotide positions in the sequences. The frequencies of X-0, X-1 and X-2 in frame 0 of eukaryotic protein genes are 48.5%, 29% and 22.5%, respectively. These properties are not observed in the 5' and 3' regions of eukaryotes, where X-0, X-1 and X-2 occur with variable frequencies around the random value (1/3). Several unexpectedly observed frequency asymmetries, e.g. the frequency difference between X-1 and X-2 in frame 0, are related to a new property of the C-3 code X-0 involving substitutions. An evolutionary analytical model with three parameters (p, q, t), based on an independent mixing of the 20 codons (trinucleotides in frame 0) of X-0 with equiprobability (1/20) followed by t ≈ 4 substitutions per codon according to the proportions p ≈ 0.1, q ≈ 0.1 and r = 1 - p - q ≈ 0.8 in the three codon sites, respectively, retrieves the frequencies of X-0, X-1 and X-2 observed in the three frames of protein genes and explains these asymmetries. The complex behaviour of these analytical curves is totally unexpected and a priori difficult to imagine. Finally, the evolutionary analytical method developed could be applied to the phylogenetic tr
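The frequency measurement described in this abstract can be illustrated with a minimal Python sketch. It assumes that X-1 and X-2 are obtained from X-0 by circular permutation of each trinucleotide, and that a frequency is simply the fraction of trinucleotides read in a given frame that belong to a subset. The names `permute` and `subset_frequencies` are hypothetical helpers introduced here for illustration; they are not taken from the paper, and the normalisation used by the authors may differ.

```python
from collections import Counter

# The 20 trinucleotides of X-0 as listed in the abstract.
X0 = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)


def permute(code):
    """Circularly permute every trinucleotide by one position (AAC -> ACA)."""
    return frozenset(w[1:] + w[0] for w in code)


# Assumption: X-1 and X-2 are the one-step and two-step permutations of X-0.
X1 = permute(X0)
X2 = permute(X1)


def subset_frequencies(seq, frame=0):
    """Return the fraction of trinucleotides read in `frame` that lie in X0, X1, X2.

    Trinucleotides outside the three subsets (AAA, CCC, GGG, TTT or words with
    ambiguous letters) count in the total but in none of the subsets, so the
    three fractions need not sum to 1.
    """
    seq = seq.upper()
    # Non-overlapping trinucleotides starting at offset `frame`.
    words = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    counts = Counter()
    for w in words:
        for name, code in (("X0", X0), ("X1", X1), ("X2", X2)):
            if w in code:
                counts[name] += 1
    total = len(words)
    return {name: (counts[name] / total if total else 0.0)
            for name in ("X0", "X1", "X2")}


if __name__ == "__main__":
    # Toy sequence built from three X-0 words; not real gene data.
    toy = "GAAGACGAG"
    for frame in (0, 1, 2):
        print(frame, subset_frequencies(toy, frame))
```

On a real gene, `seq` would be the coding sequence starting at the ATG, so that `frame=0` matches the reading frame.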
A statistical analysis with 12288 autocorrelation functions applied to protein (coding) genes of prokaryotes and eukaryotes identifies three subsets of trinucleotides in their three frames: T-0 = X-0 ∪ {AAA, TTT} with X-0 = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC} in frame 0 (the reading frame established by the ATG start trinucleotide), T-1 = X-1 ∪ {CCC} in frame 1 and T-2 = X-2 ∪ {GGG} in frame 2 (frames 1 and 2 being frame 0 shifted by one and two nucleotides, respectively, to the right). These three subsets are identical in these two gene populations and have five important properties: (i) the property of maximal (20 trinucleotides) circular code for X-0 (resp. X-1, X-2), which allows frame 0 (resp. 1, 2) to be retrieved automatically in any region of the gene without a start codon; (ii) the DNA complementarity property C (e.g. C(AAC) = GTT): C(T-0) = T-0, C(T-1) = T-2 and C(T-2) = T-1, which allows the two paired reading frames of a DNA double helix to code for amino acids simultaneously; (iii) the circular permutation property P (e.g. P(AAC) = ACA): P(X-0) = X-1 and P(X-1) = X-2, implying that the two subsets X-1 and X-2 can be deduced from X-0; (iv) the rarity property, with an occurrence probability of X-0 = 6 × 10^-8; and (v) the concatenation properties in favour of an evolutionary code: a high frequency (27.5%) of misplaced trinucleotides in the shifted frames, a maximum length (13 nucleotides) of the minimal window needed to retrieve the frame automatically, and an occurrence of the four types of nucleotides in the three trinucleotide sites. In the Discussion, a simulation based on an independent mixing of the trinucleotides of T-0 retrieves the two subsets T-1 and T-2. Then, the identified subsets T-0, T-1 and T-2, rewritten in the 2-letter genetic alphabet {R, Y} (R = purine = A or G, Y = pyrimidine = C or T), retrieve the RNY model (N = R or Y) and explain previous wor
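Properties (i) to (iii) can be checked mechanically on the listed trinucleotides. The Python sketch below is only illustrative. `C` is taken to be the reverse complement, which matches the example C(AAC) = GTT, and `P` is the one-position circular permutation. The circularity test `is_circular` is an assumption on our part: it uses a graph acyclicity criterion from later literature on circular codes (each word b1b2b3 contributes the edges b1 -> b2b3 and b1b2 -> b3, and the code is circular if and only if this graph has no cycle), not the procedure of the abstract. All function names are hypothetical.

```python
# The 20 trinucleotides of X-0 as listed in the abstract.
X0 = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def C(word):
    """DNA complementarity map (reverse complement), e.g. C(AAC) = GTT."""
    return word.translate(_COMPLEMENT)[::-1]


def P(word):
    """Circular permutation by one position, e.g. P(AAC) = ACA."""
    return word[1:] + word[0]


def image(f, code):
    """Apply a word-level map to every trinucleotide of a code."""
    return frozenset(f(w) for w in code)


X1 = image(P, X0)
X2 = image(P, X1)
T0 = X0 | {"AAA", "TTT"}
T1 = X1 | {"CCC"}
T2 = X2 | {"GGG"}


def is_circular(code):
    """Assumed graph criterion: circular iff the associated graph is acyclic."""
    graph = {}
    for w in code:
        graph.setdefault(w[0], set()).add(w[1:])   # edge b1 -> b2b3
        graph.setdefault(w[:2], set()).add(w[2])   # edge b1b2 -> b3
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def has_cycle(node):
        # Depth-first search; reaching a GREY node again means a cycle.
        state[node] = GREY
        for nxt in graph.get(node, ()):
            s = state.get(nxt, WHITE)
            if s == GREY or (s == WHITE and has_cycle(nxt)):
                return True
        state[node] = BLACK
        return False

    return not any(
        state.get(n, WHITE) == WHITE and has_cycle(n) for n in list(graph)
    )


if __name__ == "__main__":
    print("C(T0) == T0:", image(C, T0) == T0)
    print("C(T1) == T2:", image(C, T1) == T2)
    print("C(T2) == T1:", image(C, T2) == T1)
    print("X0, X1, X2 disjoint:", len(X0 | X1 | X2) == 60)
    print("circular:", [is_circular(x) for x in (X0, X1, X2)])
```

A set such as {ACG, CGA} fails the same test, since ACGACG read on a circle admits two decompositions.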