Biology 442 - Human Genetics
The Human Genome, DNA, Chromosomes & Gene Structure

On June 26, 2000, there was an announcement that the first draft of the human book of life had been revealed. The resulting sequence is approximately 3.12 billion bases long and includes more than 99.9% of the human genome. Some scientists had organized a "gene pool" where, for $1, you could record a prediction of how many human genes there would be. The best guesses ranged from 30,000 to 120,000. It is now believed that there are approximately 22,000 protein coding genes, only one-third to one-fourth as many as we once believed and only twice as many as those in Caenorhabditis elegans, a nematode (round worm). The "final edition" of the Human Genome was completed in 2003, the year of the 50th anniversary of Watson and Crick's revelation of the structure of DNA in 1953.

However, even though the Human Genome Project was completed a decade ago, exactly how many genes are in the human genome is still a mystery. Genes are not in a linear sequence as was once thought. Protein coding is in "exons" and these exons can be separated by DNA sequences that don't code for proteins. Some proteins can be put together by "mix and match" so that exons can be used to produce a multiplicity of different proteins. Many of the DNA sequences which do not code for proteins play important and critical roles in the regulation and expression of other genes. We are only beginning to understand how genes and their products interact to produce the whole organism and how it functions. But we have come a long way in a rather short span of time!

The Human Genome Project

General Aspects of the Human Genome Project

The Human Genome Project, an international effort to map all the human chromosomes and also chromosomes of other organisms, began in 1990 and was projected to be completed in 2005 but was completed two years ahead of schedule, in 2003, partially because of the entry of Craig Venter in 1998. The Human Genome Project began as a publicly funded, international consortium of scientists led by Francis Collins. The funding came primarily from the National Institutes of Health and the Department of Energy and also from a British charity, the Wellcome Trust. Then in 1998, Craig Venter (who had been at NIH) announced that his new company, Celera, could do the job faster and cheaper. And he did. While much work remained to fill in the gaps, this was an amazing accomplishment and it was done in an amazingly short number of years.

Leroy Hood, a key player in the elucidation of the human genome, predicts that within 5 years we will be able to sequence each person's genome within a few hours for a reasonable cost. He foresees an era of Systems Biology whereby we will be able to prevent disease by knowledge of each person's unique genome and our accumulated knowledge of how genes are regulated and how the protein products interact within the cell. A whole new field of nanotechnology is underway and will lead to a new era in medicine which personalizes health care. Since genes encode proteins, an essential ingredient of this new systems approach to health is knowledge of the proteome, the constellation of all proteins in a cell. The understanding of gene regulation, interactions of proteins, and environmental influences are currently areas of intense research. Based on the knowledge of each person's unique genome along with the knowledge of how our proteins function, we will be able to predict what diseases each person is predisposed to and it will be possible to administer personalized "medicines" that will prevent the disorder.

Some History

It took about 75 years from the time the German scientist Friedrich Miescher first isolated nucleic acids in 1869, from pus (white blood cells) taken from bandages, for scientists to realize nucleic acids were the genetic material. Avery, MacLeod and McCarty (1944), using bacteria, and Hershey and Chase (1952), using bacteriophage, provided evidence that DNA was the genetic material. In 1953 Watson and Crick proposed their model of the structure of DNA, showing how replication, coding and mutation could be explained on the basis of that structure; they received the Nobel Prize for it in 1962. And 50 years after the model was published, the entire human genome has been sequenced! One scientist has compared this accomplishment to the 1543 publication of the first book on human anatomy. Even though that book identified almost every part of the human body, today we are still struggling to understand how many of the parts work and how they interact. So the party has only begun!

The 1970's and 80's were a time of intense gene mapping and later gene sequencing, especially after the development of recombinant DNA techniques. All of this led up to the Human Genome Project. In 2001, with 90% of the genome sequenced, there was a major progress report (see photo at the beginning of this page). The gaps were filled in time for the 50th anniversary of Watson and Crick's discovery of the DNA double helix. At that time 99% of the gene containing regions had been sequenced to an accuracy of 99.999%. Everyone was very surprised when it was revealed that the Human Genome contained only 25,000-35,000 genes, only about twice as many as some insects and worms!

Only a very small amount of our DNA is responsible for the differences among humans, indeed among all organisms. The genome is approximately 99.9% the same between individuals of all nationalities and backgrounds.


Over 50% of the human genome shows a high degree of sequence similarity to genes in other organisms. Not only is there no correlation between the amount of DNA and the complexity of the organism, most of our DNA is said to be "junk" DNA. What is meant by "junk" is that it does not code for proteins, the molecules that control all chemical reactions and form most intracellular and extracellular structures. We know some of the functions of this non-coding DNA and undoubtedly we will continue to find out more about its function as time goes on. Only about 1% of our DNA codes for proteins. The vast majority of our DNA is non-protein-coding, and repetitive DNA sequences account for at least 50% of the non coding DNA. The genome contains approximately 20,000-25,000 protein coding genes. Many human genes are capable of making more than one protein, allowing human cells to make perhaps 80,000-100,000 proteins from only 20,000-25,000 genes.

It is amazing to note that DNA was only found to be the genetic material in 1944 when Avery, MacLeod and McCarty discovered that DNA isolated from encapsulated pneumococci could be used to restore the capsule making ability to mutant pneumococci not able to make a capsule. No other molecule (not protein, RNA, carbohydrates) from the encapsulated bacteria could restore this ability. Only a few years later, in 1953, Watson and Crick announced that they had uncovered the structure of DNA as a double stranded helix. Their model could explain coding, mutation, and replication. It was not very long before the entire set of 64 codons and their corresponding amino acids was deciphered.
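The figure of 64 codons follows from simple counting: each codon is an ordered triplet drawn from four bases, so there are 4 x 4 x 4 possibilities. A minimal sketch of that count is below; the variable names are illustrative, and the three stop codons listed are those of the standard genetic code.

```python
from itertools import product

BASES = "UCAG"  # the four RNA bases

# Every ordered triplet of bases is one codon: 4 * 4 * 4 = 64.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# The three stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}
sense_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons))        # 64
print(len(sense_codons))  # 61
```

Since 61 codons specify only 20 amino acids, most amino acids are encoded by more than one codon.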

Similarities between groups of organisms

Many genes and gene alignments have been found to be common among organisms. In spite of the variation in chromosome number there are a number of genetic and physical linkages between single-copy genes that are remarkably conserved amid a background of very rapidly evolving repetitive DNA sequences. The term "orthologs" refers to similar genes in different species that encode proteins with the same function. They originated from the same gene in a common ancestor. "Paralogs" are genes/loci that are homologous to other genes in the same species. They are likely to have originated from a common ancestral gene; the alpha and beta globin genes are an example. In many cases these conserved genes and regions can be identified in humans. There is now a classification system based on orthologous relationships between genes which appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.

Obviously, the genes of more closely related species are more similar. Human DNA is 99.9% identical between individuals. We share genes and gene alignments closely with the other primates, the chimpanzees, orangutans, and gorillas, but we also share genes with bacteria and yeast as well as other more primitive organisms. Evolution is a tinkerer: the same basic material is used over and over and modified for new cellular functions.

Genomes of many other "model" organisms have been and are being sequenced including bacteria, yeast, roundworms, fruit flies, fish, mice, dog, and more. They are important because we can do sophisticated genetic studies on them that we cannot ethically do with humans such as specific matings, and inserting or removing genes. Important genes are highly conserved from species to species. Over 50% of the human genome shows a high degree of sequence similarity to genes in other organisms. Genes with sequence similarity are called homologs (orthologs and paralogs are types of homologous genes). For example, the "obese" (OB) gene which produces the protein, leptin, that affects brain cells to suppress appetite and stimulate metabolism of food was first discovered in mice. Mice deficient in leptin become obese.

The results of comparative genomics are proving to be interesting in the elucidation of gene evolution and the similarities and differences among organisms. We share about 96% of our genes with chimpanzees, 80% with mice, 75% with dogs, 50% with the fruit fly, Drosophila, 40% with roundworms (nematodes) and 30% with yeast. We even have about 100 genes in common with many bacteria. This is all evidence of our evolutionary past. Those genes we have in common with other organisms must be very important. Mutated genes in humans also cause disease in fruit flies. Of the mutated genes in 289 human disease conditions 61% are found in the fruit fly. They include genes involved in prostate cancer, pancreatic cancer, cardiac disease, cystic fibrosis, leukemia and others. What defines us genetically is the complexity of how our genes are used and how these genes interact with one another to carry out the myriad of functions and give us the unique characteristics of humans. New discoveries occur on a daily basis....keep your ears and eyes open!

Browse a Genome

There are coding and non coding sequences in nuclear DNA.

Twenty-five to thirty percent of our genome is unique or single copy DNA that includes the genes that code for proteins. "Real" genes that code for proteins have both coding (exons, start and stop codons) and non coding DNA regions (promoters, introns, RNA processing signals). We know that many of these non coding regions are functionally necessary. The coding regions of genes are only a small proportion of the single copy DNA sequences since eukaryotic genes have introns and other non coding regions. There are also non coding regions between genes.

Much of the non coding DNA is highly or moderately repetitive. Repetitive DNA can be dispersed or tandemly arranged. For much of this DNA the functions have not been completely established. Tandemly repeated DNA is often called "satellite DNA" because when centrifuged in a density gradient this DNA forms bands separate from the bulk of genomic DNA. There are three satellite bands (although only one shows up on the figure below). The three classes of tandem repeats are classic satellite DNA (100-6500 bp repeats), minisatellite DNA (20-100 bp repeats), and microsatellite DNA (2-10 bp repeats, for example (CA)n).

Cesium chloride density gradient showing the main band of DNA and the satellite (repetitive) DNA

One type of repetitive DNA codes for rRNA and tRNA, which form gene clusters. The rRNA genes in humans are tandemly arranged on the p arms of the five acrocentric chromosomes of the D and G groups (13, 14, 15, 21, 22). These regions are referred to as the NORs, or nucleolar organizing regions, and they form the nucleolus of the interphase cell. The nucleolus has a fibrous portion, which is open rDNA being transcribed into rRNA, and a granular region where ribosomes are being assembled (the ribosomal proteins are made in the cytoplasm and must be transported back into the nucleus).


Pseudogenes are related to functional genes but are no longer capable of being transcribed or translated. One type of pseudogene arose from duplications of genes which then accumulated disabling errors, such as a nonsense mutation or mutations in the promoter, rendering them untranscribable or untranslatable. Another type of pseudogene is referred to as a retro pseudogene because it arose by reinsertion of a cDNA made by a reverse transcriptase using an mRNA template. This type contains no introns, since these were spliced out of the original RNA transcript, and no promoter region, since the promoter is not part of the mRNA. Without a promoter, they cannot be transcribed.

Two thirds of the repetitive non coding DNA sequences are more complex repeated sequences dispersed or scattered throughout the genome. These can be further divided into short and long interspersed sequences, SINES and LINES. They are mobile elements within our genome.

LINES are up to 7000 bp in length and represent about 4% of our total human genome. LINES contain a transposable element which makes an RNA coding for reverse transcriptase. The reverse transcriptase can make cDNA from RNA, and the cDNA can reintegrate into another site. LINES are found in the dark G bands of banded chromosomes. G bands are rich in AT and, therefore, have fewer CpG islands (CpG islands are common in "housekeeping gene" promoters) and fewer genes. The Y chromosome has more LINES than the X and the X has more than the autosomes. There is less meiotic recombination where there are LINE elements.

SINES are shorter interspersed elements 90 to 500 bp in length. One well known SINE is the Alu sequence, which is about 300 bp in length. Alu sequences are restricted to primates and are the most frequent human SINE (approximately 5 x 10^5 copies, 3 - 6% of the total human genome). These SINES are named for the restriction enzyme AluI, which cuts at AGCT, a site commonly found in the repeat. AluI is named for the bacterium, Arthrobacter luteus, from which the restriction enzyme comes.
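To make the idea of a restriction site concrete, here is a minimal sketch that scans a DNA string for the AGCT site and cuts it the way AluI does (between the G and the C, leaving blunt ends). The function names and the example sequence are made up for illustration and are not taken from any real Alu element.

```python
def find_sites(sequence, site="AGCT"):
    """Return 0-based start positions of every occurrence of `site`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def digest(sequence, site="AGCT", cut_offset=2):
    """Cut `sequence` after `cut_offset` bases of each site (AG^CT)."""
    sequence = sequence.upper()
    fragments = []
    previous = 0
    for pos in find_sites(sequence, site):
        cut = pos + cut_offset
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])
    return fragments


example = "GGAGCTTTACCAGCTGG"  # hypothetical sequence with two AGCT sites
print(find_sites(example))  # [2, 11]
print(digest(example))      # ['GGAG', 'CTTTACCAG', 'CTGG']
```

The sketch looks at one strand only; because AGCT reads the same on both strands, a site found on one strand is also a site on the other.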

SINES can be found in introns, exons and extragenic sequences


These transposable elements can be a source of mutation. For example, in some hemophilia A patients there is an insertion of an L1 sequence into an exon. One patient with NF1 (neurofibromatosis type 1) had an Alu insertion that inactivated the normal allele. These repetitive sequences can play a role in rearrangements and gene duplication (e.g., beta globin genes). Repetitive sequences are commonly found near sites of deletions and duplications.

LTR (long terminal repeat) retro transposons make up a large fraction of the typical mammalian genome. They comprise about 8% of the human genome. On account of their abundance, LTR retro transposons are believed to hold major significance for genome structure and function. Long terminal repeats (LTR's) are, as the name suggests, long repeating sequences of DNA several hundred nucleotides long found at either side of pro viral DNA. LTR's are believed to play some role in the integration of viral DNA into the host genome, as they are found on retroviruses and transposons.


Mitochondrial DNA is a single double stranded circular molecule. There are several copies in each mitochondrion and there are many mitochondria in each of your cells. Mitochondria originated by endosymbiosis of a prokaryotic cell early in the evolution of eukaryotic cells. Mitochondrial DNA is similar to prokaryotic DNA. There are no histones associated with mt DNA. The genes contain no introns. Because it is in a highly oxidizing environment it has a much higher rate of mutations than nuclear DNA. The genes in mt DNA code for mitochondrial ribosomal RNAs and transfer RNAs. Some genes code for polypeptide subunits of the electron transport chain common to all mitochondria. Mitochondrial DNA relies on nuclear gene products for replication and transcription.



Genes may be unique sequences or belong to a gene family such as the globins, actins, myosins, histones, tubulins that are repetitive. Gene families refer to genes with similar DNA sequences which arose through duplication of an ancestral gene followed by generations of mutations. Gene families may be close to one another in clusters or they may be dispersed, they may form a cluster on the same chromosome or they may be located on different chromosomes. Hemoglobin is a tetramer composed of four peptide chains, two alpha and two beta globins. The alpha globin gene cluster is on human chromosome 16 and the related beta globin cluster is on chromosome 11. Examples of gene families include rDNAs, tDNAs, the histone genes, P450 enzyme superfamily, hemoglobin genes, actin genes. Pseudogenes may be part of a gene cluster or family. These gene duplicates are now evolutionary relics.

Paralogs are the result of a gene duplication event arising after speciation. Genes in two species that have directly evolved from a single gene in the last common ancestor are called orthologs.


The classic macro satellite DNA has repeats of 100 to 6500 bp. This category includes tandemly repeated satellite DNA from the centromeric repeats (171 bp) unique to each chromosome and the telomeric repeats. The centromeric repeats are referred to as alpha satellite DNA and each chromosome has its unique sequence. Therefore, it is possible to make DNA probes specific to each of our 24 different chromosomes. When a fluorescent label is added to the probe, it is possible to count the number of each type of chromosome even in an interphase cell. Therefore, it is possible to check for trisomies in interphase amniotic fluid cells prior to culturing them for karyotyping.

Another type of tandemly repetitive DNA is referred to as mini satellite sequences or VNTRs (variable number of tandem repeats). They are composed of 20 to 100 bp repeats. The third type of tandemly repetitive DNA is referred to as micro satellite sequences or STRs (short tandem repeats), composed of 2 to 10 bp repeats. Since the number of repeats in micro and mini satellites is highly variable (polymorphic), they are very useful in gene mapping and in DNA profiling for paternity testing, forensic testing, confirmation of relatedness and identification of human remains. Both VNTRs and STRs are polymorphisms in non coding regions and are inherited in a codominant pattern. They are formed by mutations which add or remove repeats. Most individuals in the population are heterozygous at each of these loci. There is hyper variable mini satellite DNA preferentially close to telomeres, which can cause misalignment that results in deletion and duplication mutations.
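The way an STR locus is scored for DNA profiling can be sketched in a few lines: count how many times the repeat unit occurs back to back in each allele, and report the pair of counts as the genotype. The code below is an illustrative sketch only; the function names, the flanking sequences, and the (CA)n alleles are invented for the example.

```python
import re


def longest_repeat_run(sequence, unit):
    """Return the largest number of back-to-back copies of `unit`."""
    pattern = re.compile("(?:" + re.escape(unit) + ")+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


def genotype(allele_a, allele_b, unit):
    """Repeat counts for the two alleles at one locus, smaller first."""
    counts = sorted(longest_repeat_run(s, unit) for s in (allele_a, allele_b))
    return tuple(counts)


# Hypothetical alleles at one (CA)n microsatellite locus.
maternal = "TTG" + "CA" * 12 + "GGT"
paternal = "TTG" + "CA" * 15 + "GGT"

locus = genotype(maternal, paternal, "CA")
print(locus)                # (12, 15)
print(locus[0] != locus[1]) # True: this individual is heterozygous
```

Because both repeat counts can be read from the same individual, the two alleles are codominant, which is what makes these loci informative for paternity and forensic comparisons.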


Essential Conserved Non Coding DNA Sequences

Many DNA sequences that do not code for proteins are nevertheless essential and their sequence must be conserved in order for them to serve their function. These DNA sequences include promoter sites that bind RNA polymerases, regulatory elements (enhancers, silencers, and locus control regions LCRs) that bind regulatory proteins, the origin of replication sites that bind the DNA replication complex, the centromeric DNA, the telomere DNA, and many others.


Gene Structure

The definition of a gene has evolved over time. It is no longer a "bead" on a string nor is it merely a sequence of bases that codes for amino acids in a single polypeptide chain. While the Beadle and Tatum model of "one gene, one enzyme" is enticingly simple, we have had to move on to acknowledge that genes are far more complex. There are both coding and non coding regions, including untranslated regions (UTRs), in the DNA associated with genes. The non coding regions, as mentioned earlier, include promoters, transcriptional regulatory sequences, introns and polyadenylation signals. Post transcriptional processes that modify the initial RNA transcript usually include 5' cap addition, 3' poly A addition, splicing out of introns and, sometimes, alternative splicing of exons to form different mRNAs from the same gene. Introns are spliced out of transcribed RNA by a large RNA protein complex, the spliceosome. Post translational cleavage of proteins, while rare, can also occur, as in the case of insulin and some hormones. The use of alternative promoters is common and is used to generate cell type specific mRNAs. These alternative promoters may be found within introns of the gene. The human dystrophin (DMD) gene, which has 79 exons, has at least eight different alternative promoters! In humans, the vast majority of genes are transcribed individually and, in these cases, the terms gene and transcription unit are essentially equivalent. The usual linear order at a gene site is:
1. regulatory element(s) (enhancers or silencers, where regulatory proteins bind);
2. promoter region (where the RNA polymerase complex binds);
3. transcription start site (the start of the 5' UTR), including the CAP site;
4. ATG, the translation initiation codon;
5. exon(s) (variable number), separated by introns (variable number, each beginning with 5' GT and ending with 3' AG);
6. translation stop codon (TAA, TGA, or TAG), followed by the 3' UTR;
7. AATAAA polyadenylation signal; and
8. the site for addition of the poly A tail.
Some genes have alternate splice sites so that several different proteins can be produced from the multiple mRNAs that are produced from the same gene.
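The landmarks in the linear order described above can be located in a sequence by simple string searches. The following sketch, written under simplifying assumptions, takes an intron-free coding-strand sequence (a cDNA), finds the first ATG, reads codons until the first in-frame stop codon, and then looks downstream for AATAAA. The function name, the returned dictionary keys, and the toy sequence are all hypothetical; real gene finding is far more involved.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}
POLY_A_SIGNAL = "AATAAA"


def annotate_cdna(cdna):
    """Locate the first ATG, its in-frame stop codon, and a downstream AATAAA.

    Returns a dict of 0-based positions, or None if no complete reading
    frame is found. `cdna` is an intron-free, coding-strand DNA sequence.
    """
    start = cdna.find("ATG")
    if start == -1:
        return None
    # Step through the sequence one codon at a time, in frame with the ATG.
    for i in range(start + 3, len(cdna) - 2, 3):
        if cdna[i:i + 3] in STOP_CODONS:
            return {
                "five_prime_utr": (0, start),
                "coding": (start, i + 3),  # half-open range, includes the stop codon
                "poly_a_signal": cdna.find(POLY_A_SIGNAL, i + 3),  # -1 if absent
            }
    return None


# Made-up sequence: 5' UTR, a four-codon reading frame plus stop, then a 3' UTR.
toy = "GGCACC" + "ATGGCTTGGAAATAG" + "CCTTAATAAAGC"
print(annotate_cdna(toy))
# {'five_prime_utr': (0, 6), 'coding': (6, 21), 'poly_a_signal': 25}
```

Everything between the end of the coding range and the poly A site is 3' UTR, which is why the polyadenylation signal is searched for only after the stop codon.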

Regulatory genes code for transcription factors. These proteins may interact directly with DNA or with other transcription factors to work as a complex. Their purpose is to turn genes on or off or to control the rate of mRNA production. The DNA binding proteins often have similar DNA binding domains within them.



DNA binding motifs in regulatory proteins
3 alpha helices forming an L shape
DNA is unwound with a widened shallow minor groove
The bound SRY causes an 80° bend
Binds to specific sequence of bases A/TAACAAT/A
3-D NMR picture of SRY bound to DNA

Alternative Splicing
Exon skipping and splicing of mRNA to make more than one protein from a single gene

The ability to make more than one gene product (polypeptide) from a single gene explains in part how we can have many more gene products than the number of genes sequenced in the Human Genome Project. Above is an illustration of how the CGRP gene can make three different products. Of course the genes that code for antibodies have long been known to "cut and paste" to form the very large number of immunoglobulins that we are capable of making.
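Exon skipping can be illustrated with a short sketch that builds every transcript obtainable by keeping the first and last exon and including any subset of the internal exons, in their original order. This is a simplification: it assumes every internal exon can be skipped independently, which is not true of real genes, and the exon sequences and function name are made up.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """All transcripts that keep the first and last exon and any subset of
    the internal exons, preserving exon order. Needs at least two exons."""
    first, *internal, last = exons
    isoforms = []
    for k in range(len(internal) + 1):
        for subset in combinations(internal, k):
            isoforms.append(first + "".join(subset) + last)
    return isoforms


# Hypothetical four-exon gene.
exons = ["ATGGCC", "GATTAC", "CCCGGG", "TGGTAA"]
isoforms = exon_skipping_isoforms(exons)

print(len(isoforms))  # 4
print(isoforms[0])    # ATGGCCTGGTAA (both internal exons skipped)
print(isoforms[-1])   # ATGGCCGATTACCCCGGGTGGTAA (all exons kept)
```

Under this assumption a gene with n internal exons could give up to 2^n different mRNAs, which shows how a modest number of genes can yield a much larger number of proteins.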

Some genes do not code for proteins

The genes that code for ribosomal RNAs and for the transfer RNAs are not translated. Also the XIST (X inactivation specific transcript) gene codes for an RNA that does not code for a protein. It makes an RNA that interacts with the X chromosome sequences that are inactivated in the second X chromosome of the female. There are three different RNA polymerases, Pol I that transcribes ribosomal RNA genes, Pol II that transcribes protein coding genes and snRNA, and Pol III that transcribes tRNAs. They each have specific promoter regions to which they bind to begin transcription. Pol III has a promoter site within the tRNA gene.

The classical view of a gene has been greatly altered. We now know that a single region of DNA can be transcribed in a variety of ways to produce many different RNAs, some coding for proteins and others constituting regulatory RNAs. We have known for some time that protein coding regions can overlap and that they can be read in both directions. DNA sequences can produce a variety of RNA transcripts used for multiple functions. A new conception of the genome shifts the focus from genes to transcripts: away from the protein coding regions and toward the variety of functional RNA transcripts, only some of which are mRNAs, including the new classes of functional RNAs as they are discovered.

It is interesting to note that many mutations in DNA that are correlated with diseases (breast, prostate, and lung cancers, autism, schizophrenia) have been found to be in regions of DNA that do not code for proteins. At this time the specific function of many of these DNA regions is unknown.

Formerly known as "Junk" DNA

The biggest surprise of decoding the human genome was how few protein coding genes it contained. A surprisingly large amount of DNA has been found to not code for proteins. This "extra" DNA had been thought to be nonfunctional and was initially called "junk DNA." Simpler organisms have less of this "junk" DNA. It is now thought that this "non-coding" or "junk" DNA plays an important role in gene regulation and the increasing complexity of organisms.

We now know that 98% of the genome is transcribed into RNAs, and scientists are recognizing that the non-coding RNAs play important roles in just about everything the cell does. These RNAs include snRNAs (small nuclear RNAs) and snoRNAs (small nucleolar RNAs), both of which are located within the nucleus. There are also miRNAs (microRNAs) and siRNAs (small interfering RNAs), which can modify the activity of protein-coding genes. So in addition to the regulatory proteins known to influence the activity of other genes, we now know that there are many different RNAs that play a regulatory role. In fact, in recent years scientists have been studying an extended family of "non coding" RNA transcripts that play crucial roles in cellular information control, determining what proteins are made, their conformation and where or when they are made. They are also emerging as key to how the human brain functions.


Highlights from the Human Genome Project

1. The human genome consists of ~3.1 billion base pairs; 2.85 billion have been fully sequenced
2. The genome is ~99.9% the same between individuals of all nationalities and backgrounds.
3. Less than 2% of the genome codes for proteins
4. The vast majority of our DNA is non-protein-coding, and repetitive DNA sequences account for at least 50% of the non coding DNA
5. The genome contains ~20,000-25,000 protein coding genes.
6. Many human genes are capable of making more than one protein, allowing human cells to make perhaps 80,000 to 100,000 proteins from only 20,000-25,000 genes
7. Functions for over half of all human genes are unknown
8. Chromosome 1 contains the highest number of genes. The Y chromosome contains the fewest genes
9. Over 50% of the human genome shows a high degree of sequence similarity to genes in other organisms
10. Thousands of human disease genes have been identified and mapped to their chromosomal locations

Web Sites:

The Search for the Genetic Material

Human Genome Project Information Site



Chromosomes (colored bodies) had been seen in the light microscope in the nineteenth century and had been identified as units of heredity. However, not until 1956 was the correct number of 46 for the human chromosomes known. It was a serendipitous laboratory accident using hypotonic saline which swelled the cells, thereby allowing the chromosomes to separate sufficiently to get an accurate count. At first the chromosomes were only identified by relative length and the position of their centromeres. By these criteria, they were separated into 7 groups, A, B, C, D, E, F, G plus the X and Y sex chromosomes. Then in the 1960's staining techniques which produced banding were introduced. This G-banding (Giemsa) made the identification of each chromosome much easier and it also allowed us to see more subtle chromosome structural changes. While the basis of G-banding is still not known, we do know that the darker bands are AT rich, contain fewer genes, and replicate late in the S phase of the cell cycle while the lighter bands are GC rich, are gene rich, and replicate early in the S phase. In 1971 at a conference in Paris, scientists got together to draw up a system for numbering the bands and sub bands. The result is the Paris Conference ideogram. In making the assignments, however, they did make one mistake. Chromosome 22 has more DNA than chromosome 21 and thus the numbers should have been reversed. Since an extra chromosome 21 was already associated with Down syndrome, there was a decision not to change the numbering.

Paris Conference ideogram.



In humans, all of our 3 billion base pairs and approximately 20,000-25,000 protein coding genes are compacted and packaged into 23 pairs of chromosomes and 24 linkage groups. In most animals and plants that reproduce sexually, chromosomes come in pairs with one member of each pair from each of the two parents. Each eukaryotic chromosome is composed of a single molecule of double stranded DNA, 5 different histones, and some other non histone proteins.

The basic eukaryotic chromosome structure consists of DNA wrapped around the evolutionarily conserved histones. There are five types of histones: H1, H2A, H2B, H3 and H4. Approximately two turns of DNA wrap around an octamer composed of two molecules each of H2A, H2B, H3 and H4. Histone H1 binds in the region where the DNA enters and exits the nucleosome, presumably stabilizing the DNA at this point. The histones contain a large number of basic amino acids (lysine and arginine) which carry a positive charge and which dampen the negative charge of the phosphate groups on the DNA molecule. Each nucleosome unit includes approximately 200 base pairs of DNA, with about 146 of them wrapped around the octamer of histones. Although nucleosomes must be opened before transcription can occur, the number of bases in a nucleosome is far smaller than the number required for a gene. Because they are essential to the structure of chromosomes, histones must be synthesized as the DNA is replicated during the S period of the cell cycle.
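The figures in this paragraph allow a rough estimate of how many nucleosomes a cell must assemble. The arithmetic below is only a back-of-the-envelope sketch: it assumes exactly 200 bp per nucleosome across the whole genome and uses the round figure of 3 billion base pairs for the haploid genome, and the variable names are illustrative.

```python
HAPLOID_GENOME_BP = 3_000_000_000  # "3 billion base pairs", a round figure
BP_PER_NUCLEOSOME = 200            # wrapped DNA plus linker DNA
BP_ON_OCTAMER = 146                # DNA wrapped around the histone octamer
HISTONES_PER_OCTAMER = 8           # two each of H2A, H2B, H3 and H4

nucleosomes_haploid = HAPLOID_GENOME_BP // BP_PER_NUCLEOSOME
nucleosomes_diploid = 2 * nucleosomes_haploid
core_histones_diploid = nucleosomes_diploid * HISTONES_PER_OCTAMER
linker_bp = BP_PER_NUCLEOSOME - BP_ON_OCTAMER

print(nucleosomes_haploid)    # 15000000
print(nucleosomes_diploid)    # 30000000
print(core_histones_diploid)  # 240000000
print(linker_bp)              # 54
```

On these assumptions a diploid cell needs on the order of 30 million nucleosomes, and a comparable number again each time it replicates its DNA, which is why histone synthesis is tied to the S period.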

Centromeres are specialized regions within chromosomes that play a critical role in the accurate segregation of duplicated chromosomes during cell division. They are the site of kinetochore attachment necessary for spindle attachment. Centromere nucleosomes contain an alternative histone, CenH3, which is thought to define centromere identity and participate in mitotic mechanics. A biochemical and biophysical analysis of centromere nucleosomes in Drosophila nuclei revealed that CenH3 appears in a heterotypic tetrameric half-sized nucleosome, with one copy each of CenH3, H2A, H2B, and H4. These tetramers protect less DNA [~120 base pairs (bp)] than the typical octamers (~150 bp) and do not seem to form higher-order structures as regular as those of the octamer, yielding longer and more variable DNA linker lengths. This looser chromatin conformation, embedded within heterochromatin, may be critical for tethering the kinetochore to the centromere (Dalal et al., PLoS Biol. 5, e218 (2007)).

Cells use various chemical modifications to histones to alternately expose or sequester genes, thus turning them on or off. There is a general correlation between patterns of histone "decoration" and gene activity. In particular, parts of chromosomes in which histones are covered with acetyl groups tend to have transcriptionally active genes. Deacetylated histones tend to harbor inactive genes. DNA near methylated histones is generally shut down. Each histone has a "tail," a flexible string of amino acids extending from the DNA-wrapped nucleosome. Acetyl and methyl groups tend to attach to particular amino acids in the tails. The "tails" have evolutionarily conserved sequences, implying they are important. The cell is known to have histone acetylases and deacetylases which are implicated in turning genes on and off. New histone-tail decorations beyond methylation and acetylation have now been identified.

Sugars or small proteins including ubiquitin can also mark histones. Some scientists believe that modified histone tails act as sites for the binding of other proteins that influence the accessibility of DNA for gene activity. One such protein is heterochromatin protein 1, a molecule known to mediate the silencing of genes. It binds to the amino acid lysine on the tail of histone H3 only if methyl groups adorn the lysine. Histone methylation, unlike acetylation, appears to be stable and transferred during mitosis. Since methylation is less transient, it appears to be more involved in the long-term setting of the cell's genome (in differentiated cells), while acetyl groups frequently hop on and off histone tails. Phosphate groups attached to several amino acids on the tail of H3 indicate the cell is dividing, while a phosphate group on a particular serine on the tail of H2B signals that the cell is about to commit suicide (apoptosis).

Late in spermatogenesis, cysteine-rich protamines replace the histones. They allow a higher degree of compaction of the DNA in the sperm.

A functional eukaryotic chromosome must contain the following essential components:
1. a centromere, which contains satellite DNA unique to each chromosome, and the kinetochore, a protein structure to which the spindle fibers attach;
2. a telomere at each end of the chromosome, containing tandem repeats of TTAGGG (3 - 20 kb), a special repetitive DNA necessary to prevent shortening of the chromosome through the numerous rounds of replication. Hypervariable minisatellite DNA is found preferentially close to telomeres; and
3. origins of replication, consensus DNA sequences which bind the various proteins and enzymes required for replication.

Each chromosome contains only a single molecule of double-stranded DNA.
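The tandem-repeat structure of telomeric DNA can be illustrated with a short sketch. At 6 bp per TTAGGG unit, 3 - 20 kb corresponds to roughly 500 to 3,300 copies in a row. The function name and the toy sequence below are invented for illustration; real telomere length is measured in the laboratory, not by a string search like this.

```python
import re

TELOMERE_UNIT = "TTAGGG"  # human telomeric repeat unit, as given in the text


def longest_tandem_run(seq, unit=TELOMERE_UNIT):
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    # (?:TTAGGG)+ matches one or more copies with nothing in between.
    pattern = re.compile("(?:" + re.escape(unit) + ")+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)


# Toy sequence (hypothetical): 4 repeats, a spacer, then 2 more repeats.
toy = "ACGT" + "TTAGGG" * 4 + "CCATG" + "TTAGGG" * 2
print(longest_tandem_run(toy))  # longest uninterrupted run of repeats
```

For the toy sequence, the longest uninterrupted run is the block of four repeats; the second block of two is a separate, shorter run.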


Homologous chromosomes are the pairs of chromosomes received one from each parent. They carry the same genes in the same linear order, although the two copies of a gene may be different alleles. Sister chromatids are exact replicas of one chromosome, synthesized during the S phase of the cell cycle. They remain connected to one another at the centromere region until they separate at anaphase.



Each gene occupies a specific locus (plural, loci). The locus is the gene's "address." Alternative forms of a gene found at the same locus on homologous chromosomes are called alleles (short for allelomorphs). Alleles arise by mutation; when more than one allele is common in the population, the locus is said to be polymorphic.

As a budding human geneticist, you should understand that the only genetic disorders that can be detected by looking at chromosomes (karyotyping) are abnormalities involving changes in the number or structure of chromosomes. These include disorders such as trisomy 21 (Down syndrome) and structural rearrangements such as translocations, additions, and deletions. Some microdeletions can be detected by a procedure known as FISH (fluorescence in situ hybridization). Single gene defects cannot be detected by karyotyping an individual.

Even after a gene has been identified for a genetic disorder, we may not be able to tell if a person or fetus has a mutation in that gene. An example of this is Marfan syndrome, which is often due to a new spontaneous mutation that can occur randomly anywhere within the fibrillin-1 gene (FBN1). Detection of genetic disorders is possible only when the (common) mutations within the responsible gene have been identified. It is possible to detect the sickle cell mutation because the same single base change causes the disorder in every case. Although sequencing of genes to find mutations is becoming more common, it is still expensive, and it may not be possible to distinguish a normal polymorphism from a harmful mutation. We will discuss this allelic heterogeneity frequently as we proceed with the course.
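The contrast between a single known mutation and mutations scattered through a gene can be sketched in a few lines. Testing for sickle cell amounts to asking one question about one position (the GAG to GTG change that replaces glutamic acid with valine in beta-globin), whereas a disorder with allelic heterogeneity requires comparing the whole sequence against a reference. The 9-base fragment and the function names below are illustrative assumptions, not a clinical assay.

```python
def point_differences(reference, sample):
    """List (index, reference_base, sample_base) wherever two sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length for this simple comparison")
    pairs = zip(reference.upper(), sample.upper())
    return [(i, r, s) for i, (r, s) in enumerate(pairs) if r != s]


def has_known_variant(sample, position, alt_base):
    """Targeted test: is the known disease base present at the known position?"""
    return sample[position].upper() == alt_base


# Toy fragment (illustrative only): three codons, CCT GAG GAG.
reference = "CCTGAGGAG"
# Same fragment carrying the sickle cell change GAG -> GTG in the middle codon.
sample = "CCTGTGGAG"

print(has_known_variant(sample, 4, "T"))     # targeted test at one known site
print(point_differences(reference, sample))  # whole-sequence comparison
```

The targeted test needs only the position and the altered base. The whole-sequence comparison reports every difference, but it cannot say whether a given difference is a harmless polymorphism or a harmful mutation.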

DNA Methylation

In eukaryotes, the most abundant covalent modification of DNA is methylation of cytosine residues at carbon 5 of the pyrimidine ring. This modification occurs primarily in the context of a simple sequence (5'-CG-3') called a CpG dinucleotide and affects both strands of DNA; regions with an unusually high density of CpG dinucleotides are called CpG islands. CpG methylation serves to increase the coding capacity of the genome: in essence, 5-methylcytosine serves as a 'fifth base' in DNA. Regions of the genome with high levels of methylated CpG dinucleotides include the inactive X chromosome in female mammals, imprinted genes, and transposons and their relics, all of which are associated with stable transcriptional repression. How do cells read this information and convert it into a functional state? An association between the methyl CpG binding protein MeCP2 and human Brahma (Brm), an ATPase subunit of the human SWI/SNF complex involved in chromatin remodeling, has been identified. These findings establish a link between DNA methylation and chromatin structure and provide a new perspective on the mechanism of methylation-dependent gene regulation.
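Methylation at a 5'-CG-3' site can affect both strands because the site is its own reverse complement: wherever CG occurs on one strand, CG also occurs on the other strand read 5' to 3'. The following is a minimal sketch of that symmetry. The function names and the toy sequence are invented for illustration.

```python
from typing import List

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def cpg_positions(seq: str) -> List[int]:
    """Return the 0-based index of the C in every 5'-CG-3' dinucleotide."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i:i + 2] == "CG"]


toy = "ATCGGACGTTCGA"  # hypothetical sequence containing three CpG sites
top = cpg_positions(toy)
bottom = cpg_positions(reverse_complement(toy))

# A CpG whose C is at index i on the top strand appears on the bottom
# strand with its C at index len(seq) - 2 - i.
mirrored = sorted(len(toy) - 2 - i for i in top)

print(top)
print(bottom)
print(bottom == mirrored)
```

Each CpG on the top strand has a partner CpG on the bottom strand at the mirrored position, so a methyl group can be carried on the cytosine of each strand at the same site.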

The information provided by CpG methylation in eukaryotic cells is interpreted, in most cases, by a conserved family of proteins that can interact specifically with methylated CpG dinucleotides. This methyl CpG binding domain (MBD) family of proteins is present in most eukaryotic organisms (a notable exception being yeast, which does not methylate DNA), and its interaction with methylated DNA has been rigorously characterized.




DNA in eukaryotes is packaged into nucleosomes which consist of DNA wrapped around histone proteins. Covalent modification of histones plays a critical regulatory role in controlling transcription, replication, and repair. Different histone modifications are recognized by different protein modules found in regulatory complexes with different, even antagonistic functions.