Biology 442 - Human Genetics
Genome, DNA, Chromosomes & Gene Structure
On June 26, 2000, it was announced that the first draft of the
human "book of life" had been completed. The resulting sequence is approximately
3.12 billion bases long and includes more than 99.9% of the human genome.
Some scientists had organized a "gene pool" where, for $1, you
could record a prediction of how many human genes there would be. The
best guesses ranged from 30,000 to 120,000. It is now believed that there
are approximately 22,000 protein coding genes, only one-third to one-fourth
as many as we once believed and only twice as many as those in Caenorhabditis
elegans, a nematode (round worm). The "final edition" of
the Human Genome was completed in 2003, the year of the 50th anniversary
of Watson and Crick's revelation of the structure of DNA in 1953.
However, even though the Human Genome Project was completed a decade
ago, exactly how many genes are in the human genome is still a mystery.
Genes are not simple continuous stretches of coding sequence, as was once
thought. Protein coding sequence lies in "exons," and these exons can be
separated by DNA sequences that don't code for proteins (introns). Some
proteins are put together by "mix and match," so that the exons of one gene
can be used to produce a multiplicity of different proteins. Many of the DNA
sequences that do not code for proteins play important and critical roles in
the regulation and expression
of other genes. We are only beginning to understand how genes and their
products interact to produce the whole organism and how it functions.
But we have come a long way in a rather short span of time!
The Human Genome Project
General Aspects of the Human Genome Project
The Human Genome Project, an international effort to map and sequence all the
human chromosomes and also the genomes of other organisms, began in 1990 and
was projected to take 15 years. A first draft was announced in 2000, and the
project was declared complete in 2003, two years ahead of schedule, partially
because of the entry of Craig Venter in 1998. The
Human Genome Project began as a publicly funded, international consortium
of scientists led by Francis Collins. The funding came primarily from
the National Institutes of Health and the Department of Energy and also
from a British charity, the Wellcome Trust. Then in 1998, Craig Venter
(who had been at NIH) announced that his new company, Celera, could do
the job faster and cheaper. And he did. While much work remained to fill
in the gaps, this was an amazing accomplishment and it was done in an
amazingly short number of years.
Leroy Hood, a key player in the elucidation of the human genome, predicts
that within 5 years we will be able to sequence each person's genome within
a few hours for a reasonable cost. He foresees an era of Systems Biology
whereby we will be able to prevent disease by knowledge of each person's
unique genome and our accumulated knowledge of how genes are regulated
and how the protein products interact within the cell. A whole new field
of nanotechnology is underway and will lead to a new era in medicine which
personalizes health care. Since genes encode proteins, an essential ingredient
of this new systems approach to health is knowledge of the proteome,
the constellation of all proteins in a cell. The understanding of gene
regulation, interactions of proteins, and environmental influences are
currently areas of intense research. Based on the knowledge of each person's
unique genome along with the knowledge of how our proteins function, we
will be able to predict what diseases each person is predisposed to and
it will be possible to administer personalized "medicines" that
will prevent the disorder.
Some History
It took about 75 years from the time the German scientist Friedrich Miescher
first isolated nucleic acids in 1869, from pus (white blood cells) taken
from bandages, for scientists to realize that nucleic acids were the genetic
material. Avery, MacLeod and McCarty, using bacteria (1944), and Hershey
and Chase, using bacteriophage (1952), provided evidence that DNA was the
genetic material. In 1953 Watson and Crick proposed their model of the
structure of DNA, showing how replication, coding and mutation could be
explained on the basis of that structure; they shared the Nobel Prize for
this work in 1962. And 50
years later, the entire human genome has been sequenced! One scientist
has compared this accomplishment to the 1543 publication of the first
book on human anatomy. Even though that book identified almost every part
of the human body, today we are still struggling to understand how many
of the parts work and how they interact. So the party has only begun!
The 1970s and 80s were a time of intense gene mapping and, later, gene
sequencing, especially after the development of recombinant DNA techniques.
All of this led up to the Human Genome Project. In 2001, with 90% of the
genome sequenced, there was a major progress report (see photo at the
beginning of this page). The gaps were filled in in time for the 50th anniversary
of Watson and Crick's discovery of the double helix structure of DNA.
At that time 99% of the gene containing regions had been sequenced
to an accuracy of 99.999%. Everyone was very surprised when it was revealed
that the Human Genome contained only 25,000-35,000 genes, only about twice
as many as some insects and worms!
Only a very small amount of our DNA is responsible for the differences
among humans, indeed among all organisms. The genome is approximately
99.9% the same between individuals of all nationalities and backgrounds.
Over 50% of the human genome shows a high degree of sequence similarity
to genes in other organisms. Not only is there no correlation between the
amount of DNA and the complexity of the organism, most of our DNA is said
to be "junk" DNA. What is meant by "junk" is that
it does not code for proteins, the molecules that control all chemical
reactions and form most intracellular and extracellular structures. We know some
of the functions of this non-coding DNA and undoubtedly we will continue
to find out more about its function as time goes on. Only 1% of our DNA
codes for proteins. The vast majority of our DNA is non-protein-coding,
and repetitive DNA sequences account for at least 50% of the non coding
DNA. The genome contains approximately 20,000-25,000 protein coding genes.
Many human genes are capable of making more than one protein, allowing
human cells to make perhaps 80,000-100,000 proteins from only 20,000-25,000
genes.
It is amazing to note that DNA was only shown to be the genetic material
in 1944, when Avery, MacLeod and McCarty discovered that DNA isolated from
encapsulated pneumococci could restore the capsule making ability
to mutant pneumococci unable to make a capsule. No other molecule (not
protein, RNA, or carbohydrate) from the encapsulated bacteria could restore
this ability. Only a few years later, in 1953, Watson and Crick announced
that they had uncovered the structure of DNA as a double stranded helix. Their model
could explain coding, mutation, and replication. It was not very long
before the entire set of 64 codons and their corresponding amino acids
had been worked out.
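The figure of 64 codons is simple counting: four bases read three at a time give 4 x 4 x 4 = 64 triplets. The short Python sketch below enumerates them. The variable names are illustrative only, and the three stop codons are the ones listed later in these notes (TAA, TGA, TAG).

```python
from itertools import product

BASES = "TCAG"  # DNA alphabet; the ordering is an arbitrary choice for this sketch

def all_codons(bases=BASES):
    """Return every possible three-base codon over the given alphabet."""
    return ["".join(triplet) for triplet in product(bases, repeat=3)]

codons = all_codons()

# The three translation stop codons; every other codon specifies an amino acid.
STOP_CODONS = {"TAA", "TGA", "TAG"}
sense_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons))        # 4 ** 3 = 64
print(len(sense_codons))  # 64 - 3 = 61
```

Because 61 codons specify only 20 amino acids, most amino acids are encoded by more than one codon.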
Similarities between groups of organisms
Many genes and gene alignments have been found to be common among organisms.
In spite of the variation in chromosome number there are a number of genetic
and physical linkages between single-copy genes that are remarkably conserved
amid a background of very rapidly evolving repetitive DNA sequences. The
term "orthologs" refers to similar genes in different species
that encode proteins with the same function. They originated from the
same gene in a common ancestor. "Paralogs" are genes/loci that
are homologous to other genes in the same species. They are likely to
have originated from a common ancestral gene; the alpha and
beta globin genes are an example. In many cases these conserved genes and regions can
be identified in humans. There is now a classification system based on
orthologous relationships between genes which appears to be a natural
framework for comparative genomics and should facilitate both functional
annotation of genomes and large-scale evolutionary studies.
Obviously, the genes of more closely related species are more similar.
Human DNA is 99.9% identical between individuals. We share genes and gene
alignments closely with the other primates, the chimpanzees, orangutans,
and gorillas, but we also share genes with bacteria and yeast as well
as other more primitive organisms. Evolution is a tinkerer: the same basic
material is used over and over and modified for new cellular functions.
Genomes of many other "model" organisms have been and are being
sequenced including bacteria, yeast, roundworms, fruit flies, fish, mice,
dog, and more. They are important because we can do sophisticated genetic
studies on them that we cannot ethically do with humans such as specific
matings, and inserting or removing genes. Important genes are highly conserved
from species to species. Over 50% of the human genome shows a high degree
of sequence similarity to genes in other organisms. Genes with sequence
similarity are called homologs (orthologs and paralogs are types of homologous
genes). For example, the "obese" (OB) gene which produces the
protein, leptin, that affects brain cells to suppress appetite and stimulate
metabolism of food was first discovered in mice. Mice deficient in leptin
become obese.
The results of comparative genomics are proving to be interesting
in the elucidation of gene evolution and the similarities and differences
among organisms. We share about 96% of our genes with chimpanzees, 80%
with mice, 75% with dogs, 50% with the fruit fly, Drosophila, 40% with
roundworms (nematodes) and 30% with yeast. We even have about 100 genes
in common with many bacteria. This is all evidence of our evolutionary
past. Those genes we have in common with other organisms must be very
important. Many genes that cause disease when mutated in humans have
counterparts in fruit flies. Of the genes mutated in 289 human disease
conditions, 61% are found in the fruit fly. They include genes involved in
prostate cancer, pancreatic cancer, cardiac disease, cystic fibrosis, leukemia
and others. What defines us genetically is the complexity of how our genes are
used and how these genes interact with one another to carry out the myriad of
functions and give us the unique characteristics of humans. New discoveries
occur on a daily basis. Keep your ears and eyes open!
There are coding and non coding sequences in nuclear DNA.
Twenty-five to thirty percent of our genome is unique or single copy
DNA that includes the genes that code for proteins. "Real" genes
that code for proteins have both coding (exons, start and stop codons)
and non coding DNA regions (promoters, introns, RNA processing signals).
We know that many of these non coding regions are functionally necessary.
The coding regions of genes are only a small proportion of the single
copy DNA sequences since eukaryotic genes have introns and other non coding
regions. There are also non coding regions between genes.
Much of the non coding DNA is highly or moderately repetitive. Repetitive
DNA can be dispersed or tandemly arranged. For much of this DNA the functions
have not been completely established. Tandemly repeated DNA is often called
"satellite DNA" because when centrifuged in a density gradient this DNA forms
bands separate from the bulk of genomic DNA. There are three satellite
bands (although only one shows up on the figure below). Satellite DNA is
classified by repeat length as classic satellite DNA (100-6500 bp repeats),
minisatellite DNA (20-100 bp repeats), and microsatellite DNA (2-10 bp
repeats, such as (CA)n).
Cesium chloride density gradient showing the main band
of DNA and the satellite (repetitive) DNA
One type of repetitive DNA codes for rRNA and tRNA which form gene clusters.
The rRNA genes in humans are found tandemly arranged on the p arms of
the five D and G group chromosomes (13, 14, 15, 21, 22). These regions
are referred to as the NOR or nucleolar organizing regions and they form
the nucleolus of the interphase cell. The nucleolus has a fibrous portion
which is open rDNA being transcribed into rRNA and a granular region where
ribosomes are being assembled (the ribosomal proteins are made in the
cytoplasm and must be transported back into the nucleus).
ORGANIZATION OF THE HUMAN GENOME
Pseudogenes are related to functional genes
but are no longer capable of being transcribed or translated. One type
of pseudogene arose from duplications of genes which then acquired mutations
rendering them untranscribable or untranslatable. These pseudogenes cannot
be transcribed or translated because they have accumulated fatal errors such
as a nonsense mutation or a mutation in the promoter. Another type of
pseudogene is referred to as a retropseudogene because it arose by reinsertion
of a cDNA made by a reverse transcriptase using an mRNA template. This
type contains no introns, since these were spliced out of the original RNA
transcript, and no promoter, since the promoter is not part of the
transcript. Without a promoter, they cannot be transcribed.
Two thirds of the repetitive non coding DNA sequences is in more complex
repeated sequences dispersed or scattered throughout the genome. These
can be further divided into short and long interspersed sequences, SINES
and LINES. They are mobile elements within our genome.
LINES are up to 7000 bp in length and represent about
4% of our total human genome. LINES contain a transposable element which
makes an RNA coding for reverse transcriptase. The transcriptase can make
cDNA from RNA which can reintegrate into another site. LINES are
found in the dark G bands of banded chromosomes. G bands are rich in AT
and, therefore, have fewer CpG islands (CpG islands are
common in "housekeeping gene" promoters) and fewer genes.
The Y chromosome has more LINES than the X, and the X has more than the
autosomes. There is less meiotic recombination where there are LINE elements.
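The base composition terms used above (AT rich, CpG) can be made concrete with a small calculation. The Python sketch below computes the GC fraction of a sequence and its observed/expected CpG ratio, that is, the number of CG dinucleotides seen compared with the number expected from the C and G counts alone. The function names and the two toy sequences are illustrative assumptions, not data from these notes.

```python
def gc_fraction(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    return observed * len(seq) / (c_count * g_count)

# Toy sequences invented for this example
at_rich = "ATTATAATCATTAGATTA"
cpg_rich = "GCGCGCCGGCGCGCGC"

for name, seq in [("AT rich", at_rich), ("CpG rich", cpg_rich)]:
    print(name, round(gc_fraction(seq), 2), round(cpg_obs_exp(seq), 2))
```

A stretch with a high GC fraction and a high observed/expected ratio is the kind of region described above as a CpG island; the AT rich toy sequence scores low on both measures.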
SINES are shorter interspersed elements 90 to 500 bp
in length. One well known SINE is the Alu sequence which is about 300
bp in length. Alu sequences are found only in primates, and Alu
is the most frequent human SINE (approximately 5 x 10^5 copies,
3 - 6% of the total human genome). These SINES are named for the restriction
enzyme AluI, which cuts at AGCT, a site commonly found in the repeat. AluI is
named for the bacterium, Arthrobacter luteus, from which the
restriction enzyme comes.
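To illustrate what it means for a restriction enzyme to cut at AGCT, the Python sketch below finds every AGCT site in a sequence and cuts a linear molecule at each one. AluI cuts between the G and the C (AG^CT), leaving blunt ends. The toy sequence and the function names are invented for this example, and only one strand is modeled.

```python
ALU_SITE = "AGCT"   # AluI recognition sequence, as given in the text
CUT_OFFSET = 2      # AluI cuts between G and C (AG^CT), leaving blunt ends

def find_sites(seq, site=ALU_SITE):
    """Return the 0-based start position of every occurrence of the site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

def digest(seq, site=ALU_SITE, cut_offset=CUT_OFFSET):
    """Cut a linear sequence at every site and return the fragments in order."""
    fragments = []
    previous_cut = 0
    for pos in find_sites(seq, site):
        cut = pos + cut_offset
        fragments.append(seq[previous_cut:cut])
        previous_cut = cut
    fragments.append(seq[previous_cut:])
    return fragments

dna = "GGATAGCTTTACCAGCTGA"   # toy sequence containing two AGCT sites
print(find_sites(dna))        # [4, 13]
print(digest(dna))            # ['GGATAG', 'CTTTACCAG', 'CTGA']
```

Two sites in a linear molecule give three fragments, and joining the fragments back together restores the original sequence.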
SINES can be found in introns, exons and extragenic sequences.
These transposable elements can be a source of mutation. For example,
in some hemophilia A patients there is an insertion of an L1 sequence
into an exon. A patient with NF1 (neurofibromatosis type 1) had
an inactivating Alu insertion in the normal allele. These repetitive sequences can
play a role in rearrangements and gene duplication (e.g., beta globin
genes). Repetitive sequences are commonly found near sites of deletions.
LTR (long terminal repeat) retrotransposons
make up a large fraction of the typical mammalian genome. They comprise
about 8% of the human genome. On account of their abundance, LTR
retrotransposons are believed to hold major significance for genome structure
and function. Long terminal repeats (LTRs) are, as the name suggests,
long repeated sequences of DNA, several hundred nucleotides long, found
at either end of proviral DNA. LTRs are believed to play some role
in the integration of viral DNA into the host genome, as they are found
in retroviruses and retrotransposons.
Mitochondrial DNA is a single double stranded circular
molecule. There are several copies in each mitochondrion and there are
many mitochondria in each of your cells. Mitochondria originated by
endosymbiosis of a prokaryotic cell early in the evolution of eukaryotic
cells. Mitochondrial DNA is similar to prokaryotic DNA. There are no
histones or any other protein associated with mt DNA. The genes contain
no introns. Because it is in a highly oxidizing environment it has a
much higher rate of mutations than nuclear DNA. The genes in mt DNA
code for mitochondrial ribosomal RNAs and transfer RNAs. Some genes code
for polypeptide subunits of the electron transport chain common to all
mitochondria. It relies on nuclear gene products for replication and
NON PROTEIN CODING DNA
REPETITIVE DNA ~ 50% OF OUR DNA
Genes may be unique sequences or belong to a gene family
such as the globins, actins, myosins, histones, tubulins that are repetitive.
Gene families refer to genes with similar DNA sequences which arose through
duplication of an ancestral gene followed by generations of mutations.
Members of a gene family may be close to one another in clusters or they may be
dispersed: they may form a cluster on the same chromosome or they may be located
on different chromosomes. Hemoglobin is a tetramer composed of four peptide
chains, two alpha and two beta globins. The alpha globin gene cluster
is on human chromosome 16 and the related beta globin cluster is on chromosome
11. Examples of gene families include rDNAs, tDNAs, the histone genes,
P450 enzyme superfamily, hemoglobin genes, actin genes. Pseudogenes may
be part of a gene cluster or family. These gene duplicates are now evolutionary
Paralogs are the result of a gene duplication event arising
after speciation. Genes in two species that have directly evolved from
a single gene in the last common ancestor are called orthologs.
The classic macro satellite DNA has repeats of 100 to 6500 bp. This category
includes tandemly repeated satellite DNA from the centromeric repeats
(171 bp) unique to each chromosome and the telomeric repeats. The centromeric
repeats are referred to as alpha satellite DNA and each chromosome has
its unique sequence. Therefore, it is possible to make DNA probes specific
to each of our 24 different chromosomes. When a fluorescent label is added
to the probe, it is possible to count the number of each type of chromosome
even in an interphase cell. Therefore, it is possible to check for trisomies
in interphase amniotic fluid cells prior to culturing them for karyotyping.
Another type of tandemly repetitive DNA is referred to as mini satellite
sequences or VNTRs (variable number of tandem repeats). They are composed
of 20 to 100 bp repeats. The third type of tandemly repetitive DNA is
referred to as micro satellite sequences or STRs (short tandem repeats)
composed of 2 to 10 bp repeats. Since the number of repeats in micro and
mini satellites is highly variable (polymorphic), they are very useful
in gene mapping and in DNA profiling for paternity testing, forensic testing,
confirmation of relatedness and identification of human remains. Both VNTRs and
STRs are polymorphisms in non coding regions and are inherited in a codominant
pattern. They are formed by mutations which add or remove
repeats. Most individuals in the population are heterozygous at each
of these loci. Hyper variable mini satellite DNA is found preferentially
close to telomeres, where it can cause misalignment that results in deletion
and duplication mutations.
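The idea that two alleles at an STR locus differ only in the number of repeats can be sketched in a few lines of Python. The function below scans a sequence for perfect tandem repeats with a unit of 2 to 10 bp, the microsatellite range given above, and reports where each run starts, the repeat unit, and the copy number. The two toy alleles, the minimum of three copies, and the function names are assumptions made for this example. Real STR typing uses PCR and fragment sizing rather than a text search.

```python
import re

def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter motif (e.g. CACA = CA x 2)."""
    for k in range(1, len(unit)):
        if len(unit) % k == 0 and unit == unit[:k] * (len(unit) // k):
            return False
    return True

def find_strs(seq, min_unit=2, max_unit=10, min_repeats=3):
    """Return (start, unit, copies) for each perfect tandem repeat in seq."""
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One unit, followed by at least (min_repeats - 1) further copies of it
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                copies = len(match.group(0)) // unit_len
                hits.append((match.start(), unit, copies))
    return hits

# Two hypothetical alleles of one locus: same flanking DNA, different repeat number
allele_a = "GGTC" + "CA" * 5 + "TTGA"
allele_b = "GGTC" + "CA" * 8 + "TTGA"

print(find_strs(allele_a))   # [(4, 'CA', 5)]
print(find_strs(allele_b))   # [(4, 'CA', 8)]
```

A person carrying both alleles would be heterozygous at this locus, and because both repeat lengths can be detected, the two alleles behave as codominant markers.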
Essential Conserved Non Coding DNA Sequences
Many DNA sequences that do not code for proteins are nevertheless
essential and their sequence must be conserved in order for them to serve
their function. These DNA sequences include promoter sites that bind RNA
polymerases, regulatory elements (enhancers, silencers, and locus control
regions LCRs) that bind regulatory proteins, the origin of replication
sites that bind the DNA replication complex, the centromeric DNA, the
telomere DNA, and many others.
The definition of a gene has evolved over time. It is no longer a "bead"
on a string nor is it merely a sequence of bases that codes for amino
acids in a single polypeptide chain. While the Beadle and Tatum model
of "one gene, one enzyme" is enticingly simple, we have had
to move on to acknowledge that genes are far more complex. There are both
coding and non coding regions or untranslated regions (UTRs) in the DNA
associated with genes. The non coding regions, as mentioned earlier, include
promoters, transcriptional regulatory sequences, introns and polyadenylation
signals. Post transcriptional processes that modify the initial RNA transcript
usually include 5' cap addition, 3' poly A addition, splicing out of introns
and sometimes, alternative splicing of exons to form different mRNAs
from the same gene. Introns are spliced out of transcribed RNA by a large
RNA protein complex, the spliceosome. Post translational cleavage of proteins,
while rare, can also occur as in the case of insulin and some hormones.
The use of alternative promoters is common and is used to generate cell
type specific mRNAs. These alternative promoters may be found within introns
of the gene. The human dystrophin (DMD) gene, which has 79 exons,
has at least eight different alternative promoters! In humans, the vast
majority of genes are transcribed individually and, in these cases, the
terms gene and transcription unit are essentially equivalent. The usual
linear order at a gene site is: regulatory element(s) such as enhancers
or silencers (where regulatory proteins bind); promoter region (where the
RNA polymerase complex binds); transcription start site (CAP site) at the
beginning of the 5' UTR; ATG, translation initiation codon; exon(s)
(variable number); introns (between exons, beginning with 5' GT and ending
with 3' AG, variable number); translation stop codon (TAA, TGA, or TAG);
3' UTR containing the AATAAA polyadenylation signal; and the
site for addition of the poly A tail. Some genes have alternate splice sites
so that several different proteins can be produced from the multiple mRNAs
that are produced from the same gene.
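The linear order described above can be traced on a toy sequence. The Python sketch below builds a miniature "gene" from named parts, removes one intron after checking that it begins with GT and ends with AG, and then reads the coding sequence from the ATG to the first in-frame stop codon. Every sequence, coordinate and function name here is invented for illustration. The sketch works in DNA letters (T rather than U) and ignores the cap, the poly A tail and all regulation.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def splice(pre_mrna, introns):
    """Remove introns given as (start, end) 0-based half-open intervals.

    Each intron is checked against the GT...AG boundary rule.
    """
    exons = []
    previous_end = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} lacks GT...AG boundaries")
        exons.append(pre_mrna[previous_end:start])
        previous_end = end
    exons.append(pre_mrna[previous_end:])
    return "".join(exons)

def coding_sequence(mrna):
    """Return the sequence from the first ATG through the first in-frame stop, or None."""
    start = mrna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return None

# Hypothetical parts, in the 5' to 3' order described in the text
utr5 = "CC"
coding_part1 = "ATGGCT"       # starts with the ATG initiation codon
intron1 = "GTAAGTTTTCAG"      # begins with GT, ends with AG
coding_part2 = "GAATGGTAA"    # ends with the TAA stop codon
utr3 = "TTAATAAAGG"           # contains the AATAAA polyadenylation signal

pre_mrna = utr5 + coding_part1 + intron1 + coding_part2 + utr3
intron_start = len(utr5) + len(coding_part1)
intron_end = intron_start + len(intron1)

mrna = splice(pre_mrna, [(intron_start, intron_end)])
print(mrna)                    # CCATGGCTGAATGGTAATTAATAAAGG
print(coding_sequence(mrna))   # ATGGCTGAATGGTAA
```

The coding sequence that comes out is only five codons long (ATG GCT GAA TGG TAA), but the same steps apply to a real transcript with many exons.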
Regulatory genes code for transcription factors. These proteins
may interact directly with DNA or with other transcription factors to
work as a complex. The purpose is to control gene expression: to turn
genes on or off or control the rate of mRNA production. The DNA binding
proteins often have similar DNA binding domains within them.
ORIGIN OF REPLICATION COMPLEX (ORC) AND REPLICATION COMPLEX
DNA binding motifs in regulatory proteins
3 alpha helices forming an L shape
DNA is unwound with a widened shallow minor groove
The bound SRY causes an 80° bend
Binds to the specific base sequence A/TAACAAT/A
3-D NMR picture of SRY bound to DNA
Exon skipping and splicing of mRNA to make more than one protein from
a single gene
The ability to make more than one gene product (polypeptide)
from a single gene explains in part how we can have many more gene products
than the number of genes found by the Human Genome Project. Above
is an illustration of how the CGRP gene can make three different products.
Of course the genes that code for antibodies have long been known to "cut
and paste" to form the very large number of immunoglobulins that we
are capable of making.
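A rough way to see how exon skipping multiplies the number of products is to count the combinations. In the Python sketch below, a hypothetical gene has two exons that are always kept and two "cassette" exons that may each be included or skipped, which gives four possible mRNAs. The exon names and the function are made up for this example. In real cells splicing is regulated, so not every combination is necessarily produced.

```python
from itertools import product

def isoforms(exons):
    """List every exon combination for a gene.

    exons is a list of (name, is_constitutive) pairs in 5' to 3' order.
    Constitutive exons are always kept; each remaining (cassette) exon is
    independently included or skipped.
    """
    optional = [name for name, constitutive in exons if not constitutive]
    results = []
    for choice in product([True, False], repeat=len(optional)):
        included = dict(zip(optional, choice))
        results.append(
            [name for name, constitutive in exons if constitutive or included[name]]
        )
    return results

# Hypothetical gene: E1 and E4 are always used, E2 and E3 can be skipped
gene = [("E1", True), ("E2", False), ("E3", False), ("E4", True)]

for mrna in isoforms(gene):
    print("-".join(mrna))
# E1-E2-E3-E4
# E1-E2-E4
# E1-E3-E4
# E1-E4
```

With n cassette exons this simple model allows up to 2^n different mRNAs from one gene, which is one reason the number of proteins can far exceed the number of genes.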
Some genes do not code for proteins
The genes that code for ribosomal RNAs and for the transfer
RNAs are not translated. Also the XIST (X inactivation specific transcript)
gene codes for an RNA that does not code for a protein. It makes an RNA
that interacts with the X chromosome sequences that are inactivated in
the second X chromosome of the female. There are three different RNA polymerases,
Pol I that transcribes ribosomal RNA genes, Pol II that transcribes protein
coding genes and snRNA, and Pol III that transcribes tRNAs. They each
have specific promoter regions to which they bind to begin transcription.
Pol III has a promoter site within the tRNA gene.
The classical view of a gene has been greatly altered. We
now know that a single region of DNA can be transcribed in a variety of
ways to produce many different RNAs, some coding for proteins and others
constituting regulatory RNAs. We have known for some time that protein
coding regions can overlap and that they can be read in both directions.
DNA sequences can produce a variety of RNA transcripts used for multiple
functions. A new conception of the genome shifts the focus from genes
to transcripts: away from the protein coding regions and toward the variety
of functional RNA transcripts, only some of which are mRNAs, including
the new classes of functional RNAs as they are discovered.
It is interesting to note that many mutations in DNA that
are correlated with diseases (breast, prostate, and lung cancers, autism,
schizophrenia) have been found to be in regions of DNA that do not code
for proteins. At this time the specific function of many of these DNA
regions is unknown.
Formerly known as "Junk" DNA
The biggest surprise of decoding the human genome was how
few protein coding genes it contained. A surprisingly large amount of
DNA has been found to not code for proteins. This "extra" DNA
had been thought to be nonfunctional and was initially called "junk
DNA." Simpler organisms have less of this "junk" DNA. It
is now thought that this "non-coding" or "junk" DNA
plays an important role in gene regulation and the increasing complexity
We now know that 98% of the genome is transcribed into RNAs
and scientists are recognizing that the non-coding RNAs are playing important
roles in just about everything the cell does. These RNAs include snRNAs
(small nuclear RNAs) and snoRNAs (small nucleolar RNAs), both of which are
located within the nucleus. There are also miRNAs (microRNAs) and siRNAs
(short interfering RNAs), which can modify the activity of protein-coding
genes. So in addition
to the regulatory proteins known to influence the activity of other genes,
we now know that there are many different RNAs that play a regulatory
role. In fact, in recent years scientists have been studying an extended
family of "non coding" RNA transcripts that play crucial roles
in cellular information control determining what proteins are made, their
conformation and where or when they are made. They are also emerging as
key to how the human brain functions.
Highlights from the Human Genome Project
1. The human genome consists of ~3.1 billion base pairs; 2.85 billion
have been fully sequenced
2. The genome is ~99.9% the same between individuals of all nationalities
and backgrounds
3. Less than 2% of the genome codes for proteins
4. The vast majority of our DNA is non-protein-coding, and repetitive
DNA sequences account for at least 50% of the non coding DNA
5. The genome contains ~20,000-25,000 protein coding genes
6. Many human genes are capable of making more than one protein, allowing
human cells to make perhaps 80,000 to 100,000 proteins from only 20,000-25,000
genes
7. Functions for over half of all human genes are unknown
8. Chromosome 1 contains the highest number of genes. The Y chromosome
contains the fewest genes
9. Over 50% of the human genome shows a high degree of sequence similarity
to genes in other organisms
10. Thousands of human disease genes have been identified and mapped to
their chromosomal locations
Chromosomes (colored bodies) had been seen in the light microscope in
the nineteenth century and had been identified as units of heredity. However,
not until 1956 was the correct number of 46 for the human chromosomes
known. It was a serendipitous laboratory accident using hypotonic saline
which swelled the cells thereby allowing the chromosomes to separate sufficiently
to get an accurate count. At first the chromosomes were only identified
by relative length and the position of their centromeres. By these criteria,
they were separated into 7 groups, A, B, C, D, E, F, G plus the X and
Y sex chromosomes. Then in the 1960s staining techniques that produced
banding were introduced. This G-banding (Giemsa) made the identification
of each chromosome much easier and it also allowed us to see more subtle
chromosome structural changes. While the basis of G-banding is still not
known, we do know that the darker bands are AT rich, contain fewer genes,
and replicate late in the S phase of the cell cycle while the lighter
bands are GC rich, are gene rich, and replicate early in the S phase.
In 1971 at a conference in Paris, scientists got together to draw up a
system of numbering of the bands and sub bands. The result is the Paris
Conference ideogram. In making the assignments, however, they did make
one mistake. Chromosome 22 has more DNA than chromosome 21 and thus the
numbers should have been reversed. Since an extra chromosome 21 was already
associated with Down syndrome, there was a decision not to change the
Paris Conference ideogram.
HUMAN AND CHIMPANZEE CHROMOSOMES AND BANDING PATTERNS
ARE VERY SIMILAR
In humans, all of our 3 billion base pairs and approximately 20,000-25,000 protein coding genes
are compacted and packaged into 23 pairs of chromosomes and 24 linkage
groups. In most animals and plants that reproduce sexually, chromosomes
come in pairs with one member of each pair from each of the two parents.
Each eukaryotic chromosome is composed of a single molecule of double
stranded DNA, 5 different histones, and some other non histone proteins.
The basic eukaryotic chromosome structure consists of DNA wrapped around
the evolutionarily conserved histones. There are five types of histones:
H1, H2A, H2B, H3 and H4. Approximately two turns of DNA wrap around an
octamer composed of two molecules each of H2A, H2B, H3 and H4. Histone
H1 binds in the region where the DNA enters and exits the nucleosome,
presumably stabilizing the DNA at this point. The histones contain a large
number of basic amino acids (lysine and arginine) which carry a positive
charge and which dampen the negative charge of the phosphate groups on the
DNA molecule. Each nucleosome unit includes approximately 200 base pairs of
DNA, with about 146 of them wrapped around the octamer of histones. Although
nucleosomes must be opened before transcription can occur, the number of
bases in a nucleosome is far smaller than the number required for a gene.
Because they are essential to the structure of chromosomes, histones must be
synthesized along with the DNA during the S period of the cell cycle.
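The numbers in this paragraph allow a quick back-of-the-envelope estimate of how many nucleosomes are needed to package the genome. The Python sketch below uses only figures quoted in these notes (about 3 billion base pairs, about 200 bp per nucleosome unit, about 146 bp wrapped around the octamer). The result is a rough order of magnitude for one haploid genome, and the variable names are illustrative.

```python
GENOME_BP = 3_000_000_000   # ~3 billion base pairs (haploid genome, from the text)
BP_PER_NUCLEOSOME = 200     # DNA per repeating unit: wrapped DNA plus linker
BP_WRAPPED = 146            # DNA wrapped around the histone octamer
HISTONES_PER_OCTAMER = 8    # two each of H2A, H2B, H3 and H4

nucleosomes = GENOME_BP // BP_PER_NUCLEOSOME
linker_bp = BP_PER_NUCLEOSOME - BP_WRAPPED
wrapped_fraction = BP_WRAPPED / BP_PER_NUCLEOSOME
core_histones = nucleosomes * HISTONES_PER_OCTAMER

print(f"{nucleosomes:,} nucleosomes")               # 15,000,000 nucleosomes
print(f"{linker_bp} bp of linker per unit")         # 54 bp of linker per unit
print(f"{wrapped_fraction:.0%} of DNA is wrapped")  # 73% of DNA is wrapped
print(f"{core_histones:,} core histone molecules")  # 120,000,000 core histone molecules
```

A diploid cell needs twice these amounts, and all of those histones must be made again each time the DNA is copied in S phase.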
Centromeres are specialized regions within chromosomes that
play a critical role in the accurate segregation of duplicated chromosomes
during cell division. They are the site of kinetochore attachment necessary
for spindle attachment. Centromere nucleosomes contain an alternative
histone, CenH3, which is thought to define centromere identity and participate
in mitotic mechanics. A biochemical and biophysical analysis of centromere
nucleosomes in Drosophila nuclei revealed that CenH3 appears in a heterotypic
tetrameric half-sized nucleosome, with one copy each of CenH3, H2A, H2B,
and H4. These tetramers protect less DNA [~120 base pairs (bp)] than the
typical octamers (~150 bp) and do not seem to form as regular higher-order
structures as the octamer, yielding longer and more variable DNA linker
lengths. This looser chromatin conformation, embedded within heterochromatin,
may be critical for tethering the kinetochore to the centromere (Dalal
et al., PLoS Biol. 5, e218 (2007)).
Cells use various chemical modifications to histones to alternatively
expose or sequester genes, thus turning them on or off. There is a general
correlation between patterns of histone "decoration" and gene
activity. In particular, parts of chromosomes in which histones are covered
with acetyl groups tend to have transcriptionally active genes. Deacetylated
histones tend to harbor inactive genes. DNA near methylated histones is
generally shut down. Each histone has a "tail," a flexible string
of amino acids extending from the DNA-wrapped nucleosome. Acetyl and methyl
groups tend to attach to particular amino acids in the tails. The "tails"
have evolutionarily conserved sequences, implying they are important. The
cell is known to have histone acetylases and deacetylases which are implicated
in turning genes on and off. New histone-tail decorations beyond methylation
and acetylation have now been identified.
Sugars or small proteins including ubiquitin can also mark histones.
Some scientists believe that modified histone tails act as sites for the
binding of other proteins that influence the accessibility of DNA for
gene activity. One such protein is heterochromatin protein 1 (HP1), a molecule
known to mediate the silencing of genes. It binds to the amino acid lysine
on the tail of histone H3 only if methyl groups adorn the lysine. Histone
methylation, unlike acetylation, appears to be stable and transferred
during mitosis. Since methylation is less transient, it appears to be
more involved in the long-term setting of the cell's genome (in differentiated
cells) while acetyl groups frequently hop on and off histone tails. Phosphate
groups attached to several amino acids on the tail of H3 indicate the
cell is dividing while a phosphate group on a particular serine on the
tail of H2B signals that the cell is about to commit suicide (apoptosis).
Late in spermatogenesis cysteine-rich protamines replace the histones.
They allow a higher degree of compaction of the DNA in the sperm.
A functional eukaryotic chromosome must contain the following:
1. a centromere, which contains satellite DNA unique to each chromosome,
and the kinetochore, a protein structure to which the spindle fibers attach;
2. a telomere at each end of the chromosome, containing tandem repeats of
TTAGGG (3 - 20 kb), a special repetitive DNA necessary to prevent shortening
of the chromosome through the numerous rounds of replication. There is
hypervariable minisatellite DNA preferentially close to telomeres; and
3. origins of replication, consensus DNA sequences which bind the various
proteins and enzymes required for replication.
Each chromosome contains only a single molecule of double-stranded DNA.
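The tandem-repeat structure of the telomere described above can be illustrated with a short Python sketch. It scans a DNA string for the longest uninterrupted run of the TTAGGG unit. The function name and the toy sequence are made up for illustration; real telomere analysis has to cope with sequencing errors and variant repeats, which this sketch ignores.

```python
import re

TELOMERE_UNIT = "TTAGGG"  # human telomeric repeat named in the text


def longest_telomere_run(seq, unit=TELOMERE_UNIT):
    """Return (start, repeat_count) of the longest uninterrupted run of `unit`.

    Returns (-1, 0) when the unit does not occur at all.
    """
    seq = seq.upper()
    best_start, best_count = -1, 0
    # (?:UNIT)+ matches one or more back-to-back copies of the unit.
    for match in re.finditer(f"(?:{unit})+", seq):
        count = len(match.group(0)) // len(unit)
        if count > best_count:
            best_start, best_count = match.start(), count
    return best_start, best_count


# Toy sequence (hypothetical): 4 repeats, a spacer, then 2 repeats.
toy = "ACGT" + "TTAGGG" * 4 + "CCATG" + "TTAGGG" * 2
start, count = longest_telomere_run(toy)
print(start, count, count * len(TELOMERE_UNIT))  # start, repeats, bp covered
```

At 6 bp per unit, the 3 - 20 kb range quoted above corresponds to roughly 500 to 3,300 copies of the repeat.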
Homologous chromosomes are pairs of chromosomes, one received from each
parent. They carry the same genes in the same linear order, although the
versions (alleles) of those genes may differ. Sister chromatids are exact replicas
of one chromosome and they are synthesized
during the S period of the cell cycle. They are connected to one another at the
centromere region until they separate at anaphase. Each
gene occupies a specific locus (plural, loci). The locus is the gene's "address."
Genes at the same locus on homologous chromosomes are called alleles (short for
allelomorphs). Alleles are alternative forms of a gene otherwise known in the
population as polymorphisms. They arise by mutations.
As a budding human geneticist, it is important for you to understand that
the only genetic disorders
that can be detected by looking at chromosomes (karyotyping) are abnormalities
involving changes in the number or structure of chromosomes. These include disorders
such as trisomy 21, Down Syndrome, and structural rearrangements such as translocations,
additions, and deletions. Some microdeletions can be detected by a procedure known
as FISH (fluorescence in situ hybridization). Single gene defects cannot be detected
by karyotyping an individual.
Even after a gene has been identified for a genetic disorder we may not
be able to tell if a person or fetus has a mutation in that gene. An example
of this is Marfan Syndrome which is often due to a new spontaneous mutation
which can occur randomly anywhere within the fibrillin gene. Detection
of genetic disorders is possible only when the (common) mutations within
the responsible gene have been identified. It is possible to detect the
sickle cell mutation because one single base change causes the disorder.
Although sequencing of genes to find mutations is becoming more common
it is still expensive and it may not be possible to distinguish a normal
polymorphism from a harmful mutation. We will discuss this allelic heterogeneity
frequently as we proceed with the course.
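To make the sickle cell example concrete, the sketch below compares the normal and sickle versions of the affected beta-globin codon. The codon values (GAG to GTG, glutamic acid to valine, traditionally numbered codon 6) are standard textbook facts rather than something stated in these notes, and the helper name and the two-entry codon table are hypothetical, included only for illustration.

```python
# Minimal codon table: only the two codons needed for this example.
CODON_TO_AA = {"GAG": "Glu", "GTG": "Val"}


def point_differences(ref, alt):
    """List (position, ref_base, alt_base) for each mismatch.

    Positions are 0-based; both sequences must be the same length.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be the same length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


normal_codon = "GAG"  # beta-globin codon 6, coding strand (HbA)
sickle_codon = "GTG"  # the same codon in the sickle allele (HbS)

diffs = point_differences(normal_codon, sickle_codon)
print(diffs)  # one entry: a single base differs
print(CODON_TO_AA[normal_codon], "->", CODON_TO_AA[sickle_codon])
```

Because every affected allele carries this same change, a test can look for it directly. A disorder such as Marfan syndrome, where the mutation may lie anywhere in the gene, offers no single position to check.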
In eukaryotes, the most abundant covalent modification of
DNA is methylation of cytosine residues at carbon 5 of the pyrimidine
ring. This modification occurs primarily in the context of a simple sequence
(5'-CG-3') called the CpG dinucleotide (stretches unusually rich in CpG are
called CpG islands) and affects both strands of DNA. CpG methylation
serves to increase the coding capacity of the genome: in essence,
5-methylcytosine serves as a 'fifth base' in DNA. Regions of the genome
with high levels of methylated CpG dinucleotides include the inactive
X chromosome in female mammals, imprinted genes, and transposons and their
relics, all of which are associated with stable transcriptional repression.
How does the cell read this information? Additionally, as CpG methylation
is strongly associated with regions of the genome subject to stable transcriptional
repression, how do cells convert the information embedded in cytosine
methylation into a functional state? An association between the methyl
CpG binding protein MeCP2 and human Brahma (Brm), an ATPase subunit of
the human SWI/SNF complex involved in chromatin remodeling, has been identified.
These findings establish a link between DNA methylation and chromatin
structure and provide a new perspective on the mechanism of methylation-dependent
transcriptional repression.
The information provided by CpG methylation in eukaryotic cells is interpreted,
in most cases, by a conserved family of proteins that can interact specifically
with methylated CpG dinucleotides. This methyl CpG binding domain (MBD)
family of proteins is present in most eukaryotic organisms (a notable
exception being yeast, which does not methylate DNA), and its interaction
with methylated DNA has been rigorously characterized.
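The following Python sketch illustrates two points about the 5'-CG-3' sequence discussed in this section. First, it locates every CpG site, that is, every candidate methylation site, in a DNA string. Second, it shows that each site reads as CG on the opposite strand as well, which is why methylation can mark both strands at the same position. The observed/expected ratio is a commonly used measure of CpG richness that is not taken from these notes; the function names and the toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def cpg_positions(seq):
    """0-based positions of the C in every 5'-CG-3' dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return len(cpg_positions(seq)) * len(seq) / (c * g)


toy = "ATCGGACGTTCGA"  # made-up sequence for illustration
top = cpg_positions(toy)
bottom = cpg_positions(reverse_complement(toy))

# Map each bottom-strand site back to top-strand coordinates.
bottom_on_top = sorted(len(toy) - 2 - i for i in bottom)
print(top, bottom_on_top, cpg_obs_exp(toy))
```

The two position lists printed are identical: every CpG on one strand pairs with a CpG on the other, so a methylated site carries the mark on both strands.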
GENE REGULATION OF HISTONES
DNA in eukaryotes is packaged into nucleosomes which consist
of DNA wrapped around histone proteins. Covalent modification of histones
plays a critical regulatory role in controlling transcription, replication,
and repair. Different histone modifications are recognized by different
protein modules found in regulatory complexes with different, even antagonistic