GENETIC NOMENCLATURE FOR Caenorhabditis elegans
Genetic nomenclature for Caenorhabditis elegans is supervised by the Caenorhabditis Genetics Center, as part of a contract from the US NIH National Center for Research Resources. The curator for Genetic Mapping and Genetic Nomenclature is: Dr Jonathan Hodgkin (Genetics Unit, Department of Biochemistry, University of Oxford, UK), email: jah@bioch.ox.ac.uk
Investigators wishing to register new gene names for C. elegans should note the summary guidelines below.
The CGC also assigns specific identifying designations to each laboratory engaged in dedicated long-term genetic research on C. elegans. Each such laboratory is assigned a lab/strain code, for naming strains, and an allele code, for naming mutations (see: http://biosci.umn.edu/CGC/Nomenclature/code.htm). Investigators requiring new CGC designations should apply to jah@bioch.ox.ac.uk.
1. Gene names must conform to the standard format of 3 letters, hyphen, number.
2. Genes can be named on the basis of a mutant phenotype or on the basis of the predicted protein product or RNA product.
3. If a new gene clearly belongs in an existing gene class (of which more than 1000 now exist), then a new gene number will be assigned after consultation with the laboratory responsible for the gene class in question. Gene classes and the corresponding assigning laboratories for each gene class are listed on WormBase <http://www.wormbase.org/> and at the CGC <http://biosci.umn.edu/CGC/Nomenclature/genes.htm>.
4. If the establishment of a new gene class name seems more appropriate, then approval for this name must be obtained from the CGC, preferably by e-mail application to the CGC Genetic Map and Nomenclature Curator <jah@bioch.ox.ac.uk>.
5. Gene names based on homology with a previously named gene in another well-studied organism, such as Saccharomyces cerevisiae or Mus musculus, are often appropriate and desirable, especially where there is convincing orthology between genes.
6. Gene names and gene numbering schemes that conform to established nomenclature proposals for particular protein classes are desirable.
7. Gene names that are memorable, informative and simply explained are encouraged.
8. Gene names based solely on RNAi phenotypes are discouraged.
9. Gene names including c (for Caenorhabditis), ce (for C. elegans), n (for nematode) or w (for worm) are discouraged. C. elegans as the organism of origin can be specified with a prefix (Ce-) if desired.
10. New gene name classes can be assigned in confidence, prior to formal publication or disclosure in an abstract.
This summary is based on the original proposals for C. elegans nomenclature (Horvitz et al., 1979 Mol. Gen. Genet. 175: 129-133), plus additional recommendations that have been distributed in The Worm Breeder's Gazette.
Genetic loci
Genes are given names consisting
of three italicized letters, a hyphen, and an italicized Arabic
number, e.g., dpy-5 or let-37 or mlc-3. The gene name may be followed by an
italicized Roman numeral, to indicate the linkage group on which the gene maps,
e.g., dpy-5 I or
let-37 X or mlc-3 III.
For genes defined by mutation, the gene names refer to the mutant phenotype originally detected or most easily scored: dumpy in the case of dpy-5, lethal in the case of let-37.
For genes defined by cloning on the basis of sequence similarity, the gene name refers to the predicted protein product or RNA product: myosin light chain in the case of mlc-3, superoxide dismutase in the case of sod-1, ribosomal RNA in the case of rrn-1.
Genes with related properties are
usually given the same three letter name and different numbers. For example,
there are three known myosin light chain genes: mlc-1, mlc-2, mlc-3,
and more than twenty different dumpy genes: dpy-1, dpy-2, dpy-3,
and so on.
Genes can be given names corresponding to homologous named genes in other standard genetic organisms. Examples: rnt-1 is the C. elegans ortholog of the Drosophila gene runt; wrn-1 is the C. elegans ortholog of the human gene WRN1, responsible for Werner’s syndrome.
Gene names that are memorable,
informative and simply explained are encouraged.
Genes in a paralogous set related to a single named gene in another organism are sometimes given the same gene name and number, followed by a distinguishing decimal. Example: four C. elegans genes homologous to SIR2 in S. cerevisiae have been given the names sir-2.1, sir-2.2, sir-2.3, sir-2.4.
Pseudogenes, for which there is
good evidence that no functional product is ever generated, can be indicated by
adding the optional italic suffix ps to the gene name, as in msp-48ps.
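The naming rules in this section can be checked mechanically. The following is a minimal sketch, not an official validator: the pattern is an assumption built only from the rules stated above (three lowercase letters, a hyphen, an Arabic number, an optional paralog decimal, an optional ps suffix), and the names `GeneName` and `parse_gene_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: three letters, hyphen, number, optional ".N" paralog
# decimal, optional "ps" pseudogene suffix. Italics cannot be checked here.
GENE_NAME = re.compile(r"([a-z]{3})-([1-9][0-9]*)(?:\.([1-9][0-9]*))?(ps)?")


class GeneName(NamedTuple):
    gene_class: str          # e.g. "dpy"
    number: int              # e.g. 5
    paralog: Optional[int]   # e.g. 1 in sir-2.1, otherwise None
    pseudogene: bool         # True when the name ends in "ps"


def parse_gene_name(name: str) -> Optional[GeneName]:
    """Return the parts of a gene name, or None if it does not fit the pattern."""
    match = GENE_NAME.fullmatch(name)
    if match is None:
        return None
    gene_class, number, paralog, ps = match.groups()
    return GeneName(gene_class, int(number),
                    int(paralog) if paralog else None, ps is not None)
```

For example, `parse_gene_name("sir-2.1")` yields gene class `sir`, number 2 and paralog 1, while a capitalized string such as `Dpy-5` is rejected.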
Gene names based solely on RNAi
phenotypes are discouraged.
Gene names including c (for Caenorhabditis), ce (for C. elegans), n (for nematode) or w (for worm) are discouraged.
Gene names that have been established in the published literature and databases should preferably not be changed.
In cases where a gene has received multiple names, one name will be adopted as the main name for the gene. Other names will continue to be listed in databases. Whenever possible, name changes or the adoption of a single main name should be made with the approval of all laboratories concerned.
Homologous genes
If a homolog of a known C.
elegans gene is identified in a
related species such as Caenorhabditis briggsae, this can be given the same gene name, preceded by two
italic letters referring to the species, and a hyphen. For example,
Cb-tra-1 is the name for the C.
briggsae homolog of the C. elegans gene tra-1.
The C. elegans homolog of a gene identified and named in
another organism can be distinguished by the same convention, using
"Ce-" as an optional prefix. For
example, Ce-snt-1 defines the C. elegans synaptotagmin gene.
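As an illustration of the species-prefix convention, the sketch below separates an optional two-letter prefix from the gene name. The function name `split_species_prefix` is hypothetical, and treating an unprefixed name as C. elegans ("Ce") is an assumption that only makes sense in a C. elegans context.

```python
import re

# Assumed shape of a species prefix: one uppercase and one lowercase letter,
# followed by a hyphen, as in "Cb-" or "Ce-".
SPECIES_PREFIX = re.compile(r"([A-Z][a-z])-(.+)")


def split_species_prefix(name, default="Ce"):
    """Return (species_prefix, gene_name); unprefixed names get the default."""
    match = SPECIES_PREFIX.fullmatch(name)
    if match:
        return match.group(1), match.group(2)
    return default, name
```

With this sketch, `split_species_prefix("Cb-tra-1")` gives `("Cb", "tra-1")`, and both `Ce-snt-1` and `snt-1` resolve to the same pair `("Ce", "snt-1")`.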
Alleles and mutations
Every mutation has a unique
designation. Mutations are given names consisting of one or two italicized
letters followed by an italicized Arabic number, e.g., e61 or mn138 or st5.
The letter prefix refers to the laboratory of isolation, as registered with the
Caenorhabditis Genetics Center. There are currently more than 350 registered
laboratories. For example, e refers originally to the MRC Laboratory of Molecular Biology (Cambridge, U.K.) and currently to the laboratory of J. Hodgkin (University of Oxford), and st refers to the laboratory of R.H. Waterston (Washington University, St. Louis, MO).
When gene and mutation names are
used together, the mutation name is included in parentheses after the gene
name, e.g., dpy-5(e61), let-37(mn138). When unambiguous (e.g., if only one mutation
is known for a given gene or if all work on a gene described in a publication
used a single mutation cited in a Methods section), gene names are used in
preference to mutation names (let-37 rather than mn138 or let-37(mn138)).
Optional suffixes indicating characteristics of a mutation can follow a mutation name. These are usually two nonitalicized letters, e.g., hc17ts, where ts stands for temperature-sensitive, or pk15te, where te stands for transposon-excision.
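The structure of a mutation name (laboratory prefix, number, optional suffix) can be sketched as a small parser. This is an illustration under the rules stated above only: the pattern assumes a prefix of one or two lowercase letters and a two-letter suffix, and `parse_allele` is a hypothetical name.

```python
import re

# Assumed pattern: one- or two-letter laboratory prefix, an Arabic number,
# and an optional two-letter descriptive suffix such as "ts" or "te".
ALLELE = re.compile(r"([a-z]{1,2})([1-9][0-9]*)([a-z]{2})?")


def parse_allele(name):
    """Return (lab_prefix, number, suffix) or None if the name does not fit."""
    match = ALLELE.fullmatch(name)
    if match is None:
        return None
    prefix, number, suffix = match.groups()
    return prefix, int(number), suffix
```

For instance, `parse_allele("hc17ts")` returns `("hc", 17, "ts")`, and `parse_allele("e61")` returns `("e", 61, None)`.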
Mutations created by
in vitro mutagenesis should receive standard allele
names. For cases where a pre-existing genomic mutation is re-created by in
vitro mutagenesis, it is still
desirable to give the new mutation a new name.
The wild-type allele of a gene is
defined as that present in the Bristol N2 strain, stored frozen at the CGC and
other locations. Wild-type alleles can be designated by a plus sign immediately
after the gene name, dpy-5+,
or, more commonly, by including the plus sign in parentheses,
dpy-5(+).
There is no special nomenclature for modifier mutations. Many extragenic suppressor loci are called sup (40 loci defined so far, with a wide variety of properties and mechanisms). An increasing number of more specific modifier gene classes have been established, such as smu (suppressor of mec and unc), smg (suppressor with morphogenetic effect on genitalia) and sel (suppressor/enhancer of lin-12).
Intragenic suppressors or
modifiers are indicated by adding a second mutation name within parentheses;
for example, unc-17(e245e2608)
is an intragenic partial revertant of unc-17(e245).
Mutations known to be chromosomal
rearrangements, rather than intragenic lesions, are named differently, as
described below.
DNA sequences
There are no specific
recommendations for designating cloned sequences that are not similar to known
genes. Most genomic clones have been provided by the C.
elegans mapping/sequencing consortium (based at the
Sanger Centre, Cambridge, UK, and the Genome Sequencing Center, St. Louis,
USA). Cosmid clones generated by the consortium are named on the basis of the
vector, either pJB8 (initial letters B, C, D, E, R, M, ZC) or a Lorist vector
(initial letters K, T, W, F, ZK). Phage clones (in Lambda 2001) are identified
by the initial letters A, ZL, YSL.
YACs (yeast artificial chromosome clones) are identified by the initial letter Y, e.g., Y3D5. YAC subsequences may be given names derived from the initial YAC name. Example: subsequences derived from the YAC Y47H9 have been called Y47H9A, Y47H9B, Y47H9C. Note that physical clones corresponding to these subsequences are not available.
Genomic DNA clones that have not
been generated by the consortium are usually designated by the laboratory
strain designation (see below), a # symbol and an isolation number, e.g.,
MT#JAL6.
Sequences that are predicted to be genes from sequence data alone are initially named by the consortium on the basis of the sequenced cosmid, plus a number. For example, the genes predicted for the cosmid T05G3 are called T05G3.1, T05G3.2, etc. (numbered in arbitrary order of definition). Such names can be superseded by standard 3-letter names when this becomes appropriate. Thus, R13F6.3 has been given the name srg-12 (for serpentine receptor, class gamma).
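A predicted gene name of this kind can be split back into its clone and its number. The sketch below assumes only the "clone name, dot, number" shape described above; `split_predicted_gene` is a hypothetical helper, and it does not check that the clone actually exists.

```python
def split_predicted_gene(name):
    """Split a name such as 'T05G3.1' into (clone, number), or return None."""
    clone, dot, number = name.rpartition(".")
    # Require a clone name, a dot, and a plain ASCII decimal number.
    if not dot or not clone or not (number.isascii() and number.isdigit()):
        return None
    return clone, int(number)
```

Here `split_predicted_gene("R13F6.3")` returns `("R13F6", 3)`, whereas a standard three-letter name such as `srg-12` contains no dot and returns `None`.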
EST (Expressed Sequence Tag)
clones have received names with prefixes such as cm and
yk.
RFLPs and SNPs
Polymorphic sites, which are mostly RFLPs (restriction fragment length polymorphisms) or SNPs (single nucleotide polymorphisms), are designated by an italic letter P and an italic number, preceded by the allele prefix for the laboratory responsible for identifying the site.
Examples: stP17 and stP196 are RFLPs identified in the laboratory of R. H. Waterston; amP9 and amP15 are SNPs identified in the laboratory of K. Kornfeld.
Transgenes
Transformation of C.
elegans with exogenous DNA by microinjection usually
leads to the formation of a transmissible extrachromosomal array containing
many copies of the introduced DNA.
Sometimes chromosomal integration of the introduced DNA can occur, or an
existing extrachromosomal array can be integrated after irradiation of a
transgenic line.
Extrachromosomal arrays are given
italicized names consisting of the laboratory allele prefix, the two
letters Ex, and a number.
Integrated transgenes are
designated by italicized names consisting of the laboratory allele prefix, the
two letters Is, and a
number.
Both Ex and Is can optionally be followed by genotypic or molecular
information describing the transgene, in square brackets. For
example, eEx3 or eIs2 or stEx5[sup-7(st5) unc-22(+)].
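The transgene naming pattern can be illustrated with a short parser. This is a sketch under the rules above, not a reference implementation: the one- or two-letter prefix length is an assumption, `parse_transgene` is a hypothetical name, and the bracketed description is kept as free text.

```python
import re

# Assumed pattern: laboratory allele prefix, "Ex" or "Is", a number, and an
# optional description in square brackets.
TRANSGENE = re.compile(r"([a-z]{1,2})(Ex|Is)([1-9][0-9]*)(?:\[(.*)\])?")


def parse_transgene(name):
    """Return a dict describing the transgene name, or None if it does not fit."""
    match = TRANSGENE.fullmatch(name)
    if match is None:
        return None
    prefix, kind, number, description = match.groups()
    return {
        "lab_prefix": prefix,
        "integrated": kind == "Is",   # Ex = extrachromosomal array
        "number": int(number),
        "description": description,   # None when no brackets are given
    }
```

Applied to `stEx5[sup-7(st5) unc-22(+)]`, the sketch reports laboratory prefix `st`, a non-integrated array numbered 5, and the description `sup-7(st5) unc-22(+)`.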
Gene fusions incorporated in
transgenes that consist of a C. elegans gene or part thereof fused to a reporter such as lacZ or
GFP are indicated by the C. elegans gene name
followed by two colons and the reporter, all italicized: pes-1::lacZ, mab-9::GFP.
Genotypes
Mutants carrying more than one
mutation are designated by sequentially listing mutant genes or mutations
according to the left-right (= up-down) order on the genetic map. Different
linkage groups are separated by a semicolon and given in the order
I, II, III, IV, V, X, f. I-V are
the five autosomes, X is the
X chromosome, and
f refers to free duplications or chromosomal
fragments. For example: dpy-5(e61) I; bli-2(e768) II; unc-32(e189)
III.
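The ordering rule for linkage groups can be sketched as follows. The helper `format_genotype` is hypothetical; it assumes the caller already lists the markers within each linkage group in left-to-right map order, and it appends the group label after every group (including f), which is an assumption extrapolated from the example above.

```python
# Order of linkage groups as given in the text.
LINKAGE_GROUP_ORDER = ["I", "II", "III", "IV", "V", "X", "f"]


def format_genotype(markers_by_group):
    """Join per-group marker lists into one genotype string.

    markers_by_group maps a linkage group label to a list of markers that
    is already in left-to-right map order (this sketch does not sort them).
    """
    unknown = set(markers_by_group) - set(LINKAGE_GROUP_ORDER)
    if unknown:
        raise ValueError(f"unknown linkage group(s): {sorted(unknown)}")
    parts = []
    for group in LINKAGE_GROUP_ORDER:
        markers = markers_by_group.get(group)
        if markers:
            parts.append(" ".join(markers) + " " + group)
    return "; ".join(parts)
```

Calling it with markers on III, I and II in any input order produces `dpy-5(e61) I; bli-2(e768) II; unc-32(e189) III`.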
Heterozygotes, with allelic differences between chromosomes, are designated by separating mutations on the two homologous chromosomes with a slash. Where unambiguous, wild-type alleles can be designated by a plus sign alone, or even omitted. For example, dpy-5(e61) unc-13(+)/dpy-5(+) unc-13(e51) I can also be written dpy-5 +/+ unc-13 or dpy-5/unc-13.
Transposons
C. elegans transposons are called Tc1, Tc2, etc., where
each number represents a different family. Transposon names are not italicized
except when included in a genotype. Transposon insertions in genes are
indicated by adding ::Tc to
the relevant mutation name, as an optional descriptor. Thus, a mutation of the
gene unc-54 called
r293 is a Tc1 insertion, and can therefore be
written unc-54(r293::Tc1).
Duplications (Dp), deficiencies (Df), inversions (In) and translocations (T) are known in C. elegans cytogenetics; these are given italicized names consisting
of the laboratory mutation prefix, the relevant abbreviation, and a number,
optionally followed by the affected linkage groups in parentheses
(e.g., eT1(III;V), mnDp5(X;f), where f
indicates a free duplication). Chromosomal balancers of unknown structure can
be designated using the abbreviation C, e.g., mnC1(II).
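Rearrangement names follow a regular pattern that a short parser can illustrate. The sketch below is built only from the abbreviations listed above; the one- or two-letter prefix length and the name `parse_rearrangement` are assumptions.

```python
import re

# Abbreviations taken from the text above.
KINDS = {
    "Dp": "duplication",
    "Df": "deficiency",
    "In": "inversion",
    "T": "translocation",
    "C": "balancer of unknown structure",
}

# Assumed pattern: laboratory prefix, abbreviation, number, and optional
# linkage groups in parentheses separated by semicolons.
REARRANGEMENT = re.compile(
    r"([a-z]{1,2})(Dp|Df|In|T|C)([1-9][0-9]*)(?:\(([^()]*)\))?"
)


def parse_rearrangement(name):
    """Return (lab_prefix, kind, number, linkage_groups) or None."""
    match = REARRANGEMENT.fullmatch(name)
    if match is None:
        return None
    prefix, abbreviation, number, groups = match.groups()
    linkage_groups = groups.split(";") if groups else []
    return prefix, KINDS[abbreviation], int(number), linkage_groups
```

For example, `parse_rearrangement("mnDp5(X;f)")` returns `("mn", "duplication", 5, ["X", "f"])`.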
The mitochondrial genome
The mitochondrial genotype of a worm can be expressed using the standard nomenclature, using M as the abbreviation for the mitochondrial linkage group. The mitochondrial genotype is written as the last element in the genotype, following the nuclear genotype. Heteroplasmic combinations, where mitochondria of different genotypes co-exist in the same cytoplasm, can be expressed using a double forward slash, //. For example, "uaDf5//+".
Proteins
The protein product of a gene can
be referred to by the relevant gene name, written in non-italic capitals, e.g.,
the protein encoded by unc-13
can be called UNC-13. Where more than one protein product is predicted for a
gene (usually as a result of alternative message processing), the different
proteins are distinguished by additional capital letters, e.g., TRA-1A, TRA-1B.
Mutant protein products can be
named by the missense change, for example a mutant TRA-1A protein with a Pro to
Leu change at codon 79 would be written: TRA-1A (P79L).
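The protein naming convention amounts to simple string handling, sketched below. Both helper names are hypothetical, and the single-letter amino acid codes are passed in by the caller rather than validated.

```python
def protein_name(gene_name, isoform=""):
    """Write a gene name in capitals, with an optional isoform letter."""
    return gene_name.upper() + isoform.upper()


def mutant_protein(protein, wild_type_residue, position, mutant_residue):
    """Append a missense change, e.g. Pro to Leu at 79, as '(P79L)'."""
    return f"{protein} ({wild_type_residue}{position}{mutant_residue})"
```

Here `protein_name("unc-13")` gives `UNC-13`, and `mutant_protein(protein_name("tra-1", "A"), "P", 79, "L")` gives `TRA-1A (P79L)`.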
RNA molecules
Messenger RNA species can be
written by using the protein product as a descriptor, for example TRA-1A mRNA,
TRA-1B mRNA, in order to allow distinction between different splice
variants.
Non-coding RNA species can be
written using the gene name as a descriptor, for example lin-4
RNA.
Small RNA species derived from mir genes
(micro-RNAs) can be written miR-,
followed by a number corresponding to the mir gene.
Example: miR-2 for the RNA
derived from mir-2.
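The mapping from a mir gene name to its micro-RNA name is mechanical, as this small sketch shows. The function name `micro_rna_name` is hypothetical, and the check assumes the plain "mir-number" form described above.

```python
def micro_rna_name(gene_name):
    """Convert a mir gene name such as 'mir-2' into the RNA name 'miR-2'."""
    gene_class, hyphen, number = gene_name.partition("-")
    if gene_class != "mir" or not hyphen or not (number.isascii() and number.isdigit()):
        raise ValueError(f"not a mir gene name: {gene_name!r}")
    return f"miR-{number}"
```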
Phenotypes
Phenotypic characteristics can be
described in words, e.g., dumpy animals or uncoordinated animals. If more
convenient, a nonitalicized three-letter abbreviation, which usually
corresponds to a gene name, may be used. The first letter of a phenotypic
abbreviation is capitalized, e.g., Unc for uncoordinated, Dpy for dumpy. If
necessary to distinguish among related but distinguishable phenotypes, the
relevant gene number can be added, e.g., Unc-4 and Unc-13 to differentiate the
distinct phenotypes produced by mutations in the two genes
unc-4 and unc-13. Abbreviations that do not correspond to gene names can
also be used, e.g., Muv for multiple vulval development.
A common and accepted convention, when comparing a mutant with the wild-type, is to use the prefix non- to refer to the wild-type phenotypes, for example, non-Lin (= wild type cell lineage) or Dpy non-Unc (= wild type with respect to movement, but dumpy with respect to body shape).
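The phenotype abbreviations described in this section can be derived from gene names, as sketched below. The helper names are hypothetical, and the sketch covers only abbreviations that correspond to a gene name (not free-standing ones such as Muv).

```python
def phenotype_abbreviation(gene_name, include_number=False):
    """Derive a phenotype abbreviation such as 'Unc' or 'Unc-4' from a gene name."""
    gene_class, _, number = gene_name.partition("-")
    abbreviation = gene_class.capitalize()
    if include_number and number:
        return f"{abbreviation}-{number}"
    return abbreviation


def wild_type_for(abbreviation):
    """Prefix 'non-' to denote the wild-type phenotype, e.g. 'non-Unc'."""
    return "non-" + abbreviation
```

For example, `phenotype_abbreviation("unc-4", include_number=True)` gives `Unc-4`, and `wild_type_for(phenotype_abbreviation("lin-12"))` gives `non-Lin`.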
RNAi phenotypes
Animals in which an endogenous
gene has been down-regulated by RNA interference (RNAi), after exposure to
double-stranded RNA corresponding to that gene, can be referred to as mutants,
using italicized RNAi as the
mutation name. Example:
mog-4(RNAi).
Phenotypes induced by RNAi can be
named using conventional mutant phenotype descriptors, such as Unc, Muv,
Fem. For high throughput
RNAi screens, which may detect only conspicuous phenotypes, a limited set of
about forty standard phenotype descriptors has been established (see list on
WormBase).
Strains
A strain is a set of individuals
of a particular genotype with the capacity to produce more individuals of the
same genotype. Strains are given nonitalicized names consisting of two
uppercase letters followed by a number. The strain letter prefixes refer to the
laboratory of origin and are distinct from the mutation letter
prefixes.
Examples: CB1833 is a strain of
genotype dpy-5(e61) unc-13(e51), originally constructed by S. Brenner at the MRC Laboratory of
Molecular Biology (strain prefix CB, allele prefix e),
and MT688 is a strain of genotype
unc-32(e189)
+/+ lin-12(n137) III; him-5(e1467) V, constructed in the laboratory of H.R. Horvitz at M.I.T.
(strain prefix MT, allele prefix n).
Strain prefixes are listed
at:
http://biosci.umn.edu/CGC/Nomenclature/code.htm.
Some 3-letter laboratory
designations are also in use, mainly to refer to strains of nematode species
other than C. elegans.
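Strain names can be split into laboratory prefix and number with a pattern like the one below. This is a sketch only: it accepts two or three uppercase letters to cover the three-letter designations mentioned above, and `parse_strain` is a hypothetical name. It does not check the prefix against the registered list.

```python
import re

# Assumed pattern: two (or, less commonly, three) uppercase letters and a number.
STRAIN = re.compile(r"([A-Z]{2,3})([1-9][0-9]*)")


def parse_strain(name):
    """Return (lab_prefix, number) for a strain name, or None if it does not fit."""
    match = STRAIN.fullmatch(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

For instance, `parse_strain("CB1833")` returns `("CB", 1833)`, while an allele name such as `e61` is rejected because its prefix is lowercase.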
Strains can and should be preserved as frozen stocks at -70 °C or ideally in liquid nitrogen, in order to ensure long-term maintenance and to avoid drift or accumulation of modifier mutations.
Sources
All genetic data for C.
elegans are summarized in the
ACeDB interactive database (Eeckman and Durbin, 1995 Methods in Cell
Biology 48: 584-605) and WormBase.
Queries on recommended nomenclature for C. elegans should be addressed to:
J. Hodgkin
Genetics Unit, Department of Biochemistry, University of Oxford, Oxford OX1 3QU, UK
Tel: +44 1865 275317
Fax: +44 1865 275318
email: jah@bioch.ox.ac.uk
or
R. K. Herman
Caenorhabditis Genetics Center, University of Minnesota, 1445 Gortner Avenue, St. Paul, MN 55108, USA
Tel: +1 612 624 6203
Fax: +1 612 625 5754
email: bob-h@biosci.cbs.umn.edu