Flu, genes, clades and H3N2

We’ve talked earlier about clades, a term that relates to how we can put influenza (flu) viruses that sit under the broad title “A/H3N2” into subgroupings. Clades are a way of describing similar viruses, based on their genetic sequences. Let’s step through the morass of flu, genes, clades and H3N2.

Before we get into A/H3N2 clades though, let’s refresh our memories about genes and sequences. I’ve previously had a go at writing a little on DNA structure if you feel like more reading.

Genes are made of A, C, G and T

You’ve probably heard the DNA components that make up genes described as “building blocks” or bases. These bases are called adenine (A), guanine (G), cytosine (C) and thymine (T).[1] Each base is chemically bound to a sugar molecule and a phosphate molecule. Altogether these 3 things are called a nucleotide.

Nucleotides make up genes and our genes (plus other stuff) comprise our genome. There are about 6.4 billion nucleotides in the human genome, existing in 3.2 billion nucleotide pairs.[2] Nucleotide pairs exist in a sequence – literally, one after another, after another …in sequence. The order of the sequence is responsible for what proteins our genes make and how their manufacture is regulated.

Our genes include only 1-2% of the 3.2 billion nucleotide base pairs.[3] That leaves a lot of genetic dark matter. Humans differ genetically by only about 0.1% from each other and about 1.2% (comparing the same aspects) from a chimpanzee.[4]

How to view genetic differences

When we talk about nucleotide sequencing, we’re using lab tools to uncover the order of the sequence. In some instances, we know what changes in the sequence mean for the protein product, virus, disease etc., and in other instances – we don’t.

An example nucleotide sequence from a virus-positive patient’s sample. The lab method we use produces the sequence through a PCR-based process combined with fluorescent signal generation. The end result is a string of As, Cs, Gs and Ts in an order that is unique to the virus being analysed. This is conversationally called “sequencing”. It’s the first step needed to understand flu, genes, clades and H3N2.

These sequences can then be used to view differences and similarities between viruses in distinct patients, or over time and across space.

Aligning the sequences

First, we align the sequences (see below). This post is about flu viruses so let’s talk about flu virus hemagglutinin (HA) gene sequences retrieved from infected humans. To align lots of different sequences in order to see if they are the same or different, we use computer software which takes each sequence (a row in the example below) and matches it to every other sequence chosen to be in the alignment.

A nucleotide sequence alignment of 12 partial hemagglutinin (HA) gene sequences from A/H3N2 viruses detected in ill South Australians. Sequences were aligned using software (Geneious v8.1.9). The flu virus called A/South_Australia/84/2019 has been selected (pale yellow highlight) and all the other viruses are being compared to it. Each little black line represents a difference between a nucleotide in one of the other sequences and the highlighted sequence. These sequences were submitted by the World Health Organization Collaborating Centre for Reference and Research on Influenza, Australia, and are housed on the GISAID EpiFlu database.

The software looks through the nucleotides in each sequence and tries to match them up as closely as possible to those in the other sequences. There are different ways that the software can do this. Sometimes one or more sequences will have different nucleotides at a position compared to the average – or consensus – shared by most other sequences.
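To make the idea of a consensus a bit more concrete, here is a minimal Python sketch. It assumes the sequences have already been aligned to the same length (the hard part, which the alignment software does for us), and it uses made-up toy sequences and virus names rather than real GISAID data. It is not how Geneious works internally, just an illustration of "most common nucleotide per position" and "where does this sequence differ".

```python
from collections import Counter

# Toy, already-aligned sequences. Names and sequences are hypothetical.
alignment = {
    "virus_A": "ATGAAGACTATC",
    "virus_B": "ATGAAGACTATC",
    "virus_C": "ATGAAAACTATC",
    "virus_D": "ATGAAGACCATC",
}

def consensus(seqs):
    """Return the most common nucleotide in each column of the alignment."""
    columns = zip(*seqs)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

def differences(reference, other):
    """Return the 1-based positions where `other` differs from `reference`."""
    return [
        pos
        for pos, (ref_nt, other_nt) in enumerate(zip(reference, other), start=1)
        if ref_nt != other_nt
    ]

cons = consensus(alignment.values())
for name, seq in alignment.items():
    # An empty list means the sequence matches the consensus everywhere.
    print(name, differences(cons, seq))
```

In this toy set, virus_C and virus_D each differ from the consensus at a single position, which is the kind of thing each little black line in an alignment view is flagging.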

An alignment shows up these similarities and differences. But that can be hard to fully grasp by looking at the alignment alone. So we present the alignment graphically.

A tree from the forest

From a good alignment, we can make a good phylogenetic tree. These are diagrams that represent evolutionary relationships among the sequences and their donor organisms, viruses in this case.

Phylogenies are the evolutionary history of species.[5] These can graphically show relationships between those aligned sequences, including which ones are highly similar to others and which differ.

The pattern in which the branches connect represents our understanding of how the species in the tree evolved from a series of common ancestors. Each branch point (also called an internal node; yellow dots highlight 2 examples) represents a divergence event or splitting apart of a single group into descendant groups. Descendants that all branch from a single common node can be said to belong to the same clade.
Source: Khan Academy.[5]

Below I made a quick phylogenetic tree from the alignment in the earlier example. It shows the relationship between just these different A/H3N2 HA gene segments collected from infected patients in 2015, 2017, 2018 and 2019. An example of a clade of viruses is highlighted in orange. All the sequences in that clade branch off a common node.

Neighbor-Joining tree made using A/South_Australia/85/2018 as an outgroup (top sequence). Pink circles mark nodes. Numbers indicate bootstrap support for the pattern of branching that flows from that node. Orange branches highlight an example of a clade which consists of a group of virus sequences that all branch from one common ancestor at the node marked with “100”. Interestingly, this clade is made up of all the sequences from 2019.
Sequences were submitted by the WHO Collaborating Centre for Reference and Research on Influenza, Australia, and are housed on the GISAID EpiFlu database.[6]
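If it helps to think of a clade as a data structure, here is a small Python sketch. The tree below is a hypothetical toy written as nested tuples (a leaf is a virus name, an internal node is a tuple of its children). It is not the real Neighbor-Joining tree and the virus names are made up. The only point is that a clade is simply "every leaf that hangs off one node".

```python
# Hypothetical toy tree: strings are leaves, tuples are internal nodes.
tree = (
    "virus_2015",
    (
        "virus_2017",
        (
            "virus_2018",
            ("virus_2019_a", ("virus_2019_b", "virus_2019_c")),
        ),
    ),
)

def leaves(node):
    """Collect every leaf name found under a node."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found

def clades(node):
    """Yield the list of leaves under each internal node, root first."""
    if isinstance(node, str):
        return  # a single leaf has no internal node of its own
    yield leaves(node)
    for child in node:
        yield from clades(child)

for clade in clades(tree):
    print(clade)
```

Each printed list is one clade. Clades are nested: the group holding only the three toy 2019 viruses sits inside bigger clades that also include the older viruses, all the way up to the root.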

Something to be aware of: trees can be made from any length of nucleotide sequence. It’s important to know something about what you are aligning so you have an idea of whether the patterns you see in your tree are meaningful. Very short sequences or very closely related sequences may still appear quite different from other sequences in a tree. But are those patterns meaningful in real life? Answering that takes expert interpretation by those who know about the virus, virus gene, or human/monkey/whatever-the-phylogeny. It’s important to define the problem, but also to use the right tests for your hypotheses.

How genetically different are H3N2 clades?

The last thing to look at, and the original point of this post, is how much the sequences – and by extension, the viruses (we presume) – in those H3N2 clades we keep reading about differ genetically from each other.

To get an idea, I grabbed all the Australia-wide H3N2 HA gene segments on the GISAID EpiFlu database [6] with sample collection dates between January 2015 and May 2019, submitted by the WHO Collaborating Centre for Reference and Research on Influenza, Australia.

This was a bit over 3,300 sequences. I aligned them and got my software (Geneious v8) to compare how similar each sequence was to each other one. This produced a big matrix. If you open that in Excel and make it coloured, part of it looks like the image below.
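For a feel of what that matrix holds, here is a minimal Python sketch of an all-against-all comparison. It assumes the sequences are already aligned to the same length and simply counts matching positions. The three toy sequences and their names are made up, and real software may treat gaps and ambiguous bases differently, so take it as an illustration rather than a description of what Geneious does.

```python
from itertools import combinations

def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions at which two sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

def identity_matrix(seqs):
    """Build a symmetric name -> name -> percent identity lookup."""
    names = list(seqs)
    matrix = {name: {name: 100.0} for name in names}
    for first, second in combinations(names, 2):
        pid = percent_identity(seqs[first], seqs[second])
        matrix[first][second] = pid
        matrix[second][first] = pid
    return matrix

# Hypothetical 20-nucleotide toy sequences, not real HA data.
toy = {
    "virus_1": "ATGAAGACTATCATTGCTTT",
    "virus_2": "ATGAAGACTATCATTGCTTC",
    "virus_3": "ATGAAAACTATCATTGCTTC",
}

matrix = identity_matrix(toy)
for name, row in matrix.items():
    print(name, [row[other] for other in toy])
```

With around 3,300 sequences, comparing each to every other one comes to roughly 5.4 million pairs, which is why the real matrix is big.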

A small part of a big matrix comparing – in percent and coloured up using Microsoft Excel – how similar the HA nucleotide sequences of different Australian influenza viruses from 2015 to 2019 are to each other. Groups of similar viruses show up as more red, and those sharing less similarity are more green. These sequences were submitted by the WHO Collaborating Centre for Reference and Research on Influenza, Australia, and are housed on the GISAID EpiFlu database.

To be a bit more data-centric, all of the 1,650 nucleotide long HA gene sequences share between 95% and 100% identity with each other. If we look at that in terms of the number of nucleotide differences instead of the percentage difference, there are between 0 and 77 nucleotide changes.
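Swapping between "percent identity" and "number of differences" is simple arithmetic. Here is a tiny Python sketch using the figures above (a 1,650-nucleotide sequence with up to 77 differences); the function name is just one I made up for the example.

```python
def identity_from_differences(length, n_differences):
    """Convert a count of differing positions into percent identity."""
    return 100.0 * (length - n_differences) / length

# 77 differences across 1,650 nucleotides is the most divergent pair.
print(round(identity_from_differences(1650, 77), 1))  # 95.3
# No differences at all means identical sequences.
print(identity_from_differences(1650, 0))  # 100.0
```

So the most different pair still shares a little over 95% of its nucleotides.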

If we look at just one clade, 3c2.A A1b/131K (the one dominating in Australia in 2019), we see 98% to 100% nucleotide identity. So the HA genes within a clade are a bit more similar than those between clades. Which you’d expect, otherwise they wouldn’t fall into different clades!

How does all that translate?

That’s 👆 comedy gold. Because genes make proteins through a process called translation..and..so…okay. I’m sad.

Anyhoo. This next bit on flu, genes, clades and H3N2 is for extra credit.

If you take those nucleotide sequences and decode them using software – you end up with 3,300 HA protein sequences. This time we have an ordered sequence of amino acids, not nucleotides; the order decides what the protein is, how it’s modified and what it does. I realigned these amino acid sequences, made a new matrix and this time looked at the amino acid sequence identities.
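The "decoding" step reads the nucleotides three at a time (a codon) and looks up the amino acid each codon stands for, which is why a 1,650-nucleotide gene gives a protein of about 550 amino acids. Here is a minimal Python sketch using the standard genetic code. It assumes the sequence starts in the right reading frame, and the short test sequences are toys made up for the example, not real HA data.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nucleotides):
    """Translate an in-frame nucleotide sequence into one-letter amino acids."""
    protein = []
    usable_length = len(nucleotides) - len(nucleotides) % 3  # ignore a partial last codon
    for start in range(0, usable_length, 3):
        codon = nucleotides[start:start + 3].upper()
        amino_acid = CODON_TABLE.get(codon, "X")  # X marks an unrecognised codon
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGAAGACT"))  # MKT
print(translate("ATGAAAACT"))  # MKT (AAG -> AAA still codes for K)
print(translate("ATGAGGACT"))  # MRT (AAG -> AGG swaps K for R)
```

The second and third examples show why nucleotide and amino acid differences don't track each other exactly: some nucleotide changes leave the protein untouched, while others swap one amino acid for another.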

The HA protein sequences – which are about 550 amino acids long – were also between 95% and 100% identical. If we look at the number of amino acids that differ instead of the percentage similarity, there are 0 to 30 differences between every possible pairing of those Australian strains across four and a half years.

Little total change, potentially big impact

We’re not talking about a big percentage difference – at most 5% of the nucleotides – and yet in this gene, these differences are enough to represent H3N2 viruses that have escaped from vaccine protection. And that’s despite the vaccine component used to convey protection sitting within that same range of nucleotide identity. Little changes can have big impacts. Just like little viruses can make us bigly sick I guess.

Because these changes occur in regions of the viral HA protein that are important for the binding of antibodies we make against the viruses, little changes do have a big effect. So even though most of the HA gene doesn’t change between all these viruses, enough has that we can visualise the difference using our phylogenetic tree. And we know that these differences can have important meanings.

If our immune response can’t quickly shut down an infection, then we get sick. That usually means we incubate a higher viral load (the amount of virus in our body) which can mean more virus coughed out into the air and onto surfaces infecting more people per infected person….and before you know it, you have yourself a seasonal flu epidemic.

Caveats and follow-ups:

HA is only one of 8 gene segments in the FluA virus genome. Although this post focusses on the sequences of Australian strains, I also went back to GISAID and added sequences from China, the United States and Vietnam spanning the same 2015-2019 period. I cut the 18,900 HA gene sequences down to about 8,500 (more manageable for my home computer) and found that they were between 94% and 100% identical to each other, which amounted to 0 to 91 nucleotide differences.


  1. What is DNA
  2. A Real Human Genome is 6.4 Billion Letters (base pairs) Long — Not 3.2 Billion
  3. The 99 Percent… of the Human Genome
  4. Genetics
  5. Khan Academy: Phylogenetic trees https://www.khanacademy.org/science/biology/her/tree-of-life/a/phylogenetic-trees
  6. The GISAID Initiative|Global Initiative on Sharing All Influenza Data
