We’ve talked earlier about clades, a term that describes how we can put the influenza (flu) viruses that sit under the broad title “A/H3N2” into subgroupings. Clades are a way of describing similar viruses, based on their genetic sequences. Let’s step through the morass of flu, genes, clades and H3N2.
Before we get into A/H3N2 clades though, let’s
Genes are made of A, C, G and T
You’ve probably heard the DNA components that make up genes described as “building blocks” or bases. These bases are called adenine (A), guanine (G), cytosine (C) and thymine (T). Each base is chemically bound to a sugar molecule and a phosphate molecule. Altogether these 3 things are called a nucleotide.
Our genes account for only 1-2% of the 3.2 billion nucleotide base pairs in the human genome. That leaves a lot of genetic dark matter. Humans differ genetically by only about 0.1% from each other and about 1.2% (comparing the same aspects) from a chimpanzee.
How to view genetic differences
When we talk about nucleotide sequencing, we’re using lab tools to uncover the order of the sequence. In some instances, we know what changes in the sequence mean for the protein product, virus, disease etc., and in other instances – we don’t.
These sequences can then be used to view differences and similarities between viruses in distinct patients, or over time and across space.
Aligning the sequences
First, we align the sequences (see below). This post is about flu viruses so let’s talk about flu virus hemagglutinin (HA) gene sequences retrieved from infected humans. To align lots of different sequences in order to see if they are the same or different, we use computer software which takes each sequence (a row in the example below) and matches it to every other sequence chosen to be in the alignment.
The software looks through the nucleotides in each sequence and tries to match them up as closely as possible to those in the other sequences. There are different ways that the software can do this. Sometimes one or more sequences will have different nucleotides at a position compared to the average – or consensus – shared by most other sequences.
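To make the idea of a consensus concrete, here is a minimal Python sketch. It assumes the sequences are already aligned and the same length; the four short sequences and their names are made up for illustration and are not real HA data. Real alignment software also has to deal with gaps and insertions, which this ignores.

```python
from collections import Counter

# Hypothetical, already-aligned toy sequences (equal length); not real HA data.
aligned = {
    "virus_2015": "ATGAAGACTATC",
    "virus_2017": "ATGAAGACTATC",
    "virus_2018": "ATGAAAACTATC",
    "virus_2019": "ATGAAGACCATC",
}

def consensus(seqs):
    """Return the most common nucleotide at each column of the alignment."""
    columns = zip(*seqs)  # one tuple of nucleotides per alignment position
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

def differences_from(reference, seq):
    """List (position, reference_base, observed_base); positions are 1-based."""
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, seq), start=1)
        if r != s
    ]

cons = consensus(list(aligned.values()))
diffs = {name: differences_from(cons, seq) for name, seq in aligned.items()}

for name, changes in diffs.items():
    print(name, changes or "matches consensus")
```

In this toy set, two sequences match the consensus exactly and the other two each differ at a single position.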
An alignment shows up these similarities and differences. But that can be hard to fully grasp by looking at the alignment alone. So we present the alignment graphically.
A tree from the forest
From a good alignment, we can make a good phylogenetic tree: a diagram that represents the evolutionary relationships among the sequences and their donor organisms (viruses, in this case).
Phylogenies are the evolutionary history of species. These can graphically show relationships
Below I made a quick phylogenetic tree from the alignment in the earlier example. It shows the relationship between just these different A/H3N2 HA gene segments collected from infected patients in 2015, 2017, 2018 and 2019. An example of a clade of viruses is highlighted in orange. All the sequences branch off a common node.
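If it helps to think of a clade in code, here is a small Python sketch. It assumes a toy tree written as nested tuples, where a tuple is a node and a string is a tip (one sequenced virus). The tip names and the tree shape are invented for illustration; this is not the tree from the figure, and it is not how tree-building software stores its trees.

```python
# Hypothetical toy tree: a tuple is an internal node, a string is a tip.
tree = (
    "A/2015",
    (
        "A/2017",
        (("A/2018a", "A/2018b"), "A/2019"),
    ),
)

def tips(node):
    """Return every tip name that descends from this node."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in tips(child)]

def clades(node):
    """Yield the set of tips under each internal node; each set is one clade."""
    if isinstance(node, str):
        return
    yield frozenset(tips(node))
    for child in node:
        yield from clades(child)

all_clades = list(clades(tree))
for clade in all_clades:
    print(sorted(clade))
```

Every set printed is a clade because all of its members branch off one common node. A grouping such as A/2017 plus A/2018a alone is not a clade here, since no single node has exactly those two tips beneath it.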
Something to be aware of: trees can be made from any length of nucleotide sequence. It’s important to know something about what you are aligning so you have an idea of whether the patterns you see in your tree are meaningful. Very short sequences or very closely related sequences may still appear quite different from other sequences in a tree. But are those patterns meaningful in real life? Answering that takes expert interpretation by those who know about the virus, virus gene, or human/monkey/whatever-the-phylogeny. It’s important to define the problem, but also to use the right tests for your hypotheses.
How genetically different are H3N2 clades?
The last thing to do, and the original point of this post, is to look at how much the sequences – and by extension, the viruses (we presume) – in those H3N2 clades we keep reading about differ genetically from each other.
To get an idea, I grabbed all the Australia-wide H3N2 HA gene segments on the GISAID EpiFlu database with sample collection dates between January 2015 and May 2019, submitted by the WHO Collaborating Centre for Reference and Research on Influenza, Australia.
This was a bit over 3,300 sequences. I aligned them and got my software (Geneious v8) to compare how similar each sequence was to each other one. This produced a big matrix. If you open that in Excel and make it coloured, part of it looks like the image below.
To be a bit more data-centric, all of the 1,650 nucleotide long HA gene sequences share between 95% and 100% identity with each other. If we look at that in terms of the number of nucleotide differences instead of the percentage difference, there are between 0 and 77 nucleotide changes.
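For the curious, the sort of pairwise comparison behind a matrix like this can be sketched in a few lines of Python. This is a simplified illustration and not how Geneious calculates identity. It assumes aligned, equal-length sequences with no gaps, and the four 20-nucleotide sequences are made up. The final line just redoes the arithmetic on the numbers quoted above.

```python
from itertools import combinations

# Hypothetical aligned toy sequences; real HA gene segments are ~1,650 nt long.
aligned = {
    "virus_A": "ATGAAGACTATCATTGCTTT",
    "virus_B": "ATGAAGACTATCATTGCTTT",
    "virus_C": "ATGAAAACTATCATTGCTTT",
    "virus_D": "ATGAAAACCATCATTGCATT",
}

def count_differences(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def identity_from_differences(n_diff, length):
    """Percent identity given a difference count and a sequence length."""
    return 100.0 * (length - n_diff) / length

# One entry per unordered pair: (number of differences, percent identity).
matrix = {}
for (name_a, seq_a), (name_b, seq_b) in combinations(aligned.items(), 2):
    n_diff = count_differences(seq_a, seq_b)
    matrix[(name_a, name_b)] = (n_diff, identity_from_differences(n_diff, len(seq_a)))

for pair, (n_diff, pct) in matrix.items():
    print(pair, n_diff, f"{pct:.1f}%")

# The figures from the post: 77 differences across 1,650 nucleotides.
print(f"{identity_from_differences(77, 1650):.1f}%")
```

The last line shows that 77 differences in a 1,650-nucleotide gene works out to a little over 95% identity, which is the bottom of the range quoted above.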
If we look at just one clade, 3c2.A A1b/131K (the one dominating in Australia in 2019), we see 98% to 100% nucleotide identity. So the HA genes within a clade are a bit more similar than those between clades. Which you’d expect, otherwise they wouldn’t fall into different clades!
How does all that translate?
That’s 👆 comedy gold. Because genes make proteins through a p
Anyhoo. This next bit is for extra credit.
If you take those nucleotide sequences and decode them using software – you end up with 3,300 HA protein sequences. This time we have an ordered sequence of amino acids, not nucleotides; the order decides what the protein is, how it’s modified and what it does. I realigned these amino acid sequences, made a new matrix and this time looked at the amino acid sequence identities.
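The decoding step reads the nucleotides three at a time, with each triplet (a codon) standing for one amino acid. Here is a rough Python sketch using the standard genetic code. The short input sequences are toy examples, and real software handles reading frames and ambiguous bases far more carefully than this does.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nt_seq):
    """Translate from the first base, three at a time, stopping at a stop codon."""
    nt_seq = nt_seq.upper()
    usable = len(nt_seq) - len(nt_seq) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        aa = CODON_TABLE.get(nt_seq[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Toy examples: one nucleotide change each, relative to the first sequence.
print(translate("ATGAAGACTATC"))  # starting sequence
print(translate("ATGAAAACTATC"))  # AAG -> AAA, amino acid unchanged
print(translate("ATGAAGATTATC"))  # ACT -> ATT, amino acid changed
```

The second example shows why a count of nucleotide differences need not match the count of amino acid differences: some nucleotide changes leave the encoded amino acid as it was.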
The HA protein sequences – which are about 550 amino acids long – were also between 95% and 100% identical. If we look at the number of amino acids that differ instead of the percentage similarity, there are 0 to 30 differences between every possible pairing of those Australian strains across four and a half years.
Little total change, potentially big impact
We’re not talking about a big percentage difference – at most 5% of the nucleotides – and yet in this gene, these differences are enough to represent H3N2 viruses that have escaped from vaccine protection. And that’s despite the vaccine component used to convey protection sitting within that same range of nucleotide identity. Little changes can have big impacts. Just like little viruses can make us bigly sick I guess.
Because these changes occur in regions of the viral HA protein that are important for the binding of antibodies we make against the viruses, little changes do have a big effect. So even though most of the HA gene doesn’t change between all these viruses, enough does that we can visualise the difference using our phylogenetic tree. And we know that these differences can have important meanings.
If our immune response can’t quickly shut down an infection, then we get sick. That usually means we incubate a higher viral load (the amount of virus in our body), which can mean more virus coughed out into the air and onto surfaces, infecting more people per infected person... and before you know it, you have yourself a seasonal flu epidemic.
Caveats and follow-ups:
HA is only one of 8 gene segments in the FluA virus genome. Although this post focusses on the sequences of Australian strains, I also went back to GISAID and added sequences from China, the United States and Vietnam spanning the same 2015-2019 period. I cut the 18,900 HA gene sequences down to about 8,500 (more manageable for my home computer) and found that they were between 94% and 100% identical to each other, which amounted to 0 to 91 nucleotide differences.
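One likely reason a smaller set is kinder to a home computer: comparing every sequence with every other one means the number of comparisons grows roughly with the square of the number of sequences. A quick Python sketch of that arithmetic, using the sequence counts mentioned in this post (the function name is just for illustration):

```python
def pair_count(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Sequence counts taken from the post.
for n in (3300, 8500, 18900):
    print(f"{n:>6,} sequences -> {pair_count(n):>12,} pairwise comparisons")
```

Going from about 8,500 to 18,900 sequences roughly doubles the input but needs nearly five times as many comparisons.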
- What is DNA
- A Real Human Genome is 6.4 Billion Letters (base pairs) Long — Not 3.2 Billion
- The 99 Percent… of the Human Genome
- Khan Academy: Phylogenetic trees https://www.khanacademy.org/science/biology/her/tree-of-life/a/phylogenetic-trees
- The GISAID Initiative|Global Initiative on Sharing All Influenza Data