lassesen 10:59 pm on March 28, 2019

Data Science – Hunting for bacteria related to Symptoms

Some smart readers wrote saying that they are familiar with R but not the microbiome. This post is intended to kickstart them on a simple desired goal:

Which bacteria (or combination of bacteria) are associated with certain symptoms.

We have 240 symptoms and thousands of bacteria … oops, a big problem. We do not have a lot of microbiomes with symptoms list — so we are likely forced to deal with only the most common symptoms at present.

Information about bacteria is layered

A quick read on wikipedia on bacteria taxonomy may help. If you just grab the Taxonomy file, you will see the species counted in with it’s parent genus, who is counted in with family, etc… not the best scenario for analysis.

To simplify out matters, I have done extracts on each layer and dropped them at http://lassesen.com/ubiome/

So to load one layer, it’s just a few lines

library(tidyverse)
Species <-read_csv(“http://lassesen.com/ubiome/TaxonomyByspecies.csv”)
symptoms <-read_csv(“http://lassesen.com/ubiome/Symptoms.csv” )
samplesymptom <-read_csv(“http://lassesen.com/ubiome/SampleSymptom.csv”)
samplesWithTitle <- inner_join(samplesymptom, symptoms, by=”symptomId” )

Looking at this one, we find some 2,181 columns (species) and a sampleId (that matches that in symptoms).

From my prior post, you can see how we can get the most common symptom (symptomId=1,”General: Fatigue”) and then a list of sampleId with:

HasFatigue <-samplesWithTitle %>% filter(symptomId==1)

We need to only include observations that reported ANY symptoms, which is easy to do:

PeopleWithSymptoms <- samplesymptom %>% group_by(sampleId) %>% summarize(peopleWith=n()) %>% arrange(peopleWith)

Next we joins this with the level of taxonomy we wish to explore.

SpeciesWithSymptoms <-merge(PeopleWithSymptoms ,Species, by=”sampleId”)

We next put the sets together and set symptomId to 0 for those without fatigue. We have 47.5% of the records with Fatigue.

fulldata <- left_join(SpeciesWithSymptoms,HasFatigue,by=”sampleId”)
fulldata$symptomId =ifelse(is.na(fulldata$symptomId),0,fulldata$symptomId)
#We need to remove the symptom_name
drops <- c(“symptom_name”)
regdata <-fulldata[, !(names(fulldata) %in% drops)]

Picking the Model

I went with a simple linear model for the sake of illustration and toss everything at it!

test <- lm(symptomId ~ .,regdata)

We end up getting a rather long report back:

Where there is a + sign, it indicates increased risk of fatigue as the numbers of bacteria increases, -sign decreased risk of fatigue as the numbers of bacteria increases.

I can now redo it, picking just a few bacteria:

and then looking at the statistics, we see that there is no statistical significance for this relationship (we want to see a p-value < 0.001)

That’s it. No revelations — just an illustration of walking the path. All of the code for above is below.

library(tidyverse)
Species <-read_csv(“http://lassesen.com/ubiome/TaxonomyByspecies.csv”) symptoms <-read_csv(“http://lassesen.com/ubiome/Symptoms.csv” ) samplesymptom <-read_csv(“http://lassesen.com/ubiome/SampleSymptom.csv”) samplesWithTitle <- inner_join(samplesymptom, symptoms, by=”symptomId” ) HasFatigue <-samplesWithTitle %>% filter(symptomId==1)
PeopleWithSymptoms <- samplesymptom %>% group_by(sampleId) %>% summarize(peopleWith=n()) %>% arrange(peopleWith)
SpeciesWithSymptoms <-merge(PeopleWithSymptoms ,Species, by=”sampleId”)

fulldata <- left_join(SpeciesWithSymptoms,HasFatigue,by=”sampleId”)
fulldata$symptomId =ifelse(is.na(fulldata$symptomId),0,fulldata$symptomId)
drops <- c(“symptom_name”)
regdata <-fulldata[, !(names(fulldata) %in% drops)]
test <- lm(symptomId ~ .,regdata)

Doing a run with Phylum,

Phylum <-read_csv(“http://lassesen.com/ubiome/TaxonomyByPhylum.csv”) symptoms <-read_csv(“http://lassesen.com/ubiome/Symptoms.csv” ) samplesymptom <-read_csv(“http://lassesen.com/ubiome/SampleSymptom.csv”) samplesWithTitle <- inner_join(samplesymptom, symptoms, by=”symptomId” ) HasFatigue <-samplesWithTitle %>% filter(symptomId==1)
PeopleWithSymptoms <- samplesymptom %>% group_by(sampleId) %>% summarize(peopleWith=n()) %>% arrange(peopleWith)
PhylumWithSymptoms <-merge(PeopleWithSymptoms ,Phylum, by=”sampleId”)
fulldata <- left_join(PhylumWithSymptoms,HasFatigue,by=”sampleId”)
fulldata$symptomId =ifelse(is.na(fulldata$symptomId),0,fulldata$symptomId)
drops <- c(“symptom_name”)
regdata <-fulldata[, !(names(fulldata) %in% drops)]
test <- lm(symptomId ~ .,regdata)

We find statistical significance with 45% explained by the phylum.

Anton 8:05 am on February 2, 2022

Very interesting your whole website!! I am wondering: Are those relative abundances that you feed into the linear model? If so, then the analysis is not valid as bacterial abundance data is compositional. You could work with log ratios if you wanted to use classical statistical methods. Also, I was wondering if you correct for multiple testing in all your analyses? At least here it is not done. Finally: Even if you apply the statistics correctly, you cannot identify causal relationships. Bacteria may be lacking or maybe missing as a result of anything a patients does as a result of the symptom, e.g. such as taking a medication, avoiding a food or eating a certain food etc.. Without a causal inference approach, the best AI may not identify valid signals in the data. All it can do is identify statistical associations. Thus, a lacking enzyme in the microbiome of some diseased patient may very well not be the cause of the disease but can just as well be a consequence. Nevertheless, I find your work highly interesting and I will make a colleague of mine aware of your site who is very interested in this topic and might be of use for you given his knowledge on causal inference and downstream analyses of microbiome data….
- lassesen 4:58 am on February 20, 2022
  
  First, I do not use a linear model (tried it, lousy results). I have tried multiple models over the last few years and the current process have been incorporating classic fuzzy logic systems (did that professionally back in the early 1990’s) , Kaltoft-Moltrup algorithm to determine abnormalities. Correcting for multiple testing has challenges, mainly NSD (not sufficient data).
  
  I am interested if your colleague can get better results than what is shown here:
  https://microbiomeprescription.com/library/Associations?taxon=239935
  
  Using the data that is here:
  http://citizenscience.microbiomeprescription.com/
  - Anton 12:44 pm on February 23, 2022
    
    Right above this post you use a linear model. The lm function performs the simplest linear model there is.
    If you just generate as many random variables as you have species, 5% are significant just by pure chance. In the case of 1000 species, you will find 50 on average that are significant and those that are tend to have high effect sizes (type M error). Studies that do not apply some correction for that, are either not published or not taken seriously at all these days by leading experts. Given the number of tests you usually have to run, there are many methods to correct for that and it is well researched and straightfoward. In any case, without it, you might just as well relate random numbers to you outcome and you find similar results…I sent my colleague an email. I hope he finds the time…