Data Science: Tax Rank and Symptoms

This is another one of my series of post dealing with using R and the microbiome samples submitted at

and available for download at: .  The intent is show process (“boot strap”) individuals to explore more. There is a lot to be explore (and I have a day job and heavy family obligations).

Getting data into your R workspace

The code below will download and save the data to your local computer. WARNING: WordPress does some funky things with the quotes(“), you may have to correct if you copy and paste.

samplesymptom <-read_csv(“;)
samplesWithSymptoms <- samplesymptom %>% group_by(sampleId) %>% summarize(peopleWith=n()) %>% arrange(desc(peopleWith)) %>% select(sampleId)
load all tax rank sets with reasonable amount of data and then filter to those observations with symptoms
byPhylum <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
byClass <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
byFamily <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
byGenus <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
byKingdom <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
byNoRank <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
byOrder <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
bySpecies <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
bySubClass <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
bySubOrder <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
bySubPhylum <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
bySuperKingdom <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
bySuperPhylum <-inner_join(read_csv(“;),samplesWithSymptoms,by=”sampleId”)
create lists/vectors with labels so we can walk over each later
TaxRankCount <-13
TaxRankList <- c(byClass,byFamily,byGenus,byKingdom,byNoRank,byOrder,byPhylum,bySpecies,bySubClass,bySubOrder,bySubPhylum,bySuperKingdom,bySuperPhylum)
TaxRankLabel <- c(“byClass”,”byFamily”,”byGenus”,”byKingdom”,”byNoRank”,”byOrder”,”byPhylum”,”bySpecies”,”bySubClass”,”bySubOrder”,”bySubPhylum”,”bySuperKingdom”,”bySuperPhylum”)
read in symptom names
symptoms <-read_csv(“; )

Above as image

If you are using R-Studio, you will see a nice report on your right

The following code will generate a listing of symptom frequency.

SymptomsWithCount % group_by(symptomId,symptom_name) %>% group_by(symptomId,symptom_name) %>% summarize(peopleWith=n()) %>% arrange(desc(peopleWith)) head(SymptomsWithCount)

In my prior post,Building a model of Symptoms and Symptoms Analysis with R, I showed the basic process. I will try to reduce the amount of typing by adding a bit of code.

One Line of Code After …

The code below loads the “broom” library to bewitch the data and creates a function (using linear model – lm) that cleans up the data and then see if we use everything in one level of taxonomy rank, is there any significance.

canexplain <- function(x,onesymptomId){
sampWithSymptom <- samplesymptom %>% filter(symptomId==onesymptomId) %>% group_by(sampleId) %>% summarize(symptom=n())
regData1 <-left_join(samplesWithSymptoms,sampWithSymptom,by=”sampleId”) regData2<-inner_join(x,regData1,by=”sampleId”) regData2[] <-0 #Fix NA regData3 <- regData2[,colSums(regData2 > 0) > 2]
regModel <- lm(symptom ~. -sampleId,regData3) r2 <-glance(regModel)$adj.r.squared prob <-glance(regModel)$p.value symptomName=symptoms%>%filter(symptomId==onesymptomId) %>% select(symptom_name)
print(paste(deparse(substitute(x)),” for “,onesymptomId,” “,symptomName,” r2=”,r2,”, p-value=”,prob))

The above code as an image (WP format issues)

Then if I type canexplain(byPhylum,1), we get a short report (followed by more details)

Suppose you want to test the symptoms above, we can do it with code like below:

test<-canexplain(byPhylum,1) test<-canexplain(byClass,1) test<-canexplain(byFamily,1) test<-canexplain(byGenus,1) test<-canexplain(byNoRank,1) test<-canexplain(byOrder,1) test<-canexplain(bySpecies,1) test<-canexplain(bySubClass,1) test<-canexplain(bySubPhylum,1)

The results are interesting and shown below

General fatigue is explained only 38% at the genus level (genus means lactobacillus, bifidobacteria) , but appears to be well explained by specific species.

We can look at which bacteria being present has the greatest risk of fatigue.

And those which decrease risk

A word of warning: a linear model is not the best for predicting fatigue or no fatigue. A better model is logistic, which I will be covering in a future post.

Additionally, at the species level — a lot of people will have ZERO of the specific species, so we need to do a little extra caution in reading what the above mean.

A few more Runs

For IBS, we see that there is a significant association with the genus (57% explained)

We will look at them, but instead of by value, by the probability of them being significant players! (done by tidy(test) %>% arrange(p.value) )

Postive values means having this genus increases your odds, negative values decreases the odds.

Bottom Line

This was intended as a “show how to do”. There are some problems at the species level, we have 2000 bacteria (columns) and only 100 observations. This too-many choices and not enough observations can result in misleading conclusions with a linear model. This also applies to the genus level where we have 900 bacteria. I will be working through these issues in subsequent posts.