This vignette provides several examples of preparing verbal autopsy (VA) data with the CrossVA function odk2openVA and then assigning a cause of death (CoD) with openVA. CrossVA is designed to work with VA data collected using a questionnaire in the same format as the 2016 VA instruments developed by the World Health Organization (WHO). Versions 1.4.1 and 1.5.1 of the 2016 WHO VA instrument are currently supported.

CrossVA also includes tools for converting data from the 2016 questionnaire into the format of the 2012 questionnaire, e.g., map_records. These functions create inputs for older versions of the InterVA and InSilicoVA algorithms and, thus, are excluded from this vignette. We recommend using odk2openVA as well as the newest versions of the InSilicoVA and InterVA algorithms.

Example R Sessions

CrossVA has two sets of functions:

  1. map_records – converts data from the 2016 format into the 2012 format, for use with insilico(data.type = "WHO2012") and InterVA4
  2. odk2openVA – converts data from the 2016 format for use with insilico(data.type = "WHO2016") and InterVA5

Both sets of functions take as input comma-separated values (CSV) files created with the Open Data Kit (ODK) Briefcase program.

Before we load our data and assign CoD, we need to know the path of the folder that contains the data. Sometimes it is useful to change R’s current working directory to the folder where the data file is located:

# Print the current working directory
getwd()
#> [1] "C:/Users/LeoMessi/"

# Change your current working directory as follows
setwd("C:/Users/LeoMessi/Verbal-Autopsy")
getwd()
#> [1] "C:/Users/LeoMessi/Verbal-Autopsy"

# Print the files in your current working directory with the dir() function
# (the CSV file you created with ODK Briefcase should be listed)
dir()
#> [1] "vaData_2016.csv"    "vaData_2012.csv" "rCode_for_cleaning_vaData.R"
#> [4] "vaData_report.pdf"  "cat_Videos"

# Load data into R
odkexport <- read.csv("vaData_2016.csv", stringsAsFactors = FALSE)

If you prefer to keep your data in one folder and your R code in another, you can pass the full path of the CSV file exported by ODK Briefcase to read.csv(). This approach is used below with the example data file who151_odk_export.csv.
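As a minimal sketch of reading a file by its path instead of calling setwd(), the example below builds the path with file.path() and checks that the file exists before reading it. The folder and the tiny stand-in CSV are hypothetical and are created in a temporary directory only so that the example runs on its own; with real data, csv_path would point at your ODK Briefcase export.

```r
# Hypothetical folder and file names, created in a temporary directory
# purely for illustration (replace with the location of your own export)
data_dir <- file.path(tempdir(), "Verbal-Autopsy")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "vaData_2016.csv")

# Write a tiny stand-in file so this example is self-contained
write.csv(data.frame(id = 1:2, sex = c("female", "male")),
          csv_path, row.names = FALSE)

# Confirm the file is where we expect, then read it without changing
# the working directory
stopifnot(file.exists(csv_path))
odkexport <- read.csv(csv_path, stringsAsFactors = FALSE)
dim(odkexport)
```

Building the path with file.path() avoids hard-coding a path separator, so the same code works on Windows, macOS, and Linux.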

Analysis of 2016 WHO Verbal Autopsy Instrument

Again, CrossVA currently supports versions 1.5.1 and 1.4.1 of the 2016 WHO VA questionnaire.

Questionnaire version 1.5.1

# Start by loading the CrossVA and openVA packages from your library
library(CrossVA)
library(openVA)

# Load the CSV from ODK Briefcase (here we use the example data file from the CrossVA package)
## fileName_v151 contains the path to the example data file
fileName_v151 <- system.file("sample", "who151_odk_export.csv", package = "CrossVA")
fileName_v151
#> [1] "C:/Users/LeoMessi/R/win-library/3.5/CrossVA/sample/who151_odk_export.csv"

dir("C:/Users/LeoMessi/R/win-library/3.5/CrossVA/sample/")
#> [1] "who151_odk_export.csv"    "who141_odk_export.csv"    "who_va_output.csv"

# Use the read.csv() function to load the data
odkexport_v151 <- read.csv(fileName_v151, stringsAsFactors = FALSE)

Now that the CSV file is loaded into R, we can use CrossVA’s function odk2openVA to convert the CSV file into the proper format (i.e., a data frame with 354 columns).

# Convert VAs using the odk2openVA() function
## we will be able to use either InterVA5 or insilico(data.type = "WHO2016") to assign CoD
openva_input_v151 <- odk2openVA(odkexport_v151, version = "1.5.1")

# For 2016 WHO VA instrument, the output needs to have 354 columns (1 ID + 353 symptoms)
dim(openva_input_v151)
#> [1]   4 354

# ID must be the first column
names(openva_input_v151)
#>   [1] "ID"    "i004a" "i004b" "i019a" "i019b" "i022a" "i022b" "i022c"
#>   [9] "i022d" "i022e" "i022f" "i022g" "i022h" "i022i" "i022j" "i022k"
#>  [17] "i022l" "i022m" "i022n" "i059o" "i077o" "i079o" "i082o" "i083o"
#>  [25] "i084o" "i085o" "i086o" "i087o" "i089o" "i090o" "i091o" "i092o"
#>  [33] "i093o" "i094o" "i095o" "i096o" "i098o" "i099o" "i100o" "i104o"
#>  [41] "i105o" "i106a" "i107o" "i108a" "i109o" "i110o" "i111o" "i112o"
#>  [49] "i113o" "i114o" "i115o" "i116o" "i120a" "i120b" "i123o" "i125o"
#>  [57] "i127o" "i128o" "i129o" "i130o" "i131o" "i132o" "i133o" "i134o"
#>  [65] "i135o" "i136o" "i137o" "i138o" "i139o" "i140o" "i141o" "i142o"
#>  [73] "i143o" "i144o" "i147o" "i148a" "i148b" "i148c" "i149o" "i150a"
#>  [81] "i151a" "i152o" "i153o" "i154a" "i154b" "i155o" "i156o" "i157o"
#>  [89] "i158o" "i159o" "i161a" "i165a" "i166o" "i167a" "i167b" "i168o"
#>  [97] "i169a" "i169b" "i170o" "i171o" "i172o" "i173a" "i174o" "i175o"
#> [105] "i176a" "i178a" "i181o" "i182a" "i182b" "i182c" "i183a" "i184a"
#> [113] "i185o" "i186o" "i187o" "i188o" "i189o" "i190o" "i191o" "i192o"
#> [121] "i193o" "i194o" "i195o" "i197a" "i197b" "i199a" "i199b" "i200o"
#> [129] "i201a" "i201b" "i203a" "i204o" "i205a" "i205b" "i207o" "i208o"
#> [137] "i209a" "i209b" "i210o" "i211a" "i212o" "i213o" "i214o" "i215o"
#> [145] "i216a" "i217o" "i218o" "i219o" "i220o" "i221a" "i221b" "i222o"
#> [153] "i223o" "i224o" "i225o" "i226o" "i227o" "i228o" "i229o" "i230o"
#> [161] "i231o" "i232a" "i233o" "i234a" "i234b" "i235a" "i235b" "i235c"
#> [169] "i235d" "i236o" "i237o" "i238o" "i239o" "i240o" "i241o" "i242o"
#> [177] "i243o" "i244o" "i245o" "i246o" "i247o" "i248a" "i249o" "i250a"
#> [185] "i251o" "i252o" "i253o" "i254o" "i255o" "i256o" "i257o" "i258o"
#> [193] "i259o" "i260a" "i260b" "i260c" "i260d" "i260e" "i260f" "i260g"
#> [201] "i261o" "i262a" "i263a" "i263b" "i264o" "i265o" "i266a" "i267o"
#> [209] "i268o" "i269o" "i270o" "i271o" "i272o" "i273o" "i274a" "i275o"
#> [217] "i276o" "i277o" "i278o" "i279o" "i281o" "i282o" "i283o" "i284o"
#> [225] "i285a" "i286o" "i287o" "i288o" "i289o" "i290o" "i294o" "i295o"
#> [233] "i296o" "i297o" "i298o" "i299o" "i300o" "i301o" "i302o" "i303a"
#> [241] "i304o" "i305o" "i306o" "i309o" "i310o" "i312o" "i313o" "i314o"
#> [249] "i315o" "i316o" "i317o" "i318o" "i319a" "i319b" "i320o" "i321o"
#> [257] "i322o" "i323o" "i324o" "i325o" "i326o" "i327o" "i328o" "i329o"
#> [265] "i330o" "i331o" "i332a" "i333o" "i334o" "i335o" "i336o" "i337a"
#> [273] "i337b" "i337c" "i338o" "i340o" "i342o" "i343o" "i344o" "i347o"
#> [281] "i354o" "i355a" "i356o" "i357o" "i358a" "i360a" "i360b" "i360c"
#> [289] "i361o" "i362o" "i363o" "i364o" "i365o" "i367a" "i367b" "i367c"
#> [297] "i368o" "i369o" "i370o" "i371o" "i372o" "i373o" "i376o" "i377o"
#> [305] "i382a" "i383o" "i384o" "i385a" "i387o" "i388o" "i389o" "i391o"
#> [313] "i393o" "i394a" "i394b" "i395o" "i396o" "i397o" "i398o" "i399o"
#> [321] "i400o" "i401o" "i402o" "i403o" "i404o" "i405o" "i406o" "i408o"
#> [329] "i411o" "i412o" "i413o" "i414a" "i415a" "i418o" "i419o" "i420o"
#> [337] "i421o" "i422o" "i423o" "i424o" "i425o" "i426o" "i427o" "i428o"
#> [345] "i450o" "i451o" "i452o" "i453o" "i454o" "i455o" "i456o" "i457o"
#> [353] "i458o" "i459o"
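The two checks above (354 columns, with ID first) can be wrapped in a small function that reports any problem it finds. This is only a sketch: check_openva_input() is a hypothetical helper, not part of CrossVA or openVA, and the toy data frame stands in for real odk2openVA() output so that the example runs on its own.

```r
# Hypothetical helper (not part of CrossVA or openVA): returns a character
# vector describing any problems, or an empty vector if none are found
check_openva_input <- function(x, n_cols = 354) {
  stopifnot(is.data.frame(x))
  problems <- character(0)
  if (ncol(x) != n_cols) {
    problems <- c(problems,
                  sprintf("expected %d columns, found %d", n_cols, ncol(x)))
  }
  if (ncol(x) == 0 || names(x)[1] != "ID") {
    problems <- c(problems, "first column is not named 'ID'")
  }
  problems
}

# Toy data frame with the ID column first but far too few columns
toy <- data.frame(ID = c("d1", "d2"), i004a = c("y", "n"),
                  stringsAsFactors = FALSE)
check_openva_input(toy)
```

With real data, check_openva_input(openva_input_v151) should return an empty character vector when the conversion produced the expected layout.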

Now that the VA records have been converted into the expected format, we can use the tools in the openVA package to analyze the data. There are separate functions for each algorithm: InterVA, InterVA5, and insilico. For convenience, openVA also includes a wrapper function, codeVA, which calls any of these algorithms to assign CoD.

# InterVA5
run1 <- InterVA5(openva_input_v151,
                 HIV = "l",
                 Malaria = "l",
                 write = TRUE,
                 directory = getwd())
#> .25% completed
#> .50% completed
#> .75% completed
#> .100% completed

# We could also use codeVA() to get the same results:
## run1 <- codeVA(openva_input_v151,
##                data.type = "WHO2016",
##                model = "InterVA",
##                version = "5.0",
##                HIV = "l",
##                Malaria = "l",
##                write = TRUE,
##                directory = getwd())

By default, write = TRUE, so we must also supply the directory argument: the folder where the log file is created. The log file lists the VA records that were excluded from the analysis (usually because they have a missing value for age and/or sex) as well as any changes made to keep the indicators consistent with each other. We can use the following commands to summarize the results.

# List the top 5 causes in the Cause-Specific Mortality Fraction (CSMF)
summary(run1)
#> InterVA5 fitted on 4 deaths
#> CSMF calculated using reported causes by InterVA5 only
#> The remaining probabilities are assigned to 'Undetermined'
#> 
#> Top 5 CSMFs:
#>  cause                  likelihood
#>  Road traffic accident  0.2500    
#>  HIV/AIDS related death 0.2500    
#>  Diabetes mellitus      0.1727    
#>  Undetermined           0.1594    
#>  Intentional self-harm  0.1023    
#> 
#> Top 5 Circumstance of Mortality Category:
#>  cause          likelihood
#>  Knowledge      0.50      
#>  Culture        0.25      
#>  Emergency      0.25      
#>  Health systems 0.00      
#>  Inevitable     0.00

# We can list more causes with the top parameter.
summary(run1, top = 10)
#> InterVA5 fitted on 4 deaths
#> CSMF calculated using reported causes by InterVA5 only
#> The remaining probabilities are assigned to 'Undetermined'
#> 
#> Top 10 CSMFs:
#>  cause                            likelihood
#>  Road traffic accident            0.2500    
#>  HIV/AIDS related death           0.2500    
#>  Diabetes mellitus                0.1727    
#>  Undetermined                     0.1594    
#>  Intentional self-harm            0.1023    
#>  Assault                          0.0655    
#>  Sepsis (non-obstetric)           0.0000    
#>  Acute resp infect incl pneumonia 0.0000    
#>  Diarrhoeal diseases              0.0000    
#>  Malaria                          0.0000    
#> 
#> Top 6 Circumstance of Mortality Category:
#>  cause          likelihood
#>  Knowledge      0.50      
#>  Culture        0.25      
#>  Emergency      0.25      
#>  Health systems 0.00      
#>  Inevitable     0.00      
#>  Resources      0.00

# Create a bar plot of the CSMF.
plotVA(run1)


# InterVA5 will also write a CSV file, called VA5_result.csv, with the CoDs for each record.
# Also note that InterVA5 created the log file, errorlogV5.txt
dir()
#> [1] "errorlogV5.txt"                "using-crossva-and-openva.html"
#> [3] "using-crossva-and-openva.Rmd"  "VA5_result.csv"
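The two files listed above can be inspected from within R. The sketch below assumes they were written to the current working directory, as in the InterVA5() call above; it makes no assumption about the column names in VA5_result.csv and simply prints whatever structure it finds. If the files are not present, it reports that instead of failing.

```r
# Paths assume the files were written to the current working directory
result_file <- file.path(getwd(), "VA5_result.csv")
log_file <- file.path(getwd(), "errorlogV5.txt")

if (file.exists(result_file)) {
  # Read the record-level results and show the first few columns
  va5_results <- read.csv(result_file, stringsAsFactors = FALSE)
  str(va5_results, list.len = 10)
} else {
  message("No VA5_result.csv found in ", getwd())
}

if (file.exists(log_file)) {
  # Print the first lines of the log of excluded records and data changes
  writeLines(head(readLines(log_file), 10))
} else {
  message("No errorlogV5.txt found in ", getwd())
}
```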

We can also assign CoDs using the InSilicoVA algorithm.

run2 <- insilico(openva_input_v151, data.type = "WHO2016")
#> Performing data consistency check...
#> Data check finished.
#> Warning: 196 symptom missing completely and added to missing list 
#> List of missing symptoms: 
#>  i004b, i022d, i022e, i022f, i022g, i022l, i022m, i022n, i059o, i077o, i079o, i082o, i083o, i084o, i085o, i086o, i087o, i089o, i090o, i091o, i092o, i093o, i127o, i128o, i129o, i130o, i131o, i132o, i136o, i140o, i147o, i148a, i148c, i149o, i150a, i151a, i152o, i154b, i158o, i159o, i161a, i165a, i166o, i167a, i168o, i170o, i172o, i181o, i182a, i184a, i195o, i197b, i199a, i199b, i200o, i201b, i203a, i204o, i205a, i207o, i208o, i209a, i217o, i218o, i221a, i224o, i225o, i233o, i240o, i241o, i242o, i243o, i244o, i245o, i246o, i247o, i249o, i250a, i251o, i254o, i257o, i259o, i260a, i260b, i260c, i260d, i260e, i260f, i260g, i261o, i262a, i263a, i263b, i264o, i265o, i266a, i267o, i268o, i269o, i270o, i272o, i274a, i275o, i276o, i277o, i278o, i279o, i281o, i285a, i287o, i288o, i289o, i290o, i294o, i295o, i296o, i297o, i298o, i299o, i300o, i301o, i302o, i303a, i304o, i305o, i306o, i309o, i310o, i312o, i313o, i314o, i315o, i316o, i317o, i318o, i319a, i319b, i320o, i321o, i322o, i323o, i324o, i325o, i326o, i327o, i328o, i329o, i330o, i331o, i332a, i333o, i334o, i335o, i336o, i337a, i337b, i337c, i338o, i340o, i342o, i343o, i344o, i347o, i354o, i355a, i356o, i357o, i358a, i360a, i360b, i360c, i361o, i362o, i363o, i364o, i365o, i367a, i367b, i367c, i368o, i369o, i370o, i371o, i372o, i373o, i376o, i377o, i382a, i383o, i384o, i385a, i387o, i400o, i401o, i402o, i404o
#> Not all causes with CSMF > 0.02 are convergent.
#> Increase chain length with another 4000 iterations
#> Not all causes with CSMF > 0.02 are convergent.
#> Increase chain length with another 8000 iterations
#> Not all causes with CSMF > 0.02 are convergent.
#>  Please check using csmf.diag() for more information.

## run2 <- codeVA(openva_input_v151,
##                data.type = "WHO2016",
##                model = "InSilicoVA",
##                version = "WHO2016")

# Print CSMF for top 6 causes
summary(run2, top = 6)
#> InSilicoVA Call: 
#> 4 death processed
#> 16000 iterations performed, with first 8000 iterations discarded
#>  800 iterations saved after thinning
#> Fitted with re-estimated conditional probability level table
#> Data consistency check performed as in InterVA4 
#> 
#> Top 6 CSMFs:
#>                                    Mean Std.Error  Lower Median  Upper
#> Road traffic accident            0.2492    0.0000 0.2492 0.2492 0.2492
#> Diabetes mellitus                0.1925    0.1589 0.0092 0.1435 0.6012
#> HIV/AIDS related death           0.1510    0.1440 0.0062 0.1000 0.4866
#> Other and unspecified infect dis 0.0850    0.1041 0.0000 0.0460 0.3300
#> Acute resp infect incl pneumonia 0.0813    0.1398 0.0000 0.0031 0.4915
#> Asthma                           0.0454    0.0594 0.0009 0.0238 0.2170

# Plot CSMF
plotVA(run2)
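The convergence messages printed by insilico() above point to csmf.diag() for more detail. A hedged sketch of one way to follow up is shown below: it refits the model with a longer chain and then runs the diagnostic. The chain settings are illustrative values, not recommendations, and the argument names (Nsim, burnin, thin, auto.length, conv.csmf, test) are written as I understand the InSilicoVA interface, so check them against ?insilico and ?csmf.diag. The code runs only if InSilicoVA is installed and openva_input_v151 exists from the steps above; with only four deaths, the chain may still fail to converge.

```r
# Illustrative chain settings (assumed values, longer than the first run)
mcmc_settings <- list(Nsim = 20000, burnin = 10000, thin = 10)

if (requireNamespace("InSilicoVA", quietly = TRUE) &&
    exists("openva_input_v151")) {
  # Refit with a fixed, longer chain instead of automatic extension
  run2_long <- InSilicoVA::insilico(openva_input_v151,
                                    data.type = "WHO2016",
                                    Nsim = mcmc_settings$Nsim,
                                    burnin = mcmc_settings$burnin,
                                    thin = mcmc_settings$thin,
                                    auto.length = FALSE)

  # Single-chain convergence check for causes with CSMF above 0.02
  InSilicoVA::csmf.diag(run2_long, conv.csmf = 0.02, test = "heidel")
} else {
  message("InSilicoVA or openva_input_v151 not available; skipping.")
}
```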

Questionnaire version 1.4.1

# If you have not run the previous code, make sure you have loaded the packages
# library(CrossVA)
# library(openVA)
fileName_v141 <- system.file("sample", "who141_odk_export.csv", package = "CrossVA")
odkexport_v141 <- read.csv(fileName_v141, stringsAsFactors = FALSE)

# Convert VAs using the odk2openVA() function for version 1.4.1
## we will be able to use either InterVA5 or insilico(data.type = "WHO2016") to assign CoD
openva_input_v141 <- odk2openVA(odkexport_v141, version = "1.4.1")
dim(openva_input_v141)
#> [1]  16 354

# Assign CoD with model = InterVA5 and codeVA
run3 <- codeVA(openva_input_v141,
               data.type = "WHO2016",
               model = "InterVA",
               version = "5.0",
               HIV = "l",
               Malaria = "l",
               write = TRUE,
               directory = getwd())
#> ..12% completed
#> ..25% completed
#> ..38% completed
#> ..50% completed
#> ..62% completed
#> ..75% completed
#> ..88% completed
#> ..100% completed

## Summarize InterVA5 results
summary(run3)
#> InterVA5 fitted on 16 deaths
#> CSMF calculated using reported causes by InterVA5 only
#> The remaining probabilities are assigned to 'Undetermined'
#> 
#> Top 5 CSMFs:
#>  cause                  likelihood
#>  Diarrhoeal diseases    0.3246    
#>  HIV/AIDS related death 0.1333    
#>  Assault                0.1328    
#>  Pulmonary tuberculosis 0.0914    
#>  Obstetric haemorrhage  0.0667    
#> 
#> Top 5 Circumstance of Mortality Category:
#>  cause          likelihood
#>  Culture        0.2500    
#>  Knowledge      0.2500    
#>  Emergency      0.1875    
#>  Multiple       0.1875    
#>  Health systems 0.0625
plotVA(run3)


# Assign CoD with model = InSilicoVA and codeVA
run4 <- codeVA(openva_input_v141,
               data.type = "WHO2016",
               model = "InSilicoVA")
#> Performing data consistency check...
#> .
#> Data check finished.
#> Warning: 132 symptom missing completely and added to missing list 
#> List of missing symptoms: 
#>  i022a, i022g, i022n, i059o, i077o, i079o, i082o, i083o, i084o, i085o, i086o, i087o, i089o, i090o, i091o, i092o, i093o, i148a, i148c, i178a, i184a, i199b, i200o, i201b, i203a, i204o, i205a, i207o, i208o, i209a, i211a, i213o, i214o, i216a, i217o, i218o, i219o, i220o, i221a, i221b, i227o, i233o, i234b, i235a, i235d, i236o, i237o, i238o, i240o, i241o, i242o, i243o, i244o, i245o, i246o, i247o, i249o, i250a, i251o, i260c, i260e, i261o, i264o, i265o, i266a, i267o, i268o, i269o, i270o, i275o, i278o, i281o, i298o, i299o, i300o, i305o, i306o, i312o, i313o, i315o, i316o, i317o, i318o, i322o, i325o, i328o, i330o, i331o, i333o, i334o, i337a, i337b, i340o, i354o, i355a, i356o, i357o, i358a, i360a, i360b, i360c, i361o, i362o, i363o, i364o, i365o, i367a, i367b, i367c, i368o, i369o, i370o, i371o, i372o, i373o, i376o, i377o, i382a, i383o, i384o, i385a, i387o, i393o, i394a, i395o, i396o, i397o, i398o, i399o, i400o, i401o, i402o
#> Not all causes with CSMF > 0.02 are convergent.
#> Increase chain length with another 10000 iterations
#> Not all causes with CSMF > 0.02 are convergent.
#> Increase chain length with another 20000 iterations
#> Not all causes with CSMF > 0.02 are convergent.
#>  Please check using csmf.diag() for more information.

## Summarize InSilicoVA results
summary(run4)
#> InSilicoVA Call: 
#> 16 death processed
#> 40000 iterations performed, with first 20000 iterations discarded
#>  2000 iterations saved after thinning
#> Fitted with re-estimated conditional probability level table
#> Data consistency check performed as in InterVA4 
#> 
#> Top 10 CSMFs:
#>                                     Mean Std.Error  Lower Median  Upper
#> Diarrhoeal diseases               0.3067    0.1113 0.1229 0.2931 0.5551
#> Assault                           0.1875    0.0000 0.1875 0.1875 0.1875
#> Malaria                           0.1598    0.0891 0.0281 0.1466 0.3650
#> HIV/AIDS related death            0.1189    0.0841 0.0113 0.1006 0.3187
#> Pulmonary tuberculosis            0.0684    0.0601 0.0030 0.0533 0.2309
#> Other and unspecified cardiac dis 0.0524    0.0504 0.0023 0.0373 0.1867
#> Obstetric haemorrhage             0.0486    0.0525 0.0005 0.0287 0.1866
#> Tetanus                           0.0075    0.0209 0.0000 0.0002 0.0587
#> Haemorrhagic fever (non-dengue)   0.0074    0.0144 0.0000 0.0014 0.0491
#> Dengue fever                      0.0049    0.0135 0.0000 0.0014 0.0372
plotVA(run4)