
Statistical Methods for the Analysis of Time–Location Sampling Data


Abstract

Time–location sampling (TLS) is useful for collecting information on a hard-to-reach population (such as men who have sex with men [MSM]) by sampling locations where persons of interest can be found, and then sampling those who attend. These studies have typically been analyzed as a simple random sample (SRS) from the population of interest. If this population is the source population, as we assume here, such an analysis is likely to be biased, because it ignores possible associations between outcomes of interest and frequency of attendance at the locations sampled, and is likely to underestimate the uncertainty in the estimates, as a result of ignoring both the clustering within locations and the variation in the probability of sampling among members of the population who attend sampling locations. We propose that TLS data be analyzed as a two-stage sample survey using a simple weighting procedure based on the inverse of the approximate probability that a person was sampled and using sample survey analysis software to estimate the standard errors of estimates (to account for the effects of clustering within the first stage [locations] and variation in the weights). We use data from the Young Men’s Survey Phase II, a study of MSM, to show that, compared with an analysis assuming a SRS, weighting can affect point prevalence estimates and estimates of associations and that weighting and clustering can substantially increase estimates of standard errors. We describe data on location attendance that would yield improved estimates of weights. We comment on the advantages and disadvantages of TLS and respondent-driven sampling.


References

1. Valleroy L, MacKellar DA, Karon JM, et al. HIV prevalence and associated risks in young men who have sex with men. J Amer Med Assoc. 2000; 284(2): 198–204.

2. MacKellar D, Valleroy L, Karon J, Lemp G, Janssen R. The Young Men’s Survey: methods for estimating HIV seroprevalence and risk factors among young men who have sex with men. Public Health Rep. 1996; 111(Supplement): 138–144.

3. MacKellar DA, Gallagher KM, Finlayson T, Sanchez T, Lansky A, Sullivan PS. Surveillance of HIV risk and prevention behaviors of men who have sex with men—a national application of venue-based, time–space sampling. Public Health Rep. 2007; 122(suppl 1): 39–47.

4. Weinbaum CM, Lyerla R, MacKellar DA, et al. The Young Men’s Survey Phase II: hepatitis B immunization and infection among young men who have sex with men. Amer J Public Health. 2008; 98(5): 839–845.

5. Kish L. Survey Sampling. New York, NY: Wiley; 1965.

6. Zou G, Donner A. Confidence interval estimation of the intraclass correlation coefficient for binary outcome data. Biometrics. 2004; 60(3): 807–811.

7. Cleveland WS. Visualizing Data. Summit, NJ: Hobart Press; 1993.

8. Kalton G. Methods for oversampling rare subpopulations in social surveys. Surv Methodol. 2009; 35(2): 125–141.

9. Marpsat M, Razafindratsima N. Survey methods for hard-to-reach populations: introduction to the special issue. Methodol Innov Online. 2010; 5(2): 3–16.

10. Semaan S. Time–space sampling and respondent-driven sampling with hard-to-reach populations. Methodol Innov Online. 2010; 5(2): 60–75.

11. Centers for Disease Control and Prevention. Prevalence and awareness of HIV infection among men who have sex with men—21 cities, United States, 2008. MMWR Morb Mortal Wkly Rep. 2010; 59(37): 1201–1207.

12. Oster AM, Wiegand RE, Sionean C, et al. Understanding disparities in HIV infection between black and white MSM in the United States. AIDS. 2011; 25(8): 1103–1112.

13. Goel S, Salganik MJ. Assessing respondent-driven sampling. Proc Natl Acad Sci. 2010; 107(15): 6743–6747.

14. Heckathorn DD. Respondent-driven sampling: a new approach to the study of hidden populations. Soc Probl. 1997; 44(2): 174–199.

15. Kendall C, Kerr LRFS, Gondim RC, et al. An empirical comparison of respondent-driven sampling, time location sampling, and snowball sampling for behavioral surveillance in men who have sex with men, Fortaleza, Brazil. AIDS Behav. 2008; 12(suppl 1): 97–104.

16. McKenzie DJ, Mistiaen J. Surveying migrant households: a comparison of census-based, snowball and intercept point surveys. J R Stat Soc A Stat Soc. 2009; 172(2): 339–360.

17. Volz E, Heckathorn DD. Probability based estimation theory for respondent driven sampling. J Off Stat. 2008; 24(1): 79–97.

18. Gile KJ, Handcock MS. Respondent-driven sampling: an assessment of current methodology. Sociol Methodol. 2010; 40(1): 285–327.

19. Wejnert C. An empirical test of respondent-driven sampling: point estimates, variance, degree measures, and out-of-equilibrium data. Sociol Methodol. 2009; 39(1): 73–116.

20. Becker RA, Chambers JM, Wilks AR. The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth & Brooks/Cole; 1988.

21. Venables WN, Smith DM, and the R Development Core Team. An Introduction to R. 2nd ed. United Kingdom: Network Theory Limited; 2009.

22. http://www.cran.r-project.org/doc/contrib./Verzani-SimpleR.pdf. Accessed 21 April 2011.

23. Lumley T. A Survey Analysis Example. http://faculty.washington.edu/tlumley/survey/doc/survey.pdf. Accessed 21 April 2011.

24. Lumley T. Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: John Wiley & Sons; 2010.


Acknowledgments

We thank Christopher H. Johnson, Nevin Krishna, Lillian S. Lin, Alexandra Oster, and Ryan E. Wiegand, Centers for Disease Control and Prevention (CDC), for helpful comments on the manuscript. John Karon’s work was done as a contractor for CDC.

Author information


Corresponding author

Correspondence to John M. Karon.

Additional information

The findings and conclusions presented here are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Appendix: computational software

We analyze a time–location survey as a two-stage cluster sample, sampling of venues (PSUs) and sampling within venues (the second stage). The number of persons sampled varies among venues, and persons are sampled with unequal probabilities. The sample should be analyzed using software that can accommodate such a sampling design.
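
As an illustration of how weights arise from these two stages, the following sketch (in R; not the authors' code) computes an approximate inclusion probability as the product of the venue-sampling and within-venue sampling probabilities and takes the weight as its inverse; the probability values are hypothetical.

VenueProb  <- c(0.50, 0.50, 0.25)     # hypothetical stage-1 probabilities: person's venue (PSU) sampled
PersonProb <- c(0.20, 0.10, 0.40)     # hypothetical stage-2 probabilities: attender sampled within that venue
InclProb   <- VenueProb * PersonProb  # approximate probability that each person was sampled
Weight     <- 1 / InclProb            # sampling weight: inverse of the approximate sampling probability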

Many software packages could be used, some of which are free. The website www.hcp.med.harvard.edu/statistics/survey-soft/ contains brief descriptions of and links to many of these packages. Well-known commercial packages that could be used include SUDAAN, SAS, Stata, and SPSS (with an additional module). We show code for SAS, which produced the results in Tables 6 and 7, and for R, a free package.

We assume that the dataset has one row for each person sampled. As an illustration, we assume that for each person we have the following variables (see the toy data sets sketched after this list):

  • Site: the area in which sampling was done, if there were multiple areas, as in our data.

  • Stratum: a code for stratum within site; the probability of sampling a PSU is uniform within each stratum.

  • Venue: the venue at which sampling was done.

  • Weight: the sampling weight, the inverse of a constant times an estimate of the probability that the person was sampled.

  • Status: an indicator variable for the presence of the condition for which prevalence is to be estimated (1 if the condition is present).

  • RiskFactor(s): one or more variables with data on a risk factor for which the association with the condition is to be estimated (in Table 7, unprotected anal intercourse in the past 6 months).

In addition, if the sampling probability for PSUs was not constant, analyses in SAS need a second data set with one row for each site and stratum combination, with variables Site, Stratum, and Rate, where Rate is the proportion of PSUs sampled in that stratum. Status, Weight, RiskFactor(s), and Rate should be numeric.
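
For concreteness, the following R sketch builds toy versions of these two data sets; all values, venue labels, and sampling rates are invented for illustration only.

# Toy person-level data set: one row per person sampled (invented values)
DataPersons <- data.frame(
  Site       = c("A", "A", "A", "B"),
  Stratum    = c(1, 1, 2, 1),
  Venue      = c("A-v1", "A-v1", "A-v2", "B-v1"),
  Weight     = c(12.5, 12.5, 8.0, 20.0),  # inverse of (a constant times) the sampling probability
  Status     = c(1, 0, 0, 1),             # 1 = condition present
  RiskFactor = c(1, 1, 0, 0))             # e.g., unprotected anal intercourse in the past 6 months

# Toy rate data set: one row per site-stratum combination, needed in SAS when
# the probability of sampling a PSU varies among strata (invented values)
DataRates <- data.frame(
  Site    = c("A", "A", "B"),
  Stratum = c(1, 2, 1),
  Rate    = c(0.50, 0.25, 0.40))          # proportion of PSUs (venues) sampled in the stratum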

Analysis in SAS

We assume these data sets are SAS data sets with names DataPersons, and, if sampling probabilities vary among PSUs, DataRates, in the folder c:/SAS/data. To obtain the estimated prevalences and standard errors in Table 6, we used the following code:

LIBNAME in "c:/SAS/data";

Data Step1; set in.DataPersons; run;

Proc sort data = Step1; by Site; run;

Proc SurveyMeans data = Step1 Rate = <value or DataRates> mean stderr var clm;
  by Site;
  cluster Venue; Var Status; weight Weight;
  ODS output Statistics = SummaryData;
run;

If all PSUs are sampled with the same probability, Rate should equal the sampling probability, and the cluster statement should be omitted. The ODS statement uses the Output Delivery System to create the data set SummaryData with one row for each value in site. For each site, the dataset contains the estimated prevalence (mean), and its standard error (stderr), variance (var, useful for computing design effects), and a confidence interval (clm, by default, a 95% interval). The SurveyMeans procedure produces several pages of output for each site.

The code for logistic regression results uses the SurveyLogistic procedure available in SAS version 9. It is easiest to do the analysis for a single site. Let step2 be the data set with data for one of the sites. If sampling probabilities vary among PSUs for this site, let DataRatesSite be the data set restricted to this site.

Proc SurveyLogistic data = step2 Rate = <value or DataRatesSite>;
  strata Stratum; cluster Venue; weight Weight;
  ODS output ParameterEstimates = ParameterEsts;
  model Status (desc) = RiskFactor;
run;

The ODS statement creates a data set with parameter estimates. If there are repeated analyses, the data sets can be concatenated to summarize the output.

Analysis in R

R is a powerful statistical programming package that implements the S language20 and contains functions to implement many statistical procedures. The package can be downloaded from http://cran.r-project.org/; you will also need the library named survey, obtained from a link on the same home page. The Help feature on the R commands page has links, under Manuals, to two useful documents: An Introduction to R and R Data Import/Export. See Venables et al.21 for an introduction to R; a preliminary 114-page version is available from the R website.22 Another useful document is A Survey Analysis Example,23 by Thomas Lumley, author of the R survey library; see also Lumley’s book.24 The following discussion assumes that a user has learned the basics of R.

Survey Data Analysis

The structure of the dataset is the same as for SAS, except that there should be an additional variable FPC, which is either the number of venues in the sampling frame from which the venue was chosen, or the sampling probability for the venue. If the venue was chosen with certainty, FPC should be 1. To ignore the finite population correction, set FPC to a small number (such as 0.001). Persons with a missing value for essential data (sampling weight, venue) should be removed. If the dataset is ASCII, it should be edited to change the missing value code to NA; otherwise, use a distinctive numeric code for missing data. It is desirable, but not necessary, to have the variable names in the first row of the data set. It is easiest if this dataset is saved in the folder containing the R software.
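
As a sketch of this preparation step (assuming a numeric missing-value code of 999 and a hypothetical column VenuesInFrame holding the number of venues in the relevant sampling frame; neither is part of the study's data):

# Recode an assumed numeric missing-value code (here 999) to NA
DataFrame$Status[DataFrame$Status == 999] <- NA

# FPC: here taken from a hypothetical column VenuesInFrame (number of venues in the frame);
# use 1 for venues chosen with certainty, or a small value such as 0.001 to ignore the correction
DataFrame$FPC <- DataFrame$VenuesInFrame

# Remove persons with missing essential data (sampling weight or venue)
DataFrame <- DataFrame[!is.na(DataFrame$Weight) & !is.na(DataFrame$Venue), ]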

We first define a data frame containing the data. If the data set is in a project folder, replace “filename” by the path and filename. The operator “<-” defines an object to be the value of the function or expression on the right-hand side of this operator.

library(survey)  # load the survey library

DataFrame <- read.table("filename", header = TRUE)   # variable names in row 1 of the data set

DataFrame <- read.table("filename", header = FALSE,
  col.names = c(character list))                     # variable names not in row 1 of the data set
  # character list is "variable1", "variable2", ..., "variablev"

DataFrame[1:10, ]  # look at the data for the first 10 rows of the data set

We now analyze the prevalence of Status. We use one of the two assignments below to create a new data frame from which persons with a missing value of Status are removed. The operator “!” is the logical negation (“not”) operator. Use the second assignment if missing values are coded with the numeric value Mcode rather than NA.

DataStatus <- DataFrame[!is.na(DataFrame[, "Status"]), ]    # missing values coded as NA

DataStatus <- DataFrame[DataFrame[, "Status"] != Mcode, ]   # missing values coded as Mcode

# create an object with the survey design information (the order of function arguments is arbitrary)
SurveyStatus <- svydesign(data = DataStatus, ids = ~Venue, strata = NULL,
  weights = ~Weight, fpc = ~FPC)

svymean(x = ~Status, design = SurveyStatus, deff = TRUE)  # prevalence estimate

In the previous two statements, note that it is essential to have the tilde (“~”) in defining some function arguments. svymean() returns the variable analyzed (Status), the weighted mean, the standard error, and the design effect. Omit deff = TRUE if the design effect is not to be computed.
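
If the estimates are needed for further computation rather than only printed, the usual extractor functions can be applied to the object returned by svymean(); a minimal sketch, assuming the SurveyStatus design object defined above:

PrevStatus <- svymean(x = ~Status, design = SurveyStatus, deff = TRUE)
coef(PrevStatus)     # weighted prevalence estimate
SE(PrevStatus)       # standard error reflecting clustering and unequal weights
confint(PrevStatus)  # confidence interval (95% by default)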

To obtain the logistic regression estimate of the association between Status and RiskFactor, we first remove persons with missing values of RiskFactor, e.g.

DataStatusRisk <- DataStatus[!is.na(DataStatus[, "RiskFactor"]), ]

SurveyStatusRisk <- svydesign(data = DataStatusRisk, ids = ~Venue, weights = ~Weight,
  fpc = ~FPC)

logitmodelStatus <- svyglm(Status ~ RiskFactor, design = SurveyStatusRisk,
  family = quasibinomial())

summary(logitmodelStatus)  # results of logistic regression

summary() prints information about the model used and, for the intercept and each risk factor, the regression coefficient, standard error, Student’s t value, and p value.
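
Because the coefficients are reported on the log-odds scale, a common follow-up is to report odds ratios with confidence intervals; a minimal sketch, assuming the logitmodelStatus object fitted above:

exp(coef(logitmodelStatus))     # odds ratios for the intercept and each risk factor
exp(confint(logitmodelStatus))  # corresponding confidence intervals (95% by default)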


Cite this article

Karon, J.M., Wejnert, C. Statistical Methods for the Analysis of Time–Location Sampling Data. J Urban Health 89, 565–586 (2012). https://doi.org/10.1007/s11524-012-9676-8
