Yearb Med Inform 2014; 23(01): 42-47
DOI: 10.15265/IY-2014-0018
Original Article
Georg Thieme Verlag KG Stuttgart

Technical Challenges for Big Data in Biomedicine and Health: Data Sources, Infrastructure, and Analytics

N. Peek
1   Dept. of Medical Informatics, Academic Medical Center, University of Amsterdam, The Netherlands
2   Centre for Health Informatics, Institute of Population Health, University of Manchester, Manchester, UK
J. H. Holmes
3   Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
J. Sun
4   College of Computing, Georgia Institute of Technology, Atlanta, GA, USA

Publication History

29 January 2015

Publication Date: 05 March 2018 (online)

Summary

Objectives: To review technical and methodological challenges for big data research in biomedicine and health.

Methods: We discuss sources of big datasets, survey infrastructures for big data storage and processing, and describe the main challenges that arise when analyzing big data.

Results: The life and biomedical sciences are contributing massively to the big data revolution through secondary use of data collected during routine care and through new data sources such as social media. Efficient processing of big datasets is typically achieved by distributing computation over a cluster of computers. Data analysts should be aware of pitfalls related to big data, such as bias in routine care data and the risk of false-positive findings in high-dimensional datasets.

Conclusions: The major challenge for the near future is to transform the analytical methods used in the biomedical and health domain to fit the distributed storage and processing model required to handle big data, while ensuring confidentiality of the data being analyzed.
