Current Predoctoral Fellows (2017-2018)
advisor: Lee-Jen Wei, Tianxi Cai
David is a fifth-year PhD student. He successfully defended his thesis, “Estimating Causal Effects in Pragmatic Settings with Imperfect Information.” He will graduate this semester and is in the final stages of his job search. During the past year David worked on research methods for estimating treatment effects that are motivated by precision medicine. The overarching goal is to use patient data from sources such as electronic medical records (EMRs) to infer effects of treatment for different groups of patients so that optimal treatment strategies can be identified. In real-world data such as EMRs, a host of issues complicate treatment effect estimation, such as availability of high-dimensional covariates, imperfectly measured variables, and fragmentation of relevant data across separate data sources. He is developing approaches that take steps to address some of these issues encountered when estimating causal effects in practice. In the first project, he and his collaborators developed a double-index propensity score for handling adjustment for confounding with high-dimensional covariates. They demonstrated through theoretical arguments and numerical evidence that the approach is efficient and robust to misspecification of the underlying parametric model. In a second project, he extended the approach to allow for incorporation of large sets of unlabeled data for efficiency gain. The approach imputes outcomes in the unlabeled data such that consistent estimators can be obtained even under misspecification of the imputation model. It is also shown to be robust and efficient through theoretical and numerical evidence. In other work, he developed an approach that uses tree methods to identify interpretable subgroups with enhanced treatment effects from clinical trial data. This work aims to define subgroups that are easy to interpret for clinicians and other users based on conditional treatment effect estimates obtained from black-box algorithms that are difficult to understand.
The work aims to balance interpretability and performance to help facilitate use of statistical procedures for selecting treatment in practice. In another project, David is developing an approach that integrates randomized and observational data sources to estimate conditional treatment effects. The project seeks to leverage the complementary strengths of each design to efficiently estimate treatment effects using both data sources without introducing bias from incorporation of the observational data. Based on the projects described above, David has four manuscripts in process. The first, “Estimating Average Treatment Effects with a Response-Informed Calibrated Propensity Score,” is in process and his role is to conduct theoretical and numerical studies, perform data analysis, and draft the paper. The second, “Efficient and Robust Semi-Supervised Estimation of Average Treatment Effects in Electronic Medical Records Data,” is in revision and his role was to conduct theoretical and numerical studies, perform data analysis, and draft and revise the paper. The third, “Identifying Treatment Subgroups by Tree Approximations of Conditional Treatment Effects,” is in process and his role is to conduct theoretical and numerical studies, perform data analysis, and draft the paper. The fourth, “Adaptive Combination of Conditional Treatment Effect Estimators Based on Randomized and Observational Data,” is in process and his role is to conduct theoretical and numerical studies, perform data analysis, and draft the paper. David has had multiple opportunities to present his research. He presented “Efficient and Robust Semi-Supervised Estimation of Treatment Effects in Electronic Medical Records Data” at the ISCB Annual Meeting on 7/12/2017 in Vigo, Spain, and at the ENAR Conference on 3/26/18 in Atlanta, Georgia.
At the Quantitative Issues in Cancer working group, he presented “Efficient and Robust Semi-Supervised Estimation of Treatment Effects in Electronic Medical Records Data” on 9/18/17 and “Adaptive Combination of Parallel Randomized-Observational Data for Conditional Effects” on 2/26/18. He received the 2018 ENAR Distinguished Student Paper Award and the 2018 JSM Biopharmaceutical Section Student Paper Award.
Leah A. Comment
advisor: Corwin Zigler
Leah is a fourth-year PhD student. Leah’s main research interests at this time involve causal inference using observational data, with special emphasis on uncertainty for decision making. At the policy level, this means quantifying uncertainty about research conclusions with sensitivity analysis methods. For decision making on an individual patient level, this means targeting quantities that convey a richer understanding of prognosis as well as uncertainty regarding clinical course under various treatment options. She has two ongoing research projects, both of which are in the area of Bayesian causal inference in the context of cancer research. The first is a data integration method for combining cancer registry and cohort study data to perform principled sensitivity analysis for unmeasured confounding. The second project involves non-mortality time-to-event outcomes in settings with high mortality. In particular, she is developing causal estimands and Bayesian estimation strategies to characterize treatment effects on outcomes like hospital readmission when differential survival in the treatment arms leads to incomparability of the at-risk groups. Leah is currently working on two manuscripts related to this work, and she is responsible for developing the methods, designing the simulation studies, performing the analysis for the data application, and drafting the manuscript for both. In addition to her research, she presented her work at various conferences. In August 2017 she presented “External data integration for Bayesian causal inference in the presence of unmeasured confounding in mediation” at the Joint Statistical Meetings in Baltimore, Maryland. She also attended StanCon 2018 in Monterey, California in January to present “Causal inference using the g-formula in Stan.”
In April 2018 she traveled to Florence, Italy to present “Time-varying survivor average causal effects with semicompeting risks” at the European Causal Inference Meeting (EuroCIM), and she was awarded the Best Early Career Presentation for this work. Most recently, Leah co-presented “Introduction to Bayesian Workflow in Stan” in May at the Open Data Science Conference in Boston, MA. Leah also presented her work at the Quantitative Issues in Cancer working group. On 11/20/17 she presented “Semicompeting risks and survivor average causal effects in continuous time.” Then, on 3/19/18 she presented “Time-varying survivor average causal effects with semicompeting risks.”
advisor: Giovanni Parmigiani
Theodore is a fourth-year PhD student. Theo’s main research interest at this time is building statistical models to improve familial cancer risk prediction. He is currently working on a project that improves the BRCAPRO model, which is a statistical model that predicts breast and ovarian cancer risk. He is using a frailty model to account for risk heterogeneity across families due to unobserved genetic or environmental factors. This frailty model, applied to BRCAPRO, improves breast cancer risk prediction as shown through validation using data from the Cancer Genetics Network. He is currently working on a paper discussing this work, and he will be the first author and responsible for formulating the model, writing code, and analyzing the data. Theo presented this work, “Using frailty models to improve breast cancer risk prediction,” at the ENAR Conference on 3/25/18 in Atlanta, Georgia. Theo also presented his work at the Quantitative Issues in Cancer working group. On 9/25/17 he presented “Using Frailty Models to Improve Breast Cancer Risk Prediction.” Then, on 1/29/18 he presented “Improving the Probability of Carrying Cancer-Susceptibility Genes Using Gradient Boosting.”
advisor: Alkes Price
Margaux is a second-year PhD student. Margaux’s main research interest at this time is statistical genetics and meta-analyses. She passed the Department’s written qualifying exam in January 2018. In the last academic year, she has taken the following courses: BST240: Probability II, BST235: Advanced Regression, BST214/238: Clinical Trials, BST241: Inference II, and BST245: Multivariate and Longitudinal Data. In addition to her coursework she has been working on two research projects. In the first project she is working with Drs. Danielle Braun and Giovanni Parmigiani to estimate the prevalence of rare germline genetic mutations in the general population, as it can inform genetic counseling and risk management. Most studies that estimate the prevalence of mutations are performed in high-risk populations, and each study is designed with differing inclusion (i.e., ascertainment) criteria. Quantifying the ascertainment mechanisms is necessary in order to estimate the prevalence in the general population. Combining estimates from multiple studies through a meta-analysis is challenging due to the differing study designs and ascertainment mechanisms as well as the complexity of quantifying these ascertainment mechanisms. This quantification is often not straightforward, as the inclusion criteria are often based on disease status and/or family history. They are working to provide guidelines on how to quantify the ascertainment mechanism for a wide range of settings. They propose a general approach for conducting a meta-analysis under these complex settings by incorporating study-specific ascertainment mechanisms into a joint likelihood function. They implement the proposed likelihood-based approach using both frequentist and Bayesian methodology. They evaluate these approaches in simulations and apply their methods in an illustrative example to estimate the prevalence of PALB2 in the United States. They are currently working on a manuscript for this project.
Margaux will be the first author, and she is responsible for developing the methods and analysis. She is also working with Dr. Price to explore the role of CREs in disease and complex trait heritability and how sequence conservation of these elements may affect their role. Margaux also presented her work at the Quantitative Issues in Cancer working group. On 10/23/17 she presented “Introduction to Statistical Approaches for Meta-Analysis of Genetic Mutation Prevalence.” Then, on 3/5/18 she presented “Update on Statistical Approaches for Meta-Analysis of Genetic Mutation Prevalence.”
advisor: Rebecca Betensky
Melissa is a first-year PhD student. Melissa’s main research interest at this time is extensions of the Kaplan-Meier estimator to estimate the survival functions for groups of individuals with certain time-varying covariate values. Her main focus at this time is her coursework. In the current academic year, she has completed the following courses: BST230: Probability Theory and Applications I, BST232: Methods I, EPI201: Introduction to Epidemiology: Methods I, EPI202: Epidemiologic Methods II: Elements of Epidemiologic Research, EPI249: Molecular Biology for Epidemiologists, BST214: Principles of Clinical Trials, EPI213: Epidemiology of Cancer, BST231: Statistical Inference I, BST233: Methods II, and BST238: Advanced Topics in Clinical Trials. In addition to her coursework she has done some research with Dr. Betensky. They have been researching the use of time-varying covariates in Kaplan-Meier estimators. This is a common statistical issue seen in the medical literature, as researchers are often interested in displaying the survival functions of multiple cohorts for comparison. Kaplan-Meier curves are an excellent tool for displaying the estimated survival functions of cohorts defined by a baseline covariate, but they cannot appropriately demonstrate survival differences for cohorts defined by time-varying covariates. Melissa has been working on a literature review to determine the best ways to display survival for cohorts defined by time-varying covariates. Melissa has also worked with Dr. Betensky on an extension of the Kaplan-Meier estimator to account for cohorts defined by covariate paths. This extension will allow researchers to visualize how a change in covariate value at a given time point relates to patient prognosis. She conducted simulations to demonstrate how the estimator performs under different circumstances and is working to incorporate an adjustment for confounders into this estimator.
This work was motivated by several time-varying covariates seen in cancer research. For example, researchers are sometimes interested in whether tumor response is a prognostic indicator of survival, where tumor response status is a time-varying covariate. She presented this work at the Quantitative Issues in Cancer working group. Her first presentation, “Use of time-varying covariates in Kaplan-Meier estimators,” was in December 2017. Her second presentation, “Displaying survival of groups defined by covariate paths,” was in April 2018.
advisor: Martin Aryee and Christoph Lange
Divy is a fourth-year PhD student. Divy’s main research interest at this time is statistical methods to understand sparse data in single-cell methylation data and rare-variant haplotype data. His dissertation research also involves building computational software and bioinformatics platforms to conduct these analyses. In Divy’s first research project, he and his collaborators have developed a set of publicly accessible cloud-based preprocessing and quality control pipelines for bisulfite sequencing DNA methylation data that go from raw data to CpG-level methylation estimates. Leveraging cloud computing resources allows users to 1) achieve scalability to large whole genome datasets with 100GB+ of raw data per sample and to single-cell datasets with thousands of cells, 2) access best-practice analysis pipelines, 3) ensure reproducibility of analyses, and 4) enable integration and comparison between user-provided data and publicly available data (e.g., TCGA), as all samples can be processed through the same pipeline. This analysis platform is available to users on the FireCloud website of the Broad Institute of MIT and Harvard. His second research project also looks at single-cell methylation. At present several single-cell methylation protocols exist to understand disease and normal state mechanisms, but they all suffer from low coverage due to the low quantity of input DNA in a single cell. They find that, on average, only about 5–10% of CpGs are observed in typical single-cell libraries, and show how missingness of methylation status can bias seemingly simple metrics such as mean methylation estimates and clustering analyses. They propose a joint analysis approach that leverages either bulk sequencing data or a consensus generated from a large number of single cells to infer bias-corrected single-cell methylation status. Divy is currently working on three manuscripts related to this work.
In the first, “A (Fire)Cloud-Based DNA Methylation Data Preprocessing Platform,” his role is to build the analysis platform, develop the quality control analysis package, test the platform, write and edit the manuscript, and maintain the tools. In the second manuscript, “Modeling missingness in single-cell DNA methylation data,” Divy’s role is to conduct exploratory analysis to understand the issues in the data, develop methods, and test them with real data. In the third manuscript, “Identify population substructure via haplotype based Jaccard index,” Divy is currently conducting exploratory analysis. He has presented his work in various venues. On August 16, 2017 he presented “Error rate Characterization in Ultra Deep UMI Sequencing” at Illumina, Inc. Then, in September 2017 he presented “Challenges and solution for analyzing single cell methylation data” at the Center for Cancer Research at Massachusetts General Hospital. In December 2017 he presented “Computational and statistical challenges in single-cell methylation data analysis” at the Quantitative Issues in Cancer working group, and “A (Fire)Cloud-Based DNA Methylation Data Processing Platform” at a Broad Institute retreat. Most recently, he presented “Modelling missingness in Single-Cell Methylation Data” at the April 2018 Quantitative Issues in Cancer working group. Divy has also attended a workshop and participated in career development activities. In February 2018 he attended an introductory SQL workshop held at CGIS S020 at Harvard University. In the fall of 2017 he also attended the session “Postdoc to Professor: Lunchtime chat on academic careers in biostatistics” as well as a Data Science in Tech Career Panel.
advisor: Martin Aryee
Kelly is a third-year PhD student. She has completed all required coursework. Kelly’s main research interest at this time is in using statistical and machine learning methods to identify biomarkers that can be used to develop new, low-cost diagnostic tests for detecting cancer. She is currently working on a project to explore multiple methods for detecting a signal when the cell type of interest constitutes only a small portion of the total number of cells. Hepatocellular carcinoma is the most common type of liver cancer and a leading cause of cancer death worldwide. Most cases develop within an at-risk population, namely those with chronic liver disease. These patients are put under clinical surveillance, and diagnosis is done via imaging. However, MRIs are expensive and not always available in low-resource areas. Kelly and her collaborators seek to develop a diagnostic blood test to identify, as early as possible, those with hepatocellular carcinoma among those who are at risk. They have RNA-seq data from blood samples that have been enriched for circulating tumor cells (CTCs), if present, via the CTC-chip from MGH. However, the CTCs still make up only ~1% of the cells in the sample. Kelly and her collaborators are currently working on a paper for this work, and her role is to design and conduct simulations as well as draft the paper.
advisor: Giovanni Parmigiani
Maya is a first-year PhD student. Her main focus at this time is her coursework. In the current academic year, she has completed the following courses: BST230: Probability Theory and Applications I, BST231: Statistical Inference I, BST232: Methods I, BST233: Methods II, BST234: Introduction to Data Structures and Algorithms, EPI201: Introduction to Epidemiology: Methods I, and MIT 6.867: Machine Learning. This summer she plans to work with Dr. Parmigiani on tree-weighting strategies in cross-study ensemble learning. Maya will construct weighting approaches to use on decision trees within the context of cross-study learner construction and compare these approaches with the performance of Random Forest. The aim is to discover which weighting approaches work best when dealing with different levels of heterogeneity within studies, as well as in the presence or absence of interactions between covariates. She presented her work at the Quantitative Issues in Cancer working group. Her first presentation, “Bayesian regression tree models for causal inference,” was in November 2017. Then, in April 2018 she presented “Tree-weighting approaches in constructing cross-study learners.”
advisor: Sebastien Haneuse
Tanayott is a third-year PhD student. Tony’s main research interest at this time is electronic health records (EHR). EHR include rich data on large populations over long periods of time, but missing data are extremely common, and analyses that exclude patients on the basis of incomplete data are subject to selection bias. He is building statistical methods to adjust for selection bias due to missing data in EHR-based research, with a focus on the relationship between bariatric surgery, obesity, and cancer risk. He is working with Dr. Haneuse on methods for addressing selection bias due to missing data. These methods involve breaking down the complex process that governs whether or not a patient has complete data by characterizing the data provenance, that is, the process by which data appear in the EHR. Specifically, they have developed a framework that combines multiple imputation and inverse probability weighting to adjust for selection bias in this setting. Tony has conducted extensive simulations evaluating this method and has demonstrated bias and efficiency advantages compared to naive complete-case analyses or standard missing data methods. Through their collaboration with David Arterburn at the Kaiser Permanente Washington Health Research Institute, they have obtained data from the PROMISE study, an NIH-funded study (R01 DK092317) that uses electronic health records to examine the long-term health outcomes of patients in the Kaiser Permanente system who underwent bariatric surgery between 1997 and 2013. He is in the process of formally deriving the asymptotic properties of the estimators constructed using their method, and is currently writing a manuscript that includes a detailed description of their approach, simulation results demonstrating the efficacy of the approach, and a data analysis applying the approach in the context of the PROMISE study. His role is developing methods, designing and conducting simulations, analyzing data, and drafting the paper.
Tony has also outlined an approach for his next manuscript, which involves developing methods for sensitivity analyses when data in EHR are missing not at random (MNAR); that is, when the probability that some covariate or outcome is measured depends on the value of the covariate itself, or on other factors that are not completely measured in the EHR. He is interested in assessing the extent to which estimators yielded by his methods are impacted by such unobserved data. Tony presented this work at multiple venues over the past year. In October 2017 he presented “Adjusting for selection bias due to missing data in electronic health records-based research” at the Quantitative Issues in Cancer working group (Boston, MA). He also presented “Combining inverse probability weighting and multiple imputation to adjust for selection bias due to missing data in electronic health records-based research” in March 2018 at the ENAR Conference (Atlanta, Georgia), and in April 2018 at the Quantitative Issues in Cancer working group (Boston, MA). Finally, Tony presented “Adjusting for selection bias due to missing data in electronic health records-based research” at the Joint Cancer Training Grant Symposium (Boston, MA) in April 2018. In Fall 2017 Tony enrolled in the course “Effective Grant and Research Proposal Writing for Biostatistics Research” offered in the Biostatistics department, in which he learned how to develop an effective dissertation proposal for biostatistical research, gained experience constructively critiquing the research proposals of his peers, and developed a fundable grant proposal for submission to a funding agency. His proposal, “Adjusting for selection bias due to missing data in electronic health records-based research,” was funded, and the project is likely to begin in August 2018.
advisor: Guocheng Yuan
Sam is a fourth-year PhD student. Sam’s research focuses on methods for single-cell sequencing data that (1) reduce the bias from technical sources by imputing dropout expression values in a non-parametric fashion, (2) rank cell trajectory and lineage predictions through a quantitative method selection process, and (3) map single-cell data across sequencing platforms to enhance the quality of available information for downstream analyses. His secondary focus is in methods for functions that characterize time-to-event distributions. Sam’s thesis consists of an iterative adaptation of multiple imputation for technical dropout in single-cell RNA-sequencing data utilizing an inverse-variance weighting analog with the goal of improving cell clustering analyses, a robust selection method for cell trajectory models fitted to single-cell RNA-sequencing data, and a generalized method for mapping cell types across single-cell sequencing platforms (e.g., from an experiment to a cell atlas) using support vector machines. The first method is nearly complete and the other two are in development. Additionally, he has completed a collaborative manuscript developing optimal-area confidence bands for functions that characterize time-to-event distributions and is assisting with a work in progress regarding clinical decision scores for the drainage of pericardial effusion. Two manuscripts are in process as a result of this work. The first, “OptBand: optimal confidence bands for functions to characterize time-to-event distributions,” was submitted to Biometrika on February 19, 2018. Sam was co-first author and his role was to develop theory and draft the manuscript with colleague Tom Chen, as well as design and conduct simulations and data analysis. He also presented this work in March 2018 at the ENAR Conference (Atlanta, Georgia), and at the Quantitative Issues in Cancer working group in April 2018.
The second manuscript, “Multiple Imputation for Single-Cell: a nonparametric method to impute dropout in single-cell RNA-sequencing data,” is in progress. Sam’s role is to develop the theory behind Rubin Dries’s original idea, draft the manuscript, and design and conduct simulations as well as data analysis. In addition to research, Sam is actively involved in the Biostatistics Student Consulting Center, and is a co-founder and instructor at StatStart: A High School Summer Program, which is held in the Harvard Chan Biostatistics Department.
Current Postdoctoral Fellows (2017-2018)
Mentor: Giovanni Parmigiani
Dr. Patil is a Postdoctoral Fellow. His main research interest is training and evaluating genomic predictors for cancer risk and severity classification. Dr. Patil’s current research project focuses on how to train predictors when multiple studies’ worth of information is available. This setting introduces questions of when to merge datasets, what to do when the number of features differs across datasets, and how inter-study heterogeneity impacts training strategies. He is also studying how predictors transfer across data-generating platforms, contexts, and technologies. Multiple presentations and manuscripts have resulted from this work. Dr. Patil is currently working on three manuscripts. In the first, “Data usage strategies in the presence of cross-study heterogeneity,” he devised the setting and experiments and guided the derivations and analysis. In the second, “Accounting for differing feature sets when training predictors in multiple studies,” he also devised the setting and experiments and guided the derivations and analysis. In the third, “Transferring genomic signatures for risk stratification to RNA-seq data in Multiple Myeloma,” he devised the experiments and aided and guided the data management and data analysis. In addition, he has one accepted paper, “Training replicable predictors in multiple studies,” with Dr. Parmigiani. His role was to design and perform the research, analyze data, and write the paper. Dr. Patil presented “Setting expectations for replication in science” at the Western North American Region of the International Biometric Society meeting (Santa Fe, New Mexico) in June 2017. He also presented “Training Replicable Predictors in Multiple Studies” in March 2018 at the ENAR Conference (Atlanta, Georgia). Prasad also presented his work at the Quantitative Issues in Cancer working group.
His first presentation, “Modeling study heterogeneity and deciding when to merge datasets,” was in November 2017, and his second presentation, “Variance of the K-Fold Cross-Validation Estimator,” was in May 2018. In addition to his research and the associated manuscripts and presentations, Dr. Patil participated in a Clinical Research Orientation Program for PhDs (CROPP) from September 2017 through November 2017. Under the guidance of a clinical research mentor, he learned how research is conducted in hospital settings, including IRBs, clinical research centers, consent, and ethics.