Browsing by Author "Meng, Weilin"
Item: An Automated Line-of-Therapy Algorithm for Adults With Metastatic Non-Small Cell Lung Cancer: Validation Study Using Blinded Manual Chart Review
(JMIR Publications, 2021-10-12) Meng, Weilin; Mosesso, Kelly M.; Lane, Kathleen A.; Roberts, Anna R.; Griffith, Ashley; Ou, Wanmei; Dexter, Paul R.; Biostatistics & Health Data Science, School of Medicine
Background: Extraction of line-of-therapy (LOT) information from electronic health record and claims data is essential for determining longitudinal changes in systemic anticancer therapy in real-world clinical settings. Objective: The aim of this retrospective cohort analysis is to validate and refine our previously described open-source LOT algorithm by comparing the output of the algorithm with results obtained through blinded manual chart review. Methods: We used structured electronic health record data and clinical documents to identify 500 adult patients treated for metastatic non-small cell lung cancer with systemic anticancer therapy from 2011 to mid-2018; we assigned patients to training (n=350) and test (n=150) cohorts, randomly divided proportional to the overall ratio of simple:complex cases (n=254:246). Simple cases were patients who received one LOT and no maintenance therapy; complex cases were patients who received more than one LOT and/or maintenance therapy. Algorithmic changes were performed using the training cohort data, after which the refined algorithm was evaluated against the test cohort. Results: For simple cases, 16 instances of discordance between the LOT algorithm and chart review prerefinement were reduced to 8 instances postrefinement; in the test cohort, there was no discordance between algorithm and chart review. For complex cases, algorithm refinement reduced the discordance from 68 to 62 instances, with 37 instances in the test cohort.
The percentage agreement between LOT algorithm output and chart review for patients who received one LOT was 89% prerefinement, 93% postrefinement, and 93% for the test cohort, whereas the likelihood of precise matching between algorithm output and chart review decreased with an increasing number of unique regimens. Several areas of discordance that arose from differing definitions of LOTs and maintenance therapy could not be objectively resolved because of a lack of precise definitions in the medical literature. Conclusions: Our findings identify common sources of discordance between the LOT algorithm and clinician documentation, providing the possibility of targeted algorithm refinement.

Item: Comparing PSO-based clustering over contextual vector embeddings to modern topic modeling
(Elsevier, 2022-05) Miles, Samuel; Yao, Lixia; Meng, Weilin; Black, Christopher M.; Miled, Zina Ben; Electrical and Computer Engineering, School of Engineering and Technology
Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In the present paper, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on two datasets. The first dataset consists of posts from the online health forum r/Cancer and the second dataset is a standard benchmark for topic modeling which consists of a collection of messages posted to 20 different news groups.
When compared to the state-of-the-art generative document models (i.e., ETM and NVDM), pPSO is able to produce interpretable clusters. The results indicate that pPSO is able to capture both common and emergent topics. Moreover, the topic coherence of pPSO is comparable to that of ETM and its topic diversity is comparable to that of NVDM. The assignment parity of pPSO on a document completion task exceeded 90% for the 20NewsGroups dataset. This rate drops to approximately 30% when pPSO is applied to the same Skip-Gram embedding derived from a limited, corpus-specific vocabulary which is used by ETM and NVDM.

Item: Development and Temporal Validation of an Electronic Medical Record-Based Insomnia Prediction Model Using Data from a Statewide Health Information Exchange
(MDPI, 2023-05-05) Holler, Emma; Chekani, Farid; Ai, Jizhou; Meng, Weilin; Khandker, Rezaul Karim; Ben Miled, Zina; Owora, Arthur; Dexter, Paul; Campbell, Noll; Solid, Craig; Boustani, Malaz; Electrical and Computer Engineering, School of Engineering and Technology
This study aimed to develop and temporally validate an electronic medical record (EMR)-based insomnia prediction model. In this nested case-control study, we analyzed EMR data from 2011–2018 obtained from a statewide health information exchange. The study sample included 19,843 insomnia cases and 19,843 controls matched by age, sex, and race. Models using different ML techniques were trained to predict insomnia using demographics, diagnosis, and medication order data from two surveillance periods: −1 to −365 days and −180 to −365 days before the first documentation of insomnia. Separate models were also trained with patient data from three time periods (2011–2013, 2011–2015, and 2011–2017). After selecting the best model, predictive performance was evaluated on holdout patients as well as patients from subsequent years to assess the temporal validity of the models.
An extreme gradient boosting (XGBoost) model outperformed all other classifiers. XGBoost models trained on 2011–2017 data from −1 to −365 and −180 to −365 days before index had AUCs of 0.80 (SD 0.005) and 0.70 (SD 0.006), respectively, on the holdout set. On patients with data from subsequent years, a drop of at most 4% in AUC was observed for all models, even when there was a five-year difference between the collection period of the training and the temporal validation data. The proposed EMR-based prediction models can be used to identify insomnia up to six months before clinical detection. These models may provide an inexpensive, scalable, and longitudinally viable method to screen for individuals at high risk of insomnia.

Item: Inferring the patient’s age from implicit age clues in health forum posts
(Elsevier, 2022-01) Black, Christopher M.; Meng, Weilin; Yao, Lixia; Ben Miled, Zina; Electrical and Computer Engineering, School of Engineering and Technology
Broader patient-reported experiences in oncology are largely unknown due to the lack of available information from traditional data sources. Online health community data provide an exploratory way to uncover these experiences at a large scale. Analyzing these data can guide further studies towards understanding patients’ needs and experiences. However, analysis of online health data is inherently difficult due to the unstructured nature of these data and the variety of ways information can be expressed over text. Specifically, subscribers may not disclose critical information such as the age of the patient in their posts. In fact, the number of health forum posts that explicitly mention the age of the patient is significantly lower than the number of posts that do not include this information in the Reddit r/Cancer health forum under consideration in the present paper. Health-focused studies often need to consider or control for age as a confounder, hence the importance of having sufficient age data.
This paper presents a methodology that can help classify health forum posts according to four age groups (0–17, 18–39, 40–64 and 65+ years) even when the posts do not contain explicit mention of the age of the patient. First, the subset of the posts that include explicit mention of the age of the patient is identified. Second, the explicit age clues are removed from these posts and used to train the proposed age classifier. The resulting classifier is able to infer the age of the patient using only implicit age clues with an average true positive rate (TPR) of 71%. This TPR is comparable to the average TPR of 69% obtained from human annotations for the same set of posts.

Item: Modeling acute care utilization: practical implications for insomnia patients
(Springer Nature, 2023-02-07) Chekani, Farid; Zhu, Zitong; Khandker, Rezaul Karim; Ai, Jizhou; Meng, Weilin; Holler, Emma; Dexter, Paul; Boustani, Malaz; Ben Miled, Zina; Medicine, School of Medicine
Machine learning models can help improve health care services. However, they need to be practical to gain wide adoption. In this study, we investigate the practical utility of different data modalities and cohort segmentation strategies when designing models for emergency department (ED) and inpatient hospital (IH) visits. The data modalities include socio-demographics, diagnosis and medications. Segmentation compares a cohort of insomnia patients to a cohort of general non-insomnia patients under varying age and disease severity criteria. Transfer testing between the two cohorts is introduced to demonstrate that an insomnia-specific model is not necessary when predicting future ED visits, but may have merit when predicting IH visits especially for patients with an insomnia diagnosis. The results also indicate that using both diagnosis and medications as a source of data does not generally improve model performance and may increase its overhead.
Based on these findings, the proposed evaluation methodologies are recommended to ascertain the utility of disease-specific models in addition to the traditional intra-cohort testing.

Item: A social and news media benchmark dataset for topic modeling
(Elsevier, 2022-07-04) Miles, Samuel; Yao, Lixia; Meng, Weilin; Black, Christopher M.; Ben-Miled, Zina; Electrical and Computer Engineering, School of Engineering and Technology
Topic modeling is an active research area with several unanswered questions. The focus of recent research in this area is on the use of a vector embedding representation of the input text with both generative and evolutionary topic modeling techniques. Unfortunately, it is hard to compare different techniques when the underlying data and preprocessing steps that were used to develop the models are not available. This paper presents two secondary datasets that can help address this gap. These datasets are derived from two primary datasets. The first consists of 8145 posts from the r/Cancer health forum and the second consists of 18,294 messages submitted to 20 different news groups. The same preprocessing procedure is applied to both datasets by removing punctuation, stop words and high frequency words. Each dataset is then clustered using three different topic modeling techniques (pPSO, ETM and NVDM) and three topic numbers (10, 20 and 30). In addition, for pPSO two text embedding representations are considered: sBERT and Skip-Gram. The secondary datasets were originally developed in support of a comparative analysis of the aforementioned topic modeling techniques in a study titled “Comparing PSO-based Clustering over Contextual Vector Embeddings to Modern Topic Modeling” submitted to the Journal of Information Processing and Management.
The present paper provides a detailed description of the two secondary datasets, including the unique identifier that can be used to retrieve the original documents, the preprocessing scripts, and the topic keywords generated by the three topic modeling techniques with varying topic numbers and embedding representations. As such, the datasets allow direct comparison with other topic modeling techniques. To further facilitate this process, the algorithm underlying the evolutionary topic modeling technique, pPSO, proposed by the authors is also provided.
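As a rough illustration of the preprocessing step named in the abstract above (removing punctuation, stop words and high-frequency words), the following Python sketch shows one way such a pipeline might look. It is not the published preprocessing script: the function names, the tiny stop-word list and the `max_doc_fraction` cutoff are all assumptions made for illustration, since the abstract does not state how "high frequency" is defined.

```python
import string
from collections import Counter

# Hypothetical stop-word list; the list used for the datasets is not given in the abstract.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "my"}


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, split on whitespace, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]


def preprocess(docs, max_doc_fraction=0.5):
    """Tokenize each document, then drop words that appear in more than
    max_doc_fraction of all documents (an assumed definition of "high frequency")."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: count each word once per document it appears in.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    limit = max_doc_fraction * len(docs)
    return [[w for w in tokens if doc_freq[w] <= limit] for tokens in tokenized]


if __name__ == "__main__":
    posts = [
        "The scan, today, was clear!",
        "Scan results for my chemo.",
        "Waiting on the scan again.",
    ]
    for tokens in preprocess(posts):
        print(tokens)
```

Applying one function to both corpora mirrors the abstract's point that the same procedure is used for both datasets; a document-frequency cutoff is only one of several plausible ways to define high-frequency words.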