Using Pointwise Mutual Information for Breast Cancer Health Disparities Research With SEER-Medicare Claims

Identification of procedures using International Classification of Diseases or Healthcare Common Procedure Coding System codes is challenging when conducting medical claims research. We demonstrate how Pointwise Mutual Information can be used to find associated codes. We apply the method to an investigation of racial differences in breast cancer outcomes. We used Surveillance Epidemiology and End Results (SEER) data linked to Medicare claims. We identified treatment using two methods. First, we used previously published definitions. Second, we augmented definitions using codes empirically identified by the Pointwise Mutual Information statistic. Similar to previous findings, we found that presentation differences between Black and White women closed much of the estimated survival curve gap. However, we found that survival disparities were completely eliminated with the augmented treatment definitions. We were able to control for a wider range of treatment patterns that might affect survival differences between Black and White women with breast cancer.

a proxy for patients' actual treatments. There are thousands of International Classification of Diseases (ICD)-9, ICD-10, and Current Procedural Terminology (CPT) codes in use that identify diagnoses and procedures in medical claims. Those codes are updated regularly, and there are numerous ways to encode patients' conditions and treatments. Medicare incorporates CPT codes into Healthcare Common Procedure Coding System (HCPCS) codes.
In practice, designing rules that identify treatments in Medicare data is a time-consuming process based on study of claims and codes, clinical reasoning, and scientific evidence. Miller et al. (2008, 2009), for example, developed an algorithm for identifying laparoscopic surgery among kidney cancer cases before claims codes for laparoscopic surgery were well developed. While such algorithms are useful for others pursuing similar investigations (Smaldone et al., 2012), there may still be substantial mismatch between treatment identified by the SEER cancer registry and treatment identified through Medicare claims. Noone et al. (2016) suggested that Medicare claims should be used to supplement SEER treatment data, as claims are more comprehensive and reliable. Indeed, Bleicher et al. (2012) found substantial mismatch between SEER listed treatments and Medicare claims identified treatments. Hence, despite their best efforts, investigators may still find it challenging to identify the combinations of codes that define specific treatments. Enhanced methods to efficiently identify relevant codes are needed.
Informed by recent advances in natural language processing, we adapted machine learning algorithms (Levy & Goldberg, 2014a, 2014b) to find vector representations of diagnosis and procedure codes from Medicare claims data, in which related codes that co-occur or occur in the same contexts or neighborhoods are clustered together. Given an initial set of codes an investigator believes are relevant for identifying a treatment, our method will automatically find related codes. The algorithm is generalizable to changes in codes, such as recent transitions from ICD-9 to ICD-10 codes. In this paper, we document a software assistant that can be used to identify related codes. We demonstrate the algorithm using a SEER-Medicare breast cancer example. We reproduced, but with more contemporary data, the work of Silber et al. (2013), who found that survival differences between Black and White women in the United States could largely be explained by differences in cancer presentation at diagnosis. That is, while Black women and White women with breast cancer have sizable survival differences, the differences were reduced after controlling for non-cancer comorbidities and severity of disease, such as tumor stage, grade, and lymph node involvement. Still, Silber et al. (2013) found that there were some residual survival differences between Black and White women, even after further controlling for the type of cancer treatment received. We examined whether identifying Medicare treatment codes using our software assistant could better control for confounding when examining racial differences.

Method

Participants
We used Surveillance Epidemiology and End Results (SEER) data linked to Medicare claims. SEER is maintained by the National Cancer Institute and has long-term data on tumor characteristics and demographic information about incident cancers for over 14% of the United States (https://seer.cancer.gov/registries/). Expansion since 2000 has resulted in more recent data capturing over 28% of the US population. Medicare covers almost all individuals over 65 years old in SEER. Fee-for-service claims from Medicare Part A and Part B provide a thorough record of treatments and services obtained before and after cancer diagnosis.
We emulated the same exclusion criteria and methods detailed in the supplement of Silber et al. (2013). We primarily examined cases diagnosed from 1992 to 2005 to largely replicate the sample of Silber et al. (2013) which examined cases through 2005. Since we had additional years of data, we repeated the analyses with cases diagnosed 2006 through 2013 and claims through 2014. We restricted our breast cancer case sample to individuals with Medicare Parts A and B over the age of 66. Those with managed care contracts were excluded due to a lack of claims.
We used propensity score matching to match every Black woman to one White woman using sets of potentially confounding variables that mimic those used by Silber et al. (2013). We matched first on demographics, second on demographics and clinical presentation variables, and third on demographic, clinical presentation, and treatment variables. Silber et al. (2013) used this strategy to show that the survival differences between Black and White women largely disappeared after controlling for clinical presentation.
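The one-to-one matching step can be illustrated with a minimal sketch. It assumes propensity scores have already been estimated and uses greedy nearest-neighbor matching without replacement, which is an assumption made for illustration and not necessarily the matching algorithm used in this study. The function name `greedy_match` and the example scores are hypothetical.

```python
def greedy_match(treated, controls):
    """Pair each treated unit with the nearest unused control by propensity score.

    treated, controls: dicts mapping a unit id to an estimated propensity score.
    Returns a dict {treated_id: control_id}. Greedy matching is an assumption
    made for this sketch; it is not claimed to be the study's algorithm.
    """
    available = dict(controls)
    pairs = {}
    # Process treated units from highest to lowest score for reproducibility.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break  # no controls left to match
        # Closest remaining control; ties are broken by control id.
        c_id = min(available, key=lambda c: (abs(available[c] - t_score), c))
        pairs[t_id] = c_id
        del available[c_id]  # matching without replacement
    return pairs


# Hypothetical propensity scores for two Black and three White women.
black = {"B1": 0.62, "B2": 0.35}
white = {"W1": 0.60, "W2": 0.30, "W3": 0.90}
matches = greedy_match(black, white)
print(matches)
```

In this toy example each Black woman is paired with the White woman whose score is closest, and the unmatched control is dropped from the matched sample.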

Instruments
Demographic variables included age, entered into the propensity score model via restricted cubic splines (Harrell, 2001, Ch. 2), and year of diagnosis and SEER registry, entered as categorical variables. Clinical presentation included tumor size (categorical with centimeter increments to ≥ 4 centimeters and a missing indicator), estrogen receptor positivity (ER+), progesterone receptor positivity (PR+), stage of cancer (Categorical I-IV, unknown), grade (five categories including missing) and 25 comorbidities as detailed in the tables. Many of the comorbidities used corresponded to those in the Charlson Comorbidity Index (Charlson et al., 1987). Treatment included number of nodes removed and positive, entered via restricted cubic splines with four knots, mastectomy, breast conserving therapy, radiation, surgery, chemotherapy, and particularly whether the chemotherapies were doxorubicin or taxanes. We included all two-way, three-way, and four-way treatment interactions in the propensity score model. We did not adjust for neighborhood level income or education variables, as Silber et al. (2013) did not include those in primary analyses.

Procedure
We identified treatment using two methods. First, we used the ICD-9 and CPT definitions of Silber et al. (2013) directly. We searched for chemotherapy or surgery that occurred within six months of diagnosis, or radiation therapy that occurred within nine months of diagnosis.
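The time-window rule above can be sketched as a simple date check. The day counts used for "six months" and "nine months" (183 and 274 days) and the choice to count only claims on or after the diagnosis date are assumptions made for this illustration; the names `WINDOW_DAYS` and `in_window` are hypothetical.

```python
from datetime import date

# Assumed day counts for the six-month and nine-month windows.
WINDOW_DAYS = {"chemotherapy": 183, "surgery": 183, "radiation": 274}


def in_window(treatment, diagnosis_date, claim_date):
    """Return True if the claim falls inside the window for this treatment type.

    Assumes the window starts at the diagnosis date and runs forward only.
    """
    days = (claim_date - diagnosis_date).days
    return 0 <= days <= WINDOW_DAYS[treatment]


# A claim 230 days after diagnosis counts for radiation but not for surgery.
dx = date(2004, 1, 15)
claim = date(2004, 9, 1)
print(in_window("radiation", dx, claim), in_window("surgery", dx, claim))
```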
The second search method expanded treatment definitions. We developed a machine learning algorithm to identify HCPCS or ICD-9 procedure codes as detailed in Bai et al. (2019). The algorithm allows us to estimate the Pointwise Mutual Information (PMI) statistic that characterizes the strength of relationship between two HCPCS or ICD-9 codes in a Medicare claim (Turney & Pantel, 2010). The PMI is the log of the ratio of the joint probability that two codes will be observed in the same claim to the probability that the codes would be observed together under independence. Software can be accessed in the Supplementary Materials section.
Before presenting the software assistant that implements the algorithm, we define PMI mathematically. Let $C$ represent a multinomial random variable denoting the HCPCS or ICD-9 value of an input code of interest, such as one of the breast cancer procedure codes identified by Silber et al. (2013). Let $C'$ be a similar multinomial variable representing codes in the same SEER-Medicare line of a claim as $C$ (i.e., close to $C$). We assume that the code at each position in the database is an independent and identically distributed variable, whether considered as an input code ($C$) or as a potential neighboring code ($C'$). Let subscripts $i, j \in \{1, \ldots, K\}$ (i.e., $C_i$ and $C'_j$) index the positions of the codes, for a total of $K$ codes in the dataset. $K$ represents the total number of codes used in the database, not the number of unique values. Let $D_{ij}$ be a random variable that takes the value 1 if the rule for determining sufficiently close is met for $C_i$ and $C'_j$, and 0 otherwise. The PMI is the log of the joint probability of observing the two codes, conditional on being in the set of codes in neighborhoods of each other (i.e., the set in which $D_{ij} = 1$), divided by the probability of observing the two codes together if they were independent, conditional upon being in the set in which $D_{ij} = 1$. By Bayes' theorem, this is also equivalent to the log of the conditional probability that $C_i$ is observed conditional on observing $C'_j$ and meeting the rule $D_{ij} = 1$ over the conditional probability of observing $C_i$ given meeting the rule. Formally, the PMI is defined as follows.

$$\mathrm{PMI}(c, c') = \log \frac{P(C_i = c, C'_j = c' \mid D_{ij} = 1)}{P(C_i = c \mid D_{ij} = 1)\, P(C'_j = c' \mid D_{ij} = 1)} = \log \frac{P(C_i = c \mid C'_j = c', D_{ij} = 1)}{P(C_i = c \mid D_{ij} = 1)}$$

Under independence, $P(C_i = c \mid C'_j = c', D_{ij} = 1)$ would be equal to $P(C_i = c \mid D_{ij} = 1)$, so PMI = log(1) = 0. If the two codes are commonly observed together, then $P(C_i = c \mid C'_j = c', D_{ij} = 1) > P(C_i = c \mid D_{ij} = 1)$ and the ratio will be greater than one, so on the log scale, PMI > 0. One can tokenize the data and then use counts within the tokenized data to estimate the PMI via the component numerators and denominators, or use a logistic regression model detailed in Bai et al. (2019).
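The count-based estimate can be illustrated with a minimal sketch, assuming each claim line has already been tokenized into a list of codes and that two codes are "close" when they share a line. It counts ordered pairs of distinct codes per line and plugs the relative frequencies into the PMI ratio. The function name `pmi_from_lines`, the de-duplication of codes within a line, and the placeholder codes `CODE_X` and `CODE_Y` are assumptions made for illustration; this is not the authors' released software.

```python
import math
from collections import Counter
from itertools import permutations


def pmi_from_lines(lines):
    """Estimate PMI for every ordered pair of distinct codes sharing a claim line."""
    pair_counts = Counter()   # n(c, c'): ordered pairs observed with D_ij = 1
    first_counts = Counter()  # n(c): ordered pairs whose first element is c
    for codes in lines:
        # Assumption: repeated codes on one line are counted once.
        for a, b in permutations(sorted(set(codes)), 2):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    total = sum(pair_counts.values())
    pmi = {}
    for (a, b), n_ab in pair_counts.items():
        # log[ (n_ab / N) / ((n_a / N) * (n_b / N)) ] = log[ n_ab * N / (n_a * n_b) ]
        pmi[(a, b)] = math.log(n_ab * total / (first_counts[a] * first_counts[b]))
    return pmi


# Toy claim lines; CODE_X and CODE_Y are placeholders, not real codes.
toy_lines = [
    ["85.42", "85.95"],
    ["85.42", "85.95"],
    ["85.42", "CODE_X"],
    ["CODE_X", "CODE_Y"],
    ["CODE_X", "CODE_Y"],
    ["CODE_X", "CODE_Y"],
]
toy_pmi = pmi_from_lines(toy_lines)
print(toy_pmi[("85.42", "85.95")], toy_pmi[("85.42", "CODE_X")])
```

In the toy data, 85.42 and 85.95 appear together more often than their marginal frequencies would predict, so their PMI is positive, while 85.42 and `CODE_X` co-occur exactly as often as independence would imply, so their PMI is zero.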
Our programs calculate the PMI and cosine similarity statistics for comparing two codes in claims data. The algorithms can be tested using one's own data or synthetic Medicare claims (Centers for Medicare & Medicaid Services, 2021). In Figure 1, we demonstrate the assistant interface. The software estimates the PMI using code counts of tokenized data and uses the word2vec method found in the Python package Gensim (Řehůřek & Sojka, 2010) to find vector representations of codes based on word2vec embeddings (Bai et al., 2019). These vectors are then used to estimate the cosine similarity statistics. In Bai et al. (2019), we previously validated the methods using SEER-Medicare data by comparing our empirically found codes to those from a clinical paper in which the expert-curated codes were published in an appendix (Bleicher et al., 2012). We found that the empirical method identified many of the same codes, but also found three codes that were not listed in the curated set.
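For readers unfamiliar with the cosine similarity statistic, the following minimal sketch shows how it is computed from two code vectors. The three-dimensional vectors are made up for illustration and are not word2vec embeddings from the study; in practice the vectors would come from a trained embedding model. The function name `cosine_similarity` and the placeholder `CODE_X` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical vectors; real embeddings would have many more dimensions.
toy_vectors = {
    "85.42": [0.9, 0.1, 0.3],
    "85.95": [0.8, 0.2, 0.4],
    "CODE_X": [-0.2, 0.9, 0.1],
}
sim_related = cosine_similarity(toy_vectors["85.42"], toy_vectors["85.95"])
sim_unrelated = cosine_similarity(toy_vectors["85.42"], toy_vectors["CODE_X"])
print(sim_related > sim_unrelated)
```

Vectors pointing in nearly the same direction give values near 1, and vectors pointing in unrelated directions give values near 0 or below.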
In the SEER-Medicare breast cancer data used for this project, we had 67,332,516 lines of claims, 240,150,032 codes, and 36,566 unique ICD-9 and HCPCS code values. Codes were used an average of 6,567 times (SD = 118,945). In estimation (i.e., "training"), we excluded infrequent codes used fewer than 50 times in the claims to reduce the computational burden due to high dimensional matrices. This removed 21,215 unique values, but only 218,341 codes from the total (218,341/240,150,032 = 0.1% of total). The codes of most interest were used much more than 50 times. The ICD-9 Code 85.95 ("Operations on the breast//Other operations on the breast//Insertion of breast tissue expander"), for example, was used 7,197 times in the dataset. After removing infrequent codes, each line of a claim had 239,931,691/67,332,516 = 3.56 codes on average. We considered codes to be close to each other, and thus possibly related, if they were on the same line of the claim. This gives approximately (3.56 choose 2) × 67,332,516 = 306,820,808 unordered pairings of codes.
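The arithmetic in this paragraph can be checked with a short sketch. All counts are taken from the text; the helper `choose_two` extends "n choose 2" to a non-integer average, which is how the approximation above appears to be computed (an assumption on our part).

```python
n_lines = 67_332_516          # lines of claims
n_codes_total = 240_150_032   # codes before filtering
n_codes_removed = 218_341     # codes belonging to infrequent values

n_codes_kept = n_codes_total - n_codes_removed
share_removed = n_codes_removed / n_codes_total
codes_per_line = round(n_codes_kept / n_lines, 2)


def choose_two(x):
    """Number of unordered pairs among x items, allowing non-integer x."""
    return x * (x - 1) / 2


# Approximate pair count: pairs per average line times the number of lines.
approx_pairs = int(choose_two(codes_per_line) * n_lines)
print(n_codes_kept, codes_per_line, approx_pairs)
```

Because the number of pairs is a nonlinear function of line length, applying it to the average line length only approximates the true count; an exact figure would sum the pairs over individual lines.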
After using our programs to estimate the PMI and cosine similarity statistics, one inputs a SEER-Medicare ICD or HCPCS code, and then the related codes with the top PMIs will be displayed. For the time period of our current work, ICD-9 codes were in common use; the assistant will also work with ICD-10 codes. We show an example for an Input Code 85.42, which indicates bilateral simple mastectomy. The ICD-9 Code 85.95 had the highest PMI of 5.34. The assistant will also display related codes with the largest cosine similarity statistics (Huang, 2008). In our case, 85.95 also had the highest word2vec cosine similarity of 0.647.
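The lookup step can be sketched as a ranking over a table of pairwise PMI values. The sketch assumes the table stores each unordered pair once; the function name `top_related` is hypothetical, and apart from the 85.42 and 85.95 pair (PMI of 5.34, reported above) the codes and values are placeholders invented for illustration.

```python
def top_related(input_code, pmi, top_n=5):
    """Return up to top_n (code, PMI) pairs related to input_code, highest PMI first."""
    related = []
    for (a, b), score in pmi.items():
        if a == input_code:
            related.append((b, score))
        elif b == input_code:
            related.append((a, score))
    # Sort by descending PMI, breaking ties alphabetically by code.
    related.sort(key=lambda item: (-item[1], item[0]))
    return related[:top_n]


# Only the first entry comes from the text; the others are placeholders.
pmi_table = {
    ("85.42", "85.95"): 5.34,
    ("85.42", "CODE_X"): 1.20,
    ("85.95", "CODE_Y"): 2.50,
}
print(top_related("85.42", pmi_table, top_n=2))
```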
For each code that Silber et al. (2013) identified, we searched for ICD-9 procedure and HCPCS codes with the largest PMI similarities and used the results to augment breast conserving therapy, mastectomy, and radiation definitions. For chemotherapy receipt in general, as opposed to specific types of chemotherapy, we repeated the process, but only used CPT codes as Silber et al. (2013) had. Then, in an augmented analysis, if a case had a code for mastectomy using either Silber et al. (2013)'s definitions or the augmented definition, we classified that case as having a mastectomy. We did not augment Silber et al. (2013)'s definitions of taxane or doxorubicin chemotherapy as the codes for these are very specific. For surgery, we used the most extensive treatment received in six months. If a woman received breast conserving therapy followed by mastectomy, then we deterministically coded surgery as mastectomy. This algorithm acknowledges that many women may have multiple procedures due to reasons such as positive surgical margins on the first lumpectomy.
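The classification logic in this paragraph can be sketched as follows, assuming the claims have already been restricted to the relevant time window. A case counts as treated if any of its codes appears in either the published or the augmented definition, and surgery is assigned to the most extensive procedure found. The code lists are toy examples: 85.42 and 85.95 appear in the text, while `CODE_BCT` is a placeholder, and the names `PUBLISHED`, `AUGMENTED`, `EXTENT`, and `classify_surgery` are hypothetical.

```python
# Toy definitions; the real code lists are much longer.
PUBLISHED = {"mastectomy": {"85.42"}, "bct": {"CODE_BCT"}}
AUGMENTED = {"mastectomy": {"85.95"}, "bct": set()}

# Larger rank means more extensive surgery.
EXTENT = {"bct": 1, "mastectomy": 2}


def has_treatment(codes, treatment, use_augmented=False):
    """True if any claim code matches the published (or augmented) definition."""
    definition = set(PUBLISHED[treatment])
    if use_augmented:
        definition |= AUGMENTED[treatment]
    return bool(definition & set(codes))


def classify_surgery(codes, use_augmented=False):
    """Return the most extensive surgery indicated by the codes, or None."""
    found = [t for t in EXTENT if has_treatment(codes, t, use_augmented)]
    return max(found, key=EXTENT.get) if found else None


# Breast conserving therapy followed by mastectomy is coded as mastectomy.
print(classify_surgery(["CODE_BCT", "85.42"]))
# A tissue expander code counts as mastectomy only under the augmented definition.
print(classify_surgery(["85.95"]), classify_surgery(["85.95"], use_augmented=True))
```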
A rationale for using expanded and empirically derived definitions to capture treatment is that there is heterogeneity in the codes providers use for reimbursement. For example, a lumpectomy could be billed as an excisional biopsy. By deriving an empirical method of finding treatment codes, we may better identify novel or unusual billing patterns. From a causal inference perspective, the use of empirically derived coding schemes could provide better control of potential confounders. We only controlled for the traditionally identified treatment variables or the augmented variables separately, not together.

Data Analysis
After forming the matched sample, we examined overall survival using Cox Proportional Hazards regressions and breast cancer specific survival using Fine and Gray (1999) proportional hazards regressions. In Table 1 In creating our augmented treatment definitions, we remained agnostic as to whether the codes truly defined the four therapies of most interest: breast conserving therapy (BCT), mastectomy, radiation therapy, or chemotherapy. Hence, we used ICD-9 code 85.95 in the augmented mastectomy definition, even though it represents "Operations on the breast//Other operations on the breast//Insertion of breast tissue expander". Although this was not directly related to mastectomy, it was the ICD-9 code most likely to be found in the same claim with ICD-9 procedure code 85.42 which indicates "bilateral simple mastectomy." It may be reasonable to assume for purposes of controlling for confounders that a woman with breast cancer who has such a code might likely have had a mastectomy.
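For breast cancer specific survival, deaths from other causes act as competing risks. As a minimal illustration of the quantity plotted in cumulative incidence curves, the sketch below computes a nonparametric cumulative incidence estimate from follow-up times and event types. It is not the Fine and Gray regression model used in the analysis, and the event coding (0 = censored, 1 = breast cancer death, 2 = other death), the function name `cumulative_incidence`, and the toy data are assumptions made for illustration.

```python
def cumulative_incidence(times, events, cause=1):
    """Nonparametric cumulative incidence of `cause` in the presence of competing risks.

    events: 0 = censored, any other integer = type of event.
    Returns a list of (time, cumulative incidence) at each time with an event.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0  # probability of being event-free just before the current time
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_here = d_any = d_cause = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_here += 1
            if data[i][1] != 0:
                d_any += 1
            if data[i][1] == cause:
                d_cause += 1
            i += 1
        if d_any:
            cif += surv * d_cause / n_at_risk   # increment for the cause of interest
            surv *= 1 - d_any / n_at_risk       # update overall event-free survival
            curve.append((t, cif))
        n_at_risk -= n_here
    return curve


# Toy data: five women followed for up to five years.
toy_times = [1, 2, 3, 4, 5]
toy_events = [1, 0, 2, 1, 0]
toy_curve = cumulative_incidence(toy_times, toy_events)
print(toy_curve)
```

With no censoring and no competing events, the estimate reduces to the simple proportion of subjects who have experienced the event by each time.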

Results
In Figure 2, we present cumulative incidence curves of breast cancer specific mortality within five years of diagnosis. Similar to Silber et al. (2013), we found that Black women had higher mortality than White women after matching only on a limited number of demographic variables available in the SEER data (Figure 2a). After matching on demographic and presentation variables, the survival difference between Black and White women was largely attenuated (Figure 2b), but the difference was still statistically significant. However, our curves may suggest a greater narrowing of differences when adjusting for presentation variables than those of Silber et al. (2013) did. After further adjusting for treatments using the definitions of Silber et al. (2013) (Figure 2c), the difference in survival became less marked. When using our augmented definitions, the curves overlap, and the difference in the cumulative incidence of breast cancer death between Black and White women is largely eliminated (Figure 2d).
Our overall survival findings were similarly congruent with those of Silber et al. (2013). Racial survival differences still persist after matching on demographic variables (Figure 3a). Again, the overall survival differences persist but are greatly reduced after controlling for presentation (Figure 3b) and treatment variables (Figure 3c). After controlling for the augmented treatment differences, the survival curves for White and Black women are almost completely overlapping (Figure 3d). Hence, using augmented treatment definitions with pre-2006 data suggests that the residual effect of race on survival after further controlling for presentation and treatment has been eliminated. Overall, these findings demonstrate substantial differences from the pre-2006 era reported by Silber et al. (2013).
In Supplementary Figures 2 and 3 (see the Supplementary Materials section), we replicate the analyses, but use 2006-2013 data, which is for a later time period than reported by Silber et al. (2013). We find differences from the earlier period. For breast cancer specific survival, we found that differences were largely eliminated after controlling for presentation variables (see Supplemental Figure

Discussion
Racial disparities in cancer survival outcomes have been of interest to researchers. Many simply describe the difference without providing sufficient analytic details that could explain causal mechanisms (e.g., Wheeler et al., 2013). Others only have access to a limited number of variables that can be used to control for confounding between Black and White women, such as studies that rely on SEER without linked Medicare data (Aizer et al., 2014; Iqbal et al., 2015). In the statistical causal inference field, many have argued that race should not be studied without consideration of the variables or societal attitudes that can cause differences among racial subgroups (Greiner & Rubin, 2011). Without accounting for potentially confounding variables, examination of racial differences can potentially exacerbate negative attitudes about race and hinder targeted efforts to end discrimination.
By using linked SEER-Medicare data, it is possible to examine confounders and better isolate reasons that racial differences in outcomes persist. The findings of Silber et al. (2013) provided valuable information that much of the difference in racial outcomes in breast cancer could be explained by presentation differences prior to 2006. The clinical stage and aggressiveness of the disease at diagnosis seemed to be driving health disparities. We similarly found that differences were largely attenuated after controlling for presentation variables, and the addition of traditionally derived treatment variables did not further change relationships of race with outcomes. By contrast, we found that the addition of augmented treatment variables closed the gap between the survival curves and indicated that there were no differences between groups after controlling for a more expansive list of HCPCS and ICD-9 codes.
One major difference between our work and others' work is that we did not screen our codes to determine if the related codes found during the augmentation process were truly reflective of the treatment categories into which they were grouped. Our algorithm would hence not be appropriate for investigations of treatment effects in which a treatment must be well defined, such as that undertaken by Petito et al. (2020). Our approach seems most appropriate in studies that seek to remove the confounding effect of variables. Indeed, treatment effects were not a primary interest of our paper, but controlling for the impact that they can have on inferences about how race impacts breast cancer differences was a goal. Hence, while our augmented treatment groups may not be interpretable as internally consistent treatment groups, they did capture broad ways in which the nomenclature concerning treatment can vary within claims.
Our method also combines subject matter expert evaluation of relevant claims codes with a more algorithmic approach to grouping codes. Often, the identification of relevant confounders in high dimensional data is seen as a choice between using subject matter experts to narrow down the codes into relevant groupings and using machine learning or Bayesian approaches to empirically select relevant codes (e.g., Spertus & Normand, 2018). Our approach combines clinical expertise with an empirical approach to categorize codes into intervention groups.
One limitation to our work is that the treatment groups we created were not necessarily meaningful. That is, some ancillary treatments, such as reconstruction, were grouped with mastectomy codes. Hence, our method might not be appropriate for investigating intervention effects in which the intervention itself is of interest. While we did provide preliminary validation of our empirically derived codes compared to human expert curation (Bai et al., 2019), additional studies are needed to more rigorously validate our algorithms.
The fact that the curves narrowed substantially after creating the augmented definitions suggests that there could have been racial differences in treatments chosen or how procedures were coded for billing purposes. Another possibility is that there were racial differences in the sequence of therapy. Many women who choose breast conserving therapy may need repeated operations due to findings such as positive surgical margins (Morrow et al., 2009). It is possible that the algorithmic approach was better at classifying treatments into groups that better captured such practice patterns. Future research can investigate why coding algorithms may differ between Black and White women and hence confound any differences in outcomes by race. It could be that there are true racial differences in treatments received, or it could be that similar treatments tend to be coded for billing and claims purposes differently between Black and White women.
We also found that the racial differences appeared diminished after controlling for presentation in the more recent data from 2006 to 2013. Although we did not formally test for temporal differences, many new treatments have been approved since 2006 that could potentially have affected survival, particularly for those with advanced disease (Cortazar et al., 2012). As this included a population eligible for Medicare, the introduction of U.S. prescription drug coverage through Medicare Part D in 2007 might also have improved access to prescription therapies in the later period.
Our method is similar to emerging hybrid artificial intelligence (AI) approaches that augment, rather than replace, human expertise with machine learning (Zheng et al., 2017). We used expert derived codes augmented by empirically found codes to better capture potential confounding between disease groups. In the case of SEER-Medicare data, hybrid AI is useful due to the large number of ICD-9/10 and HCPCS codes. There are often multiple ways to code the same event for Medicare reimbursement purposes. For example, excisional biopsy and lumpectomy may be used to describe the same tumor removal procedure. In such cases, hybrid-AI might assist researchers in identifying patterns of claims that reflect equivalent procedures. In the context of propensity score analyses, hybrid-AI can help expand the number of confounders used in adjustment. Besides claims data, hybrid-AI has been used in propensity score based analyses of geographic information system (GIS) data (Monlezun et al., 2021).
In conclusion, we proposed an application of a machine learning algorithm that uses the pointwise mutual information statistic to identify related codes when using Medicare claims data. By using this algorithm, we were able to control for a wider range of treatment patterns that may differentially affect survival between Black and White women with breast cancer. Similar to previous estimates, we found that presentation differences between Black and White women closed much of the estimated survival curve gap. However, it is possible that treatment differences identified by our application could further explain racial differences in outcomes. Future work will be necessary to better explore the specific differences that may be contributing to health disparities.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.