With growing legal and scientific evidence for the importance of reducing model bias, both model developers and deployers need tools to quantify that bias. Unfortunately, algorithmic bias can take as many forms as there are implementations. In this talk, Paul M. Heider covers a range of clinical NLP use cases, such as de-identification and diagnosis prediction, highlighting the utility of behavioral testing and comparative evaluation methods for identifying the scope of a model’s bias. These approaches can be applied at the training, testing, and evaluation stages, benefiting both researchers doing de novo model development and community members tasked with choosing among multiple third-party models to deploy.
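
As a minimal illustration of the kind of behavioral testing discussed in the talk, the sketch below runs a name-invariance check against a de-identification model. The `deidentify` function here is a hypothetical stand-in for whatever in-house or third-party model is under evaluation; the test simply asks whether the model masks patient names equally well regardless of which name appears in the note.

```python
# Minimal sketch of a behavioral (perturbation) test for a de-identification
# model. `deidentify` is a hypothetical placeholder for the model under test.

def deidentify(text: str) -> str:
    """Hypothetical de-identification model: masks names it recognizes."""
    known_names = {"John Smith"}  # placeholder logic for the sketch only
    for name in known_names:
        text = text.replace(name, "[NAME]")
    return text


def name_invariance_test(template: str, names: list[str]) -> dict[str, bool]:
    """Check whether masking behavior is invariant to the substituted name.

    A model that masks some demographic groups' names more reliably than
    others exhibits exactly the kind of bias this test is meant to surface.
    """
    results = {}
    for name in names:
        note = template.format(name=name)
        results[name] = "[NAME]" in deidentify(note)
    return results


if __name__ == "__main__":
    template = "Patient {name} presented with chest pain on 03/12."
    # Names chosen to probe demographic sensitivity; extend as needed.
    outcomes = name_invariance_test(template, ["John Smith", "Nguyen Thi Lan"])
    for name, masked in outcomes.items():
        print(f"{name!r}: masked={masked}")
```

The same harness can be pointed at several candidate third-party models, turning the invariance results into a comparative evaluation when deciding which model to deploy.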