In vision you have visible features like skin tone, hair length, jewelry, things like that, which correlate with different demographic groups. In spoken language you don't have that, but you do have regional accents and dialects. So which groups you're going to audit for, for lack of a better term, and what kinds of datasets you're going to curate to do that assessment, can vary radically from service to service.

There was debate about whether we should release information like, "On this service, the worst performing group among these demographic groups was such and such group." And I think there are two very good reasons that, in the end, we decided not to do that at this point.

One is that the honest scientific truth is that the identity of the group with the worst performance, and what that worst performance looks like, can vary radically from dataset to dataset. It really can be the case that, just on the problem of speech recognition, across different benchmark datasets the group with the best performance and the group with the worst performance can completely change from one dataset to another.

The other comment worth making, and this is a lesson I've learned at Amazon, is that the vast majority of fairness notions in the scientific literature on the topic essentially adopt some kind of equalization-of-harm notion. It's like: okay, we're building a model for consumer lending. We think the biggest harm is a false negative: I predicted that you will not repay a loan, and so I don't give it to you, whereas in fact you were creditworthy and would have repaid it. And so then I settle on something like: across these different combinations of racial and gender groups, I want to equalize the false rejection rate.

We don't think that way within AWS, and the reason is a couple-fold. First of all, it can just be the case that some groups present a greater challenge on a particular problem than other groups. And if you insist on equalizing rates of harm across different groups, it could be that the only way you can achieve that is to deliberately do worse on the groups you're doing better on, in order to raise their rate of harm up to match that of the worst performing group.

What's an example of that?

I think the simplest example would be that, and it may not always be this way, but in general, things like facial hair and sunglasses present a challenge to face recognition, because there's some kind of occlusion of your underlying facial structure. That may not always be the case; maybe at some point we'll figure out ways of detecting bone structure better that would let us see through facial hair. But to the extent that it makes sense to people that right now facial hair makes face recognition more difficult, if there's a culture or a demographic group in which facial hair is common, it's going to be a harder challenge from a scientific standpoint.
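To make the false-rejection framing and the dataset-to-dataset variability above concrete, here is a minimal Python sketch. The group labels, predictions, and benchmark names are entirely made up and do not come from the service cards; the point is only to show how a per-group false rejection rate is computed and how the worst-performing group can flip between two evaluation sets.

```python
# Minimal sketch with hypothetical data: per-group false rejection rates
# computed on two evaluation datasets, showing that the worst-performing
# group can change from one dataset to another.
from collections import defaultdict

def false_rejection_rate(records):
    """Fraction of truly positive examples that the model rejected."""
    rejected = sum(1 for r in records if r["label"] == 1 and r["pred"] == 0)
    positives = sum(1 for r in records if r["label"] == 1)
    return rejected / positives if positives else float("nan")

def per_group_frr(dataset):
    """Group the evaluation records by demographic attribute and score each group."""
    groups = defaultdict(list)
    for r in dataset:
        groups[r["group"]].append(r)
    return {g: false_rejection_rate(recs) for g, recs in groups.items()}

# Each record: demographic group, ground-truth label, model prediction.
benchmarks = {
    "benchmark_1": [
        {"group": "A", "label": 1, "pred": 1},
        {"group": "A", "label": 1, "pred": 0},
        {"group": "B", "label": 1, "pred": 1},
        {"group": "B", "label": 1, "pred": 1},
    ],
    "benchmark_2": [
        {"group": "A", "label": 1, "pred": 1},
        {"group": "A", "label": 1, "pred": 1},
        {"group": "B", "label": 1, "pred": 0},
        {"group": "B", "label": 1, "pred": 1},
    ],
}

for name, data in benchmarks.items():
    rates = per_group_frr(data)
    worst = max(rates, key=rates.get)
    print(name, rates, "worst group:", worst)
    # benchmark_1 -> worst group A; benchmark_2 -> worst group B
```

With the same model, the worst-performing group is A on one benchmark and B on the other, which is the variability being described.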
So rather than saying success is when we equalize the error rates across groups, the view we adopt instead is that our goal is to make every group's error rate as small as we possibly can, even if that means we can't equalize all of them. We don't want to do the thing that is nonsensical from a product and performance standpoint of, in the interest of some academic notion of fairness, deliberately doing worse on some group. And the technical work that goes on under that, of course, involves finding out what your worst performing group is, and usually, not always, but usually, the best solution to get an improvement on that group is to go out and get better and more data for that particular group.

But it's because of these two reasons, that we don't think in equalization terms, and that the worst and best performing groups can change radically from dataset to dataset, that we give high-level guidance on what the worst performing group's number was, without saying which specific group saw that number.

You could have provided additional information and specified the dataset. Why did you choose not to do that?

Yeah, there was also a healthy internal debate about how much to say about datasets, and in the initial cards that we're releasing today, we say relatively little about that. Part of that is because many, many datasets go into the training of our models, as well as the assessment, and quite often there are many more assessment datasets than training datasets, or at least they're deliberately designed to be different, because you're essentially trying to do stress tests of models. You normally would expect to get very good performance on the type of data that you trained on, but when you start stress testing different use cases, things will deliberately look worse.

I think there was also a fear rooted in how much goes into the training of an ML model, and your technical viewers will know this. The cartoon view of machine learning is that it's a very streamlined, almost button-pushing process: I get a dataset, I push some button in PyTorch, out comes my model, and great. But I don't think I'm giving away any big secret, at least among the scientists, in telling your viewers that the amount of artisanal tinkering that goes into modern machine learning is just mind-boggling, and in many ways it has actually increased with the rise of deep learning: what's the architecture, how deep is the network, how wide, what exactly are the different activation units, what is the architecture between layers, do you have convolutional units, etc., etc. And the honest truth is that even though we have rigorous and effective train/test methodology, the way the soup is made involves a lot of trial and error, and so releasing information about just the datasets, without that broader context, felt incomplete.
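As a rough illustration of the workflow described above (identify the worst-performing group, then improve it with better and more data rather than handicapping the groups that already do well), here is a small Python sketch. The group names, error numbers, threshold, and the `next_data_collection_target` helper are all hypothetical, not part of any AWS tooling.

```python
# Minimal sketch, hypothetical throughout: decide where to invest more data,
# given per-group error rates on an evaluation set. The remedy is targeted
# improvement of the worst group, never deliberate degradation of the best.

def next_data_collection_target(per_group_error, goal=0.02):
    """Return the group whose error still exceeds the per-group goal, or None."""
    worst_group = max(per_group_error, key=per_group_error.get)
    return worst_group if per_group_error[worst_group] > goal else None

errors = {"group_A": 0.015, "group_B": 0.045, "group_C": 0.022}
print(next_data_collection_target(errors))  # -> "group_B": collect more data there
```

In practice this would sit inside a loop: gather more and better data for the flagged group, retrain, re-evaluate on the stress-test sets, and repeat until every group's error is as low as you can drive it.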