{"id":3404,"date":"2016-01-19T07:52:10","date_gmt":"2016-01-19T15:52:10","guid":{"rendered":"http:\/\/www.cloudidentity.com\/blog\/?p=3404"},"modified":"2016-01-19T07:53:55","modified_gmt":"2016-01-19T15:53:55","slug":"estimating-gender-diversity-in-your-organization-with-the-azure-ad-graph","status":"publish","type":"post","link":"https:\/\/www.cloudidentity.com\/blog\/2016\/01\/19\/estimating-gender-diversity-in-your-organization-with-the-azure-ad-graph\/","title":{"rendered":"Estimating Gender Diversity in your Organization with the Azure AD Graph"},"content":{"rendered":"<p><a href=\"https:\/\/www.cloudidentity.com\/blog\/wp-content\/uploads\/2016\/01\/image.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/www.cloudidentity.com\/blog\/wp-content\/uploads\/2016\/01\/image_thumb.png\" alt=\"image\" width=\"567\" height=\"540\" border=\"0\" \/><\/a><\/p>\n<p>Your directory data holds a treasure trove of insights, which are now exceptionally easy to access thanks to the Graph API layer on top of it.<\/p>\n<p>A few weeks ago, I was wondering what question I could answer about my own organization with the Directory Graph alone, and I quickly landed on a great candidate: gender ratios.<\/p>\n<p>I *love* working in our industry, but if there\u2019s one thing I loathe, it is the dramatic gender imbalance that is almost an absolute invariant everywhere I look. 
In my past as a consultant I often worked for non-IT shops, where the ratio wasn\u2019t as skewed, and loved the atmosphere \u2013 I can\u2019t quite put my finger on it, but activities and collaboration seemed to take on a more balanced quality\u2026 <em>healthier<\/em> is the adjective that comes to mind.<\/p>\n<p>I know that the industry leaders are hard at work to correct this and many other imbalances, and the growing awareness of the problem gives me hope for the future. Still, I thought it would be fun to play with data and verify whether some conjectures of mine were actually backed by numbers: are female managers more likely to lead orgs with more balanced gender ratios? Is it true that non-technical disciplines have better ratios? Are there specific business functions where the ratios are reversed? And so on, and so forth.<\/p>\n<p>It goes without saying that I will NOT be sharing any data <em>whatsoever <\/em>about Microsoft here. Besides the fact that the estimates might be wildly inaccurate, it is absolutely not my place to talk about the company. Microsoft does an <a href=\"https:\/\/www.microsoft.com\/en-us\/diversity\/inside-microsoft\/default.aspx#fbid=q54_hg_gxF0?epgDivFocusArea\">excellent job in maintaining transparency on workforce demographics<\/a>.<\/p>\n<p>What I am going to share, however, is the <a href=\"https:\/\/github.com\/vibronet\/GenderMixEstimator\/tree\/master\">source code<\/a> and the methodology I used for running my little experiments. If you have Office 365 or any other Microsoft cloud service, you can run the code in your organization and get back an estimate of the gender mix of the reports of any user of your choice. You just need to download the code, compile and run \u2013 that works even if you are not an administrator. 
For the time being you need Windows, but if there\u2019s interest I might port the app to .NET Core \u2013 which would allow you to run on Mac and Linux as well.<\/p>\n<p>Let me stress that I make no guarantees about the precision of the resulting estimates, nor do I guarantee that my code is bug-free. Please consider this simply as the chronicle of an afternoon spent geeking out with Azure AD, and my modest contribution to raising awareness around gender imbalance in tech.<\/p>\n<p>Ready? Let\u2019s dive!<\/p>\n<h2>The methodology<\/h2>\n<p>Where to begin? Let\u2019s see. With the Directory Graph, I know I can crawl through the entire report structure of anybody. I can get the User object of the manager of the org I want to analyze, for example via his\/her userPrincipalName: then I can recursively analyze all the User objects in the \/directReports property of all subtrees. So I can get all the users in the entire sub-org &#8211; that part is easy.<\/p>\n<p>How to tell the gender of a User? There is no Gender property readily available in the default schema.\u00a0 The obvious alternative is\u2026 the user\u2019s first name, naturally. We do have that, under the \/GivenName property.<\/p>\n<p>But wait, it\u2019s not that simple! There are certain names that are male in some countries, and female in others. For example, Andrea is <a href=\"http:\/\/www.paginainizio.com\/nomi\/nomidiffusi.php\">THE most common name<\/a> for a boy in Italy. At the same time, it is a super common female name in Germany \u2013 <a href=\"http:\/\/www.firstnamesgermany.com\/common-german-names\/\">in the 60s it was in the top 10<\/a>. Hence, we had better include country information in our estimate of gender from first name. The \/country property exists in the Directory Graph, but it is often not populated. However, there is another country-dependent property that is way more likely to be populated, and that\u2019s telephoneNumber. 
Never mind that Americans often tend to omit the country code! <img decoding=\"async\" class=\"wlEmoticon wlEmoticon-smile\" style=\"border-style: none;\" src=\"https:\/\/www.cloudidentity.com\/blog\/wp-content\/uploads\/2016\/01\/wlEmoticon-smile.png\" alt=\"Smile\" \/><\/p>\n<p>Let\u2019s take a step back and see where we\u2019ve got so far. Our estimate of whether a User is male or female is going to be based on the User\u2019s GivenName and Country(telephoneNumber). Sounds promising, but this is by no means perfect.<\/p>\n<p>Even if we manage to obtain the country information from the User\u2019s telephone number, all we are getting is the country from which the user is operating. An Andrea working in the USA might be a male migrant from Italy, or a woman from a second-generation German migrant family. Any estimate should take into account which cases are most prevalent for each name and given country, which brings us straight into the realm of frequencies (and saddles us with the task of finding a source of info for those figures).<br \/>\nCompound this with the fact that there are some names which stubbornly defy classification: Robin, Kim, Casey, Yi, Rama, Jamie and so on. Those cases, too, point to the need to base our estimates on probabilities rather than static classifications.<\/p>\n<p>Here\u2019s the idea that made me feel very clever, at least for a few minutes: what if I just crawled Facebook\u2019s public data and built a database of first names per country, tracking the frequency with which users self-declare their gender? 
That would by no means carry any guarantee of significance, but it would definitely be an improvement over static analysis!<\/p>\n<p>Hitting the internet for inspiration, I soon stumbled on <a title=\"https:\/\/genderize.io\/\" href=\"https:\/\/genderize.io\/\">https:\/\/genderize.io\/<\/a> \u2013 an awesome public service that already did the crawling for us, across major social networks, and helpfully exposes its database through a super convenient API. The API offers a generous allowance of 1000 free name queries per day. I wanted to do some heavy duty work (and a lot of debugging <img decoding=\"async\" class=\"wlEmoticon wlEmoticon-smile\" style=\"border-style: none;\" src=\"https:\/\/www.cloudidentity.com\/blog\/wp-content\/uploads\/2016\/01\/wlEmoticon-smile.png\" alt=\"Smile\" \/>), hence I bought the PLUS package (and I am now super worried about looking silly by leaking the key on GitHub!) but I am sure you can do a lot of experimentation with the free tier.<\/p>\n<p>The API also offers a helpful probability associated with each gender estimate, plus the number of entries from which the estimate was extracted. That allows you to place confidence thresholds in your own evaluations if you so choose. Here\u2019s an example. Say that we want the estimate for Andrea in Italy. 
The call is very simple:<\/p>\n<div id=\"scid:C89E2BDB-ADD3-4f7a-9810-1B7EACF446C1:a27c96c2-d8ac-4ae2-ad6e-4624921aaefa\" class=\"wlWriterEditableSmartContent\" style=\"float: none; margin: 0px; display: inline; padding: 0px;\">\n<pre style=\"white-space: normal;\">[sourcecode language='text' ]\r\nGET https:\/\/api.genderize.io\/?name=andrea&amp;country_id=it\r\n[\/sourcecode]\r\n<\/pre>\n<\/div>\n<p>The response looks like the following:<\/p>\n<div id=\"scid:C89E2BDB-ADD3-4f7a-9810-1B7EACF446C1:02bfb5cb-f1d9-49b1-9460-932869f8ddde\" class=\"wlWriterEditableSmartContent\" style=\"float: none; margin: 0px; display: inline; padding: 0px;\">\n<pre style=\"white-space: normal;\">[sourcecode language='javascript'  padlinenumbers='true']\r\n{\"name\":\"andrea\",\"gender\":\"male\",\"probability\":\"0.99\",\"count\":1070,\"country_id\":\"it\"}\r\n[\/sourcecode]\r\n<\/pre>\n<\/div>\n<p>That\u2019s a pretty high estimate! It\u2019s not a 1.0 probability, given that we do have various Andreas from Germany or other countries living in Italy \u2013 I know a few myself. Let\u2019s check Andrea in the US, though:<\/p>\n<div id=\"scid:C89E2BDB-ADD3-4f7a-9810-1B7EACF446C1:0fa46314-7477-4b08-937c-324168e8b547\" class=\"wlWriterEditableSmartContent\" style=\"float: none; margin: 0px; display: inline; padding: 0px;\">\n<pre style=\"white-space: normal;\">[sourcecode language='javascript' ]\r\n{\"name\":\"andrea\",\"gender\":\"female\",\"probability\":\"0.97\",\"count\":2308,\"country_id\":\"us\"}\r\n[\/sourcecode]\r\n<\/pre>\n<\/div>\n<p>Yep. It looks like the \u201cAndrea\u201d in the US are mostly from countries where the name is female\u2026 or at least, that\u2019s what people self-declare on social networks. Of course, to be precise we should also take into account any gender differences in social network usage\u2026 and compare the count to the total sample per country. 
But let\u2019s not go too far <img decoding=\"async\" class=\"wlEmoticon wlEmoticon-smile\" style=\"border-style: none;\" src=\"https:\/\/www.cloudidentity.com\/blog\/wp-content\/uploads\/2016\/01\/wlEmoticon-smile.png\" alt=\"Smile\" \/> what we have now seems good enough to play with.<\/p>\n<p>Before I move on to describe the (simple) app that implements the above, I want to call out one last detail. Parsing phone numbers is a surprisingly complicated task, given that every country has its own formatting rules. Luckily there are a number of libraries that can help, the most famous probably being Google\u2019s <a href=\"http:\/\/code.google.com\/p\/libphonenumber\/\">libphonenumber<\/a>. Patrick M\u00e9zard and Aidan Bebbington nicely <a href=\"https:\/\/github.com\/aidanbebbington\/libphonenumber-csharp\/tree\/master\">ported it to C#<\/a> and made it available as a <a href=\"https:\/\/www.nuget.org\/packages\/libphonenumber-csharp\/\">NuGet<\/a> package, which I promptly used in the project. This package is the main reason I did not write this directly in .NET Core \u2013 out of laziness, really. But, if there\u2019s interest we can always course-correct and port it!<\/p>\n<h2>The app<\/h2>\n<p>The <a href=\"https:\/\/github.com\/vibronet\/GenderMixEstimator\/tree\/master\">application is a simple console app<\/a>, which can be found in this repo. At launch, it takes as input the userPrincipalName (which can be different from the email, beware) of the manager whose org you want to analyze. If it\u2019s the first time you launch the app (or you haven\u2019t run it in a while) you\u2019ll get prompted for credentials \u2013 make sure you use an account that belongs to the directory you want to work with. I chose the console app format for 2 reasons:<\/p>\n<p>&#8211; It can be modeled as a native client, which is automatically multitenant and can access the directory as the signed in user. 
That means that the app requires no setup in your directory, and does NOT require an admin to function. That\u2019s pretty much equivalent to a user running an LDAP read query in a classic on-prem AD. Modeling the app as a multitenant web app would have required admin consent for gaining directory read rights, which would have greatly limited the number of people that can run this analysis.<\/p>\n<p>&#8211; It has no UX requirement, which makes it runnable on headless boxes \u2013 and above all, makes it easy to port to Linux and Mac via .NET Core. You are going to do your analysis in Excel anyway, hence even if I had thrown in a couple of pie charts I would not have added much real value to the insights you can get from the textual output.<\/p>\n<p>Once it gets a valid token, the app caches it for future use (in a file, token.dat, that can be decrypted only on the machine it\u2019s been generated on \u2013 but be careful with it anyway). Then it passes it to a factory for the class Account, a wrapper that uses the Graph API to retrieve the first name and telephone number (hence, the country) of the manager.<\/p>\n<p>That done, the app calls the Account class method GetGenderMix \u2013 which<\/p>\n<ul>\n<li>Assesses the gender of the Account, by passing the first name and the country to genderize.io. This is all done via a proxy class, GenderizeProxy, which handles indefinite cases and, above all, caches results so that subsequent estimates of a known name-country pair won\u2019t result in a network hit. Before doing that I verified that the ToS of genderize.io do not prohibit that, and those guys are <a href=\"https:\/\/store.genderize.io\/faq\">super chill<\/a> about any use. 
Of course I would not do this if the cache were shared across multiple users, as would be the case in a web app, but here every machine running the console app would build its own cache \u2013 hence it still seems fair.<\/li>\n<li>Retrieves all the reports of the current Account via the Graph API and the navigation property \/directReports. Then, calls itself on each report.<\/li>\n<li>Once the recursive calls have exhausted their run, aggregates the gender figures in the Males\/Females\/Undefined\/Contacts accumulators. \u201cUndefined\u201d is anything that didn\u2019t lead to an estimate, whether because genderize.io didn\u2019t have any record of the name or because the estimate quality didn\u2019t meet the bar (see below about thresholds). \u201cContacts\u201d is a special case in which a Contact entity (as opposed to a User) is returned as a report \u2013 given that the properties there are different and the case isn\u2019t all that frequent, I don\u2019t use it in the final male\/female tallies.<br \/>\nNote that, because of the way the classes are structured today (an access token is passed in, instead of being obtained via AcquireToken* in Account), the execution time of GetGenderMix cannot exceed 1 hour, the validity window of an Azure AD-issued access token. 
Again, an easy fix that I didn\u2019t put in out of sheer laziness and desire to see things working ASAP <img decoding=\"async\" class=\"wlEmoticon wlEmoticon-smile\" style=\"border-style: none;\" src=\"https:\/\/www.cloudidentity.com\/blog\/wp-content\/uploads\/2016\/01\/wlEmoticon-smile.png\" alt=\"Smile\" \/><\/li>\n<\/ul>\n<p>One thing to notice about GenderizeProxy is that it allows you to specify confidence thresholds, like ProbabilityThreshold (setting the confidence level below which an estimate will be considered indefinite instead of the proposed gender guess) and CountThreshold (establishing the minimum number of entries from social networks an estimate must be based on to be considered reliable). Playing with those thresholds can change the numbers pretty significantly, although in my experience the ratios often remain surprisingly stable.<\/p>\n<p>Once GetGenderMix has finished, the app spits out some generic results on the console, and saves a CSV file with all the gender mixes for all the managers in the org you analyzed. Just double-click on it, and have a ball in Excel to find correlations and interesting numbers (hint: I found AVERAGEIF super useful).<\/p>\n<h2>Give it a Spin<\/h2>\n<p>As should be abundantly clear at this point, this <a href=\"https:\/\/github.com\/vibronet\/GenderMixEstimator\/tree\/master\">little app<\/a> is by no means guaranteed to offer a precise assessment of the gender mix in your organization. It certainly doesn\u2019t hold a candle to what your HR already knows with zero uncertainty margin, and always keeping that in mind is certainly a healthy thing.<\/p>\n<p>That said, I *love* how empowering this thing is. As mentioned, I personally used it for verifying some theories I had \u2013 in some cases they did pan out, in some others I was surprised to be proven wrong, but the metapoint here is that they got me to <em>think<\/em> about the problem. 
I hope this will get you to think more deeply about gender imbalance too \u2013 and if you learn how to play with Azure AD in the process, I can\u2019t say I\u2019ll be disappointed <img decoding=\"async\" class=\"wlEmoticon wlEmoticon-winkingsmile\" style=\"border-style: none;\" src=\"https:\/\/www.cloudidentity.com\/blog\/wp-content\/uploads\/2016\/01\/wlEmoticon-winkingsmile.png\" alt=\"Winking smile\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Your directory data holds a treasure trove of insights, which are now exceptionally easy to access thanks to the Graph API layer on top of it. A few weeks ago, I was wondering what question I could answer about my own organization with the Directory Graph alone, and I quickly landed on a great&#8230;<\/p>\n","protected":false},"author":1,"featured_media":3401,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3404","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/posts\/3404","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/comments?post=3404"}],"version-history":[{"count":2,"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/posts\/340
4\/revisions"}],"predecessor-version":[{"id":3407,"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/posts\/3404\/revisions\/3407"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/media\/3401"}],"wp:attachment":[{"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/media?parent=3404"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/categories?post=3404"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.cloudidentity.com\/blog\/wp-json\/wp\/v2\/tags?post=3404"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}