DataFramed


#5 Data Science, Epidemiology and Public Health

ABOUT THIS EPISODE

Maelle Salmon, a data scientist who has worked in public health, both in infectious disease and environmental epidemiology, joins Hugo for a chat about the role of data science, statistics and data management in researching the health effects of air pollution and urbanization. In the process, we'll dive into the continual need for open source toolbox development, open data, knowledge organisation and diversity in this emerging discipline.

In this episode of DataFramed, a DataCamp podcast, we'll be looking at how data science is impacting epidemiology. I'll be speaking with Maëlle Salmon, a data scientist who has worked in public health, both in infectious disease and environmental epidemiology. We'll be talking about the role of data science, statistics and data management in researching the health effects of air pollution and urbanization. In the process, we'll dive into the continual need for open source toolbox development, open data, knowledge organisation and diversity in this emerging discipline. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed. Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems data science can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and you can also follow DataCamp at @DataCamp. You can find all of our episodes and show notes at datacamp.com/community/podcast. Hey Maëlle, welcome to DataFramed. Hi, thanks for having me. It's such a pleasure to have you on the show. I'm really excited to be delving into the role of data science in epidemiology, thinking about the R stats community and the role that the R community plays in data science in general, today with you. But first I'd like to find out a bit about you. How did you get into data science originally? Well, I've always liked maths, but early on I was really interested in biology applications of maths, which is why I did a biology major, and I really thought I would be, you know, a lab person. But then I did some lab work and I hated it, so I realized I'd rather spend more time in front of a computer, and that's how I discovered epidemiology, which is a way to work on life sciences applications, use a lot of maths and not do lab work.
When you were doing biology, you realized that, you know, pipetting and spending time in the lab wasn't what you enjoyed the most, so you started...

...programming and doing data analysis and that type of stuff. Yeah, exactly, and then this applied to epidemiology. And after doing that, after doing my master's in public health, I looked for a PhD, and I thought it would be an epidemiology PhD, but it was actually a PhD in statistics, which I hadn't done a lot of before, because I had mostly used mathematical modeling in my internships and research projects. I was curious about statistics and it actually worked well. So I got my PhD, and after that I looked for new positions, and the one I was very interested in, and that I got, was a data manager and statistician position. I wasn't a data manager at all, but that's when I learned data management, and I guess with statistics plus data management, that's when I became a data scientist for real. That's cool, because I don't know if you know much about where I came from, but I did my postdoc in a cell biology lab. My background's in pure math; I started doing applied math and I was hired as a postdoc ostensibly to do applied math and mathematical modeling, but I got there and realized I needed to start doing a lot of data analysis, a lot of statistics, and that's when I learned R and Python as well, formally. So there's a similar trajectory there. So you didn't originally think of yourself as a data scientist, right? No, I didn't. I thought of myself as a statistician, because I had worked so hard to get my PhD in statistics, you know, so getting this statistician title was important for me.
But then I did a lot of data wrangling, and for International Women's Day, David Robinson sent out a series of tweets, kind of an advertisement for women in data science and statistics, and he put my name in a tweet about data scientists. At first I was very happy and thankful for that, but also very surprised, and that's the first time I realized, oh yes, I'm a data scientist. And actually you have a lot...

...in common with Dave, and one thing I admire about both of you is how much you like to communicate around data science and to blog. So maybe you can tell us a bit about that. Yeah, I think I started blogging because I had all these small projects I had done for fun, but they were just GitHub repos, and in some cases I had tweeted visualizations, but I had never taken the time to really explain how I had done these things, and I thought it would be great to blog about it, to take these small projects and put them into blog posts. And actually I was very inspired by what Dave did and also what Julia Silge did, as R bloggers. I really wanted to have fun blogs like they do, so they were my inspiration. So let's talk about epidemiology. What is epidemiology? Well, a lot of things. Officially, it's when you study the determinants of disease. So what causes disease? And also how determinants and disease are distributed in a population. So it's a science, but it's a very applied one, because the goal of doing all these studies is to control disease, to improve the health of a population, and this population may be a human population, an animal population, a plant population. So you're really trying to improve the health of a living thing, but not necessarily humans. So there are different types of epidemiology. For example, infectious disease epidemiology, to my understanding, versus non-infectious disease epidemiology. Can you speak to this distinction? Yeah, exactly. So sometimes when people say epidemiology, they only think of the movie Contagion. That's infectious disease epidemiology, where you're looking at, I don't know, Ebola or salmonella, those kinds of diseases. But there is also another kind of epidemiology, which looks at, for instance, cardiovascular disease, which is definitely not infectious but which still needs to be studied.
So when you do epidemiology you might be doing any of these things, and you might be doing this for humans or not for humans. So it's a very varied field. Yeah,...

...and I suppose obesity is another example, right? Yeah, and I'm glad you mention that, because I was once told it's a boring issue. When I started being interested in epidemiology, I needed to talk with a professor at my university who was in charge of helping me with my curriculum, and when I told this person about epidemiology, this person told me, oh yes, do that, but don't do epidemiology of obesity because it's a boring issue. That's what this person told me. I was quite shocked. And it's not true. There is no boring issue in epidemiology at all. There is nothing that is completely figured out. And if you're not interested in obesity, that's fine, because there are a lot of other things to study. Yeah, and so something I'm not quite sure if I'm correct about, but my understanding is that even with non-infectious epidemiology there can be network effects, for example with obesity. If you're in a social group with a certain number of obese people, there's a higher probability of you being obese. Yeah, exactly. So epidemiology is a very interesting field because it's not only about biology, it's also about social factors, for instance, that can influence your health. So it's very varied in that sense. And it seems like it would be a very motivating place to work, because you're doing something to improve people's, or other organisms', health, right? Yeah, exactly, that's what I think. It's very motivating because you're doing something good, or you're trying to at least, to improve global health, the health of humans and the health of animals. So it's very motivating in that sense. So what are some of the most interesting challenges in epidemiology today, in the modern landscape of epidemiology? I think it's an interesting question, because I'm not sure any challenge is more interesting than another.
But something that I'd like to mention is that you can actually prioritize which research needs you're going to answer. There is this global project called the Global Burden of Disease. So they're collecting data,...

...and modeling, well, modeling to do some smoothing, and some modeling when data is not enough, when data is too scarce. And what these people are doing is making a list of the most important causes of mortality, for instance. So right now you can see, for instance, whether malaria is an important cause of mortality and whether its effect is decreasing or increasing. So using Global Burden of Disease data, you could actually choose an epidemiology project that seems more crucial because it affects more people or because it's increasing. Great, and is this all open data that they have? Yeah, I'm not too sure whether the raw data is open, but they have a fantastic website where you can play with data visualizations. So they're doing great work at communicating their findings, in scientific journals, but also on this website with visualizations and interactive tools where you can really visualize and play with the data, to really understand the findings for yourself. Yeah, that's really interesting, and I think the role of interactive data visualization is becoming more and more exciting. Yeah, and actually I use a lot of interactive visualizations myself. I mean, I don't play with them a lot, but I remember an epidemiologist telling me about the Global Burden of Disease project, saying, what I really like about them is this, and showing me an interactive visualization. So I think they have really chosen a great way to communicate what they do. So what are some other challenges in epidemiology that you currently find interesting or are thinking about? Well, as I said, the Global Burden of Disease helps you see what the most important causes of mortality and disease are. And if you compare that with existing research, with results that have already been published, you can sometimes see a gap. Look at air pollution, at the effect of air pollution, for instance.
So most research has happened in high-income countries,...

...while most exposure happens, and most people currently live, in low- and middle-income countries. So I've worked for a research project called CHAI. This means Cardiovascular Health effects of Air pollution in Telangana, India, and this is one of the research projects that aims at bridging the gap between existing research and the much bigger need we have in the field of air pollution. Amazing. And so does that involve collecting more data, or what type of things is CHAI doing to bridge these gaps? So CHAI has collected a lot of data, because in a place like India, it's not like looking at air pollution in, say, the US, where you can partly rely on data collected by the EPA. You cannot really do that in India because there are fewer monitors, so data collection there is scarcer and not necessarily open. And because CHAI is happening in a rural area, there was no air pollution monitor for ambient air pollution. So CHAI set up ambient air quality monitors, and on top of that there was a lot of data collection from people: having them wear, for instance, personal air pollution monitors, having them fill in questionnaires. So a lot of data was collected from the environment and from people themselves. And so data collection is one thing, but then I suppose data management is a huge challenge when you're collecting so much data. Yeah, as you can imagine, it's a lot of data, so you will necessarily have a few mistakes and a few things that don't really correspond to one another. So it was important to clean this data. For instance, we were using a very cool personal air pollution monitor, which was small and which gave you really regular measures of a person's exposure to small particles in the air. So this is a cool device, but it's actually not completely production ready. So it had a lot of issues in...

...the data. So we had to spend, for instance, a lot of time looking at this data and cleaning it before using it. So this project in particular was in a rural area, you said. What's the role of urbanization in thinking about epidemiology? So because of urbanization, a lot of environmental factors are changing in the places where people live. This happens in India, but it happens in other countries too; a lot of people are living in cities. So I'm going to branch out from CHAI and mention another project I've worked for, which was looking at how city infrastructure may change people's health. So the way your city is organized will influence, for instance, the way you go to work, the way you're commuting, and this project was looking at the influence of putting more cycle paths in your city on health. It was very interesting because we had to model the influence of the length of bike paths throughout a city, how this influences the number of people that go to work on a bike, with public transport or in a car, and how this in turn influences health. That's incredible. That's such a diverse set of tools and ideas and concepts. I mean, you have city planning there, you have health, and you have data science, data management and statistics all converging. That must be a really exciting type of project to work on. Yeah, exactly. It also means that you never feel completely expert in any of these things, but you're learning a lot, and that's what makes it so motivating. Yeah, and that's something working data scientists always say to aspiring data scientists: you need to be willing, and have a strong desire, to be learning constantly, because you'll be working in domains which are challenging you, that you may not have so much experience in. Right, yeah, exactly. Like when...

I started working for CHAI, I had never done data management or much air quality work. You need to have, as a data scientist, the willingness to learn, but also the capacity to know where to look for resources to do things right, or to learn how to do them right. Yeah, in general, how do data science and statistics help to solve challenges in epidemiology? Well, epidemiology is really an evidence-based field. When city managers or public health people make decisions, often they will use evidence from epidemiology, and behind that evidence lies a lot of data analysis and statistical work. So that's what statisticians and data scientists are doing in epidemiology. Something that's interesting: statisticians like to develop fancy methods for doing things that epidemiologists are having issues with, but an important point, what will really make statisticians make a difference in epidemiology, is when they actually publish their work as software, as something that epidemiologists can use. So I think an important part of the work of data scientists in epidemiology is providing software tools for epidemiologists to then use. Yeah, so that would involve not just publishing a result about some convergence property in a statistics paper with one use case, but actually getting open source code out there in the real world and hopefully getting a bunch of users trying it out on a variety of projects. Right, yeah, exactly. And actually, when you're writing such a paper, like a methods paper, you actually write code, because you're testing your methods for yourself on a simulated data set or on a real-world data set. So you've got code somewhere on a computer. The real effort you need to make is then documenting this code and, for instance, if you think of R, putting it into a package, documenting it and testing it. And this...

...part is crucial, but it's not rewarded a lot in the academic system. But if it doesn't happen, your methods won't be used by real-world people and won't make any difference for epidemiology. And so actually, in your thesis, you have a great quotation which speaks to this, right? Oh, yeah, right. I was actually asked by my supervisor, during the last week of my thesis writing, to add a quote at the end of my thesis, and I panicked because I had no idea what to write. You know, it wasn't a time to be inspired; I was really stressed by writing my thesis. But then I thought of one of my favorite statisticians, who sadly died this year, Hans Rosling. He was a Swedish statistician, and he said, "Let my data set change your mindset." That's a great quote, because he was using a lot of data to show people how the world really was, which is different from our prejudices when we don't think with data. And what I said is that having the statistical tools is a different version of his quote. So I said, "Let my tool set change your mindset about your data set." So: let me show you everything you can do with your data if you just apply these methods. Your data is more valuable than you think, if we use the right tools. That's amazing. So you altered "let my data set change your mindset" to "let my tool set change your mindset about your data set." I love it. It's now time for a segment called Blog Post of the Week, with DataCamp curriculum lead Spencer Boucher. So, Spencer, you're here today to talk about a blog post that you read this week and loved. Yes, this week I've got a blog post called Project-Oriented Workflow, by Jenny Bryan. Jenny's a software engineer at RStudio and a professor at the University of British Columbia. Well, we'll include a link to this in the show notes. What's the post about? Well, when Jenny gave a talk at...

...an R conference in New Zealand on the topic of using R efficiently, two of her slides in particular generated a lot of heated discussion, so much so, in fact, that today's blog post of the week was born. The crux of Jenny's post is essentially a call for analysts to clearly delineate workflow from product in their analyses. So how does Jenny draw the distinction between workflow and product? Okay, your workflow incorporates all of your personal preferences when doing data science. So that's your text editor, home directory, any convenience functions that you like to use in your interactive console, etc. The product, on the other hand, is the data and analysis itself, including everything necessary for anyone else to run your analysis and get the same result. So how can workflow and product get mixed up? Okay, so Jenny works in R, so I'll use her R examples, but the same ideas apply to Python and any other language that you use for data analysis. One really common example is changing your working directory explicitly in your scripts. This may work fine on your computer, but it will single-handedly cause your script to fail anywhere else. And what's another example? Sure. Although code in your .Rprofile is a great way to customize your working environment, you shouldn't put things there that affect the behavior of the code in your scripts. So, for example, changing the number of rows of output when you print a data frame in your interactive console, that's fine, but changing the way R handles NAs or factor variables by default in your .Rprofile will cause errors or, even worse, inconsistent results. So why is this a big deal and why should we care? Well, ultimately it all boils down to reproducibility. Most of us have experienced the pain of receiving an analysis from a data scientist who didn't properly separate workflow from product, and if you haven't, you will some day. Don't be that person.
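The working-directory point above is language-agnostic. Since the segment's examples are in R, here is a minimal Python sketch of the same idea, keeping machine-specific paths out of the product; the `data/` layout and file name are hypothetical illustrations, not anything from the episode:

```python
from pathlib import Path

# Fragile (the pattern Jenny warns against): hard-coding your own
# machine's layout into the script breaks the analysis anywhere else.
# os.chdir("/home/me/projects/air-quality")   # don't do this

# Portable: treat the project directory as the root and build every
# file path relative to it, so the analysis runs wherever the project
# folder is copied.
def data_path(project_root: Path, *parts: str) -> Path:
    """Build a path inside the project's data/ directory."""
    return project_root.joinpath("data", *parts)

root = Path.cwd()  # in a real project: the directory containing the script
print(data_path(root, "raw", "monitors.csv"))
```

The script never mentions an absolute path, so a collaborator can run it from their own copy of the project without edits.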
If you follow the golden rule of separating workflow and product,...

...your analyses will be more accessible to others, meaning better feedback from coworkers for you and more exposure for whatever your awesome side project is. So how, when doing analyses, can our listeners out there make sure that they properly separate workflow and product? Here are a couple of Jenny's recommendations. Don't use a single R process across all of your analyses; keep each analysis as a separate project with its own R process. Also, periodically run your script in a fresh R process from scratch, just to make sure it's in a fully runnable state. So definitely go check out Jenny's blog post in the show notes. You'll be able to implement her suggestions in probably just a few minutes, your life will get easier overnight, and your analyses will be more shareable for the rest of your career. Thanks, Spencer, for sharing your blog post of the week. Thanks for listening, everybody. Time to get straight back into our chat with Maëlle. And so I want to jump back into talking about epidemiology, but actually before that, because I know you're so passionate about tools and software development for practical statistical, data science and epidemiology projects, I thought maybe you could speak to your role in rOpenSci, for example. Oh yes. So rOpenSci is a community of researchers and developers that provides tools for open science and reproducibility, and I got involved in this community when I saw, well, I think two years ago now, that part of the tools rOpenSci has are developed by staff, but another part, a big part of them, are developed by the community, which submits their packages to rOpenSci. And rOpenSci ensures a good quality of software by having an onboarding process. So when you want your package to become part of rOpenSci, you need to have it onboarded, and what happens is an open and transparent review of your code and documentation by two independent reviewers. So all of...

...this happens on GitHub, using guidelines. So what I did two years ago was submit my package, and I wasn't too sure what to expect, and what happened was a very friendly and helpful review of my package, and I got more and more involved, and I'm now a co-editor for onboarding. Yeah, and so I help ensure this process goes well, and we really ensure the quality of the software in the rOpenSci suite of tools. That's incredible, and I think it actually fills, well, clearly it fills a huge gap that's left open, you know, by the world of scientific publishing and scientific journals, right? I mean, essentially, one way of viewing some of the work at rOpenSci is that it's a peer review process. Similarly to how scientific results that come out in journals go through peer review, this is a peer review process for software, to help scientists do their job. Yeah, exactly. And it really increases the focus on your software quality. When you publish a paper as a statistician, no one is going to look at your code, I mean most often. So maybe you have a huge mistake in there and no one is ever going to find it, and having software review is a way to prevent this problem from happening. So rOpenSci is part of the solution to this problem; of course, there should be more software review in general. And, interestingly, at rOpenSci we have partnerships with two scientific journals, because when you submit your package to rOpenSci, you're not actually publishing it as an academic paper, but because of these partnerships with the Journal of Open Source Software and Methods in Ecology and Evolution, you can actually have your software reviewed and get a DOI, you know, something that you can put on your academic CV and that's more valued in the current system. And you recently wrote a blog post about writing R packages. Yeah, so I was invited to an R in ecology hackathon to give my top...

...tips for developing software packages that are high quality and user friendly, which is a bit daunting, but there are a lot of good resources for doing that out there. So my blog post is a summary of all the resources that I know, that I've used, like books and blog posts that one can read to improve software writing, and I think it's important to have such summaries. But also I was asked to give my top tips, which surprised me, because I don't feel like an expert; just two years ago I was submitting my first package. And I really hope that things like my blog post help people find resources to build their skills, because I think that if I was able to learn all these things, or part of these things, in two years, then a lot of people who wouldn't think themselves capable would actually be able to learn more about software quality. And this speaks to a more general notion of knowledge organization as well, which I know you're passionate about, when thinking about the role of statistics and data science in epidemiology, ecology, air pollution, insurance maths, whatever it is, and I know that the idea of organizing knowledge is something you think a lot about. So maybe you can tell us a bit about that. Yeah, so, as a data scientist, often as a statistician, you will be the person that formulates the issue someone has, as an applied person, in mathematical terms. By making it more abstract, by seeing what you can get from a model or whatever, that is the way you will find resources from other fields, from, I don't know, pure mathematics through statistics, to solve the issue. For instance, here's what happened to me during my PhD thesis: I was seeing something in my data with the distributions I was using, and my use case was reporting delays. So the fact that when someone gets sick with, say, salmonella, the national public health institute doesn't know of it right away. So there's...

...a time, a delay, between this person getting sick and their case being reported. So this is one use case. And I had this distributional issue, a statistical issue. I was seeing something; I needed a theorem and a proof of a theorem, and I needed to google for a long time because I was using the keywords from my epidemiology field. And then, after a while, I found exactly the solution to my issue, but in insurance mathematics, and actually in a German book. So I was really lucky to find that. And I guess that's what I mean by knowledge organization. Because sometimes different fields use different keywords, people might be doing exactly the same thing, the same statistical thing, and not know that this issue has been solved elsewhere. So, as a statistician, you need to formulate the question in statistical terms, and also you need to be able to branch out a bit and look in other fields to see what has been done. Yeah, and so as a community, how can we, a community of data scientists and statisticians, work towards better knowledge organization? I guess an important thing is that we have this community, as you say, and that there is a single data science community: there is not an epidemiology data science community or an insurance mathematics data science community. So people talk to each other, and they should do a lot of that, because then they can see these parallel things. We really need to think of data science as a whole and to talk to each other in this global community, no matter the field where we apply data science. And I'm actually really glad that you mentioned keywords and search engines and this type of stuff, because this is another piece of advice I always give to people starting out in data science: search engines will be your best friends. Oh yeah. And in fact I did a live coding session recently, it was on Facebook Live actually, where there was a pandas function I...
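The reporting-delay problem Maëlle describes can be sketched with a tiny simulation (a hypothetical illustration, not her thesis code): each case has an onset day and a random delay before it reaches the register, so counts for the most recent days look artificially low until late reports arrive.

```python
import random

random.seed(42)

# Hypothetical sketch of reporting delays in disease surveillance:
# cases occur over a 30-day window, and each report arrives 0-9 days
# after onset. We look at the register on day 30.
HORIZON = 30
N_CASES = 1000

onsets = [random.randrange(HORIZON) for _ in range(N_CASES)]
delays = [random.randrange(0, 10) for _ in range(N_CASES)]

true_counts = [0] * HORIZON       # cases that really happened each day
observed_counts = [0] * HORIZON   # cases the institute knows about

for onset, delay in zip(onsets, delays):
    true_counts[onset] += 1
    if onset + delay <= HORIZON - 1:  # report arrived by observation day
        observed_counts[onset] += 1

# The last few days are systematically undercounted.
print("true:    ", true_counts[-5:])
print("observed:", observed_counts[-5:])
```

Correcting for this undercount is exactly where a delay-distribution model, like the ones Maëlle found in the insurance mathematics literature, comes in.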

...just couldn't figure out, and it was actually relatively awkward in the end because it took me like six or seven minutes to get there and I wasn't sure I'd be able to finish the project. And in the end the response was, Hugo, that was some of the most valuable stuff, seeing you really, like, freak out in a search engine, because that's the type of stuff we need to do when we work as data scientists. Oh yeah. I was actually certified as a DataCamp instructor this year, so I followed this training, and as part of the DataCamp instructor training they tell you that it's actually great if you make a mistake in front of the students. You shouldn't hide it; you should really embrace it and show them how you solve mistakes, how you solve a bug. So, speaking of the ins and outs and, you know, workings of data science in your work, what type of data science or statistical tools did you find interesting in your bike paths project? So what happened with this bike path project, which was not my main project: I had this colleague who was working on city infrastructure, and she had data from six cities in Europe. She had the length of cycle paths in these six cities, and she was a bit unsure how to use them, because cities define cycle infrastructure in different ways. Like, you could have one city saying this whole road is safe, so this is cycle infrastructure, but it's not dedicated cycle infrastructure. So she asked me, do you think we could use OpenStreetMap data to solve this issue? And I said, oh, sure. I thought it was a fun challenge. So there are tools for getting OpenStreetMap data, and OpenStreetMap has labels for dedicated cycle paths. So this happened for six cities. And once you've written the code for six cities, as you know, you can scale this code to any number of cities you want. You just need to change the names of the cities.
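The "write once, scale to any city" point can be sketched like this: once the city name is just a parameter, the same code runs for six cities or six hundred. The query function below is a stand-in with made-up placeholder values, not a real OpenStreetMap client or any actual project code:

```python
# Hypothetical stand-in for a real OpenStreetMap query (e.g. via an
# Overpass API wrapper) that would return kilometres of ways tagged as
# dedicated cycle paths. The numbers here are fabricated placeholders
# purely to make the sketch runnable.
def cycle_path_km(city: str) -> float:
    """Stand-in for a real per-city OpenStreetMap query."""
    fake_results = {"Barcelona": 116.0, "Vienna": 245.0, "Zurich": 88.0}
    return fake_results.get(city, 0.0)

# Scaling from six cities to any number is just a longer list:
cities = ["Barcelona", "Vienna", "Zurich"]
lengths = {city: cycle_path_km(city) for city in cities}
print(lengths)
```

The design point is that nothing city-specific lives in the analysis logic itself; only the list of names changes.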
So what started as a question for six cities ended up with us getting data for more than one hundred cities in Europe, really increasing...

...the quality of the modeling in that project. So I guess the techniques in that work were mostly data wrangling and data downloading from OpenStreetMap. Yeah, and this speaks to a point that has come up a couple of times in our conversation now, so it's worth focusing on for a second: this idea of open data and how relevant and important and useful it is. What's your take on open data? Well, for instance, in the bike project, we loved open data. Without open data we wouldn't have got all this cycle path data. And we also used an existing database of the percentage of people in a lot of cities in Europe that commute, that do their trips, on a bike or on public transport. This data exists, and without it, it would have been much more difficult to do the health impact assessment of cycling infrastructure. Then there's air quality, which was so important to me when working on CHAI and which is still very important to me. We need air quality data to assess the effect it has on health and the effect of different measures. But it's actually very hard to get air quality data in many parts of the world, and it's more complicated than it might seem. Say a city has a website with the current value of ozone. That happens sometimes; you have such a website, so you get the current value, but you never get the past values. So values disappear over time. And what I discovered when working on CHAI is a very cool initiative by Christa Hasenkopf and co-founder Joe Flasher. These two people developed a platform called OpenAQ that saves all this data from many websites in the world and makes it available for everyone to use. So these values don't disappear over time anymore. That's awesome. And OpenAQ has all its data available, but my understanding is that even on top of that they've got software, right?...

An R package, yeah. So that's my contribution, my small contribution, to this much larger effort: writing an R package to get the data. So they have an API, they make the data really easy to access after writing all the adapters for official websites and data websites, and I wrote this R package so that anyone can get this data in R without knowing anything about API queries. What types of techniques are most important when thinking about this type of data? I mean, you've got geospatial data, time series, you're interested in prediction. What type of techniques do you use when working on this? You mean on air quality? Yeah, air quality epidemiology, the type of stuff you've been working on. So my impression is that often we have not been using fancy techniques. What was taking me the most time was thinking of how to communicate uncertainty to people, to colleagues, in papers, because it's such a complex and important notion to communicate: not giving a single number when you're estimating something, but also giving confidence intervals or prediction intervals, and explaining them. So that's a really important thing. Absolutely, and we see that in the media when presented with jobs reports or polling and that type of stuff. Commonly a lot of these numbers are reported as concrete numbers that are taken at face value, as opposed to the news being reported as: this is a number, but this is how certain or uncertain we are about it. It would be great to see that type of stuff more and more, I think. Yeah, and I'm thinking of an example. It's not related to epidemiology, but I saw it recently, I don't remember where it was said. You know, when you get a review on Amazon: you should trust a rating that has been written by more people in total more than a rating based on only, I don't know, five opinions. That is a great example.
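Returning to the ropenaq package mentioned a moment ago, here is a minimal, hedged sketch of what fetching air quality data from R can look like. The wrapper function, city, parameter name and limit below are invented for illustration, the call needs the package installed plus network access, and the OpenAQ API has evolved over the years, so the exact arguments of `aq_measurements()` may differ from what's shown:

```r
# Sketch: pull recent PM2.5 measurements for a city via ropenaq
# (hypothetical parameter values; requires the package and network access).
fetch_pm25 <- function(city = "Delhi", n = 100) {
  if (!requireNamespace("ropenaq", quietly = TRUE)) {
    return(NULL)  # package not installed, nothing to fetch
  }
  # aq_measurements() wraps the OpenAQ API, so no manual query building
  ropenaq::aq_measurements(city = city, parameter = "pm25", limit = n)
}

# NULL if the package is missing or the request fails, a data frame otherwise
res <- tryCatch(fetch_pm25(), error = function(e) NULL)
```

The point of the wrapper is exactly what Maëlle describes: the user asks for a city and a pollutant and never touches the API directly.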
I mean, in essence, we're talking about sample size there.
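The review example is easy to make concrete in base R: an exact binomial confidence interval for the share of positive reviews (the counts below are invented purely for illustration) is far wider with five reviews than with five thousand, even though the observed rating is the same:

```r
# Same observed rating (80% positive) at two very different sample sizes;
# the counts are made up for illustration only.
small <- binom.test(4, 5)$conf.int       # 4 of 5 reviews positive
large <- binom.test(4000, 5000)$conf.int # 4000 of 5000 positive

width <- function(ci) ci[2] - ci[1]
width(small)  # wide interval: we know very little about the product
width(large)  # narrow interval: we can be much more certain
```

Reporting those intervals alongside the point estimate is exactly the kind of uncertainty communication discussed above.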

Right. So if you've got a few reviews, you can't really be certain of how good the product is, whereas if you've got five thousand to ten thousand reviews, you can. Very cool. So we've touched on this idea of data management, which, of course, is becoming increasingly important as we get more and more data from the world. Maybe you can speak about some of the work you've done in data management, particularly on CHAI. Yeah, so one thing that was important was, of course, data cleaning and building an SQL database. But another thing that was important was documenting the database. Like, now I've left this project, and people will use the data, and a few years from now most of the experts from the project will have left. So how are people going to be able to use this database? You need to document it. Because it was the first time I was documenting such a big database, I had to look for ways to do it, and the best way to document a database nowadays is to use metadata, which is data about data, following a standard. What I used for CHAI was EML, the Ecological Metadata Language, but it's actually also applicable to other fields. So it's a set of XML documents. That obviously sounds... well, it sounded very scary to me, in any case. But when you use such a standard, because it's machine readable, you can compare your documentation to the standard and validate it. So, say the standard says you need to document units; based on this, you know you need to add them. For me it was very good to find such a solution to document data, because I was really afraid of forgetting something, since it was the first time and I wasn't sure what to put in the documentation. Using a standard, you can be sure, or at least you know, that you've documented everything that should be documented, and in a good way. And because it's machine readable, we can hope that in a few years, if someone has data from a similar project...

...somewhere else, it will be easier for a researcher to merge the data sets and to use them in a meta-analysis, because the documentation is machine readable and similar in both projects. What type of problems would occur if you hadn't done this? Well, as I mentioned, I could have forgotten some important things. Like, if you don't document units, your numbers become useless a few years from now, and I guess that's one of the most important issues. And if I think of the first database I ever documented, a long time ago, I remember writing a Word notebook. I think I had saved it as a PDF, but maybe I could have forgotten to do that. So if your documentation is in some proprietary format, you could think that maybe, a long time from now, people won't be able to open it, whereas with an XML file this won't happen. It's like a text file: you can still open it and read it. Yeah, the idea of units is really interesting. I'm going to get this story completely wrong, so I'll just mention it very briefly. Greg Wilson, a colleague of mine who you know, who co-created Software Carpentry originally, has a story of a data analysis project where there was one feature called temperature and it wasn't marked whether it was in Celsius or Fahrenheit, and it was from the Antarctic or something like that, so it was around the range in which Celsius and Fahrenheit values overlap. What they had to do in the end was trace back where the ship the readings were taken from had left from, because the port where the ship leaves will tell you what units were actually used. Something incredible like that. So I think that example speaks to exactly what we're talking about. Yeah, and I just mention that because we had a monitor whose output...

...was also a temperature, like that, in CHAI, but we had the units. There was another variable: we have a table with numbers that are temperatures, and another column in this data set, if you wish, that holds the units, because the units are not the same for every line. Yeah, and the other thing that you hinted at was data management being essential to having reproducible workflows and reproducible results downstream as well. Yeah, and I also think that things you do by hand can't be reproducible. So everything I've done with this data was done with an R script. I did nothing by hand. Like, at some points I came across typos, and I had a lot of R gsub commands for correcting all the typos, because I really didn't want to open the file and change the typos myself. This way, even the data management is reproducible. So how we got from the raw data to the corrected data that we use ever after, that is reproducible, which is great, because if at one point we're surprised by one value, we can probably trace the mistake, or the origin of the problem, back to where it came from. Now it's time for a segment called Stack Overflow Diaries, with DataCamp curriculum lead Cara Woo. Hey, Cara. Hey, Hugo. I've got a really fun installment of Stack Overflow Diaries for you today. I am pumped for this. What question are you going to tell us about? The question comes from the statistics Stack Exchange site, and user Kevin Kim wants to know how to understand the drawbacks of k-means clustering. I'll post a link in the show notes and definitely encourage people to read the answers to this one. Can you tell us a bit about k-means clustering, just to set the scene? Absolutely. K-means clustering is a way to divide up data into a given number of groups so...

...that each data point belongs to the group with the nearest mean value. So, for an example, say you had a data set of the weights and heights of some babies and adults, but the data isn't labeled, so you don't know which are babies and which are adults. If you plot this data on a scatter plot, there would be two fairly distinct clusters of data points, and k-means is a way to automatically group data into these clusters. That's a great example. So what was Kevin Kim's question? I'm going to paraphrase it a little bit. Kevin Kim says that the drawbacks they've read about k-means clustering are that it essentially assumes that clusters are spherical and roughly the same size. However, Kevin's understanding of k-means is that, no matter what, it will produce clusters that minimize the sum of squared errors. So they don't understand the link between this and the assumptions described. That is a great question. Now I'm looking forward to the answer. Part of the reason I chose this question is that it features a truly stellar answer from Dave Robinson. By generating and plotting some data, he demonstrates how minimizing the sum of squared errors does not necessarily yield the most natural clusters. In the first case he shows a scatter plot of data with two clusters. One cluster is concentrated in the center of the plot, and then there's a ring of data points around it. To the human eye these are clearly two concentric clusters of data but, as Dave demonstrates, k-means clustering splits them right down the middle of the plot instead. In another example he shows data that, to the human eye, clearly form three clusters with different numbers of points, but k-means ends up splitting the largest cluster into two. And another user, an anonymous one, expanded on Dave's answer and pointed out that k-means will also create clusters even on data points that are totally uniform. I'm a huge fan of showing these types of principles at work by example.
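The concentric-rings case Cara describes is easy to reproduce in a few lines of base R. The simulated data here is my own sketch, not Dave's exact code: a dense central blob surrounded by a ring, which k-means with two centers splits down the middle instead of separating blob from ring.

```r
set.seed(42)

# Simulate two concentric clusters: a dense blob plus a surrounding ring
n <- 200
blob <- cbind(rnorm(n, sd = 0.3), rnorm(n, sd = 0.3))
theta <- runif(n, 0, 2 * pi)
ring <- cbind(3 * cos(theta), 3 * sin(theta)) +
  matrix(rnorm(2 * n, sd = 0.1), ncol = 2)
x <- rbind(blob, ring)

km <- kmeans(x, centers = 2, nstart = 20)

# The natural grouping is blob vs ring, but minimizing squared error
# makes k-means cut the plane in half, so each k-means cluster
# contains a mix of blob points and ring points.
table(km$cluster, rep(c("blob", "ring"), each = n))
```

Plotting `x` colored by `km$cluster` makes the failure obvious: each k-means cluster contains half the blob and half the ring.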
So, to reiterate, k-means will always minimize the sum of squared errors, but the point is that there are data sets in which you don't want to do this in order to cluster them, right? Exactly, Hugo. Although Dave shows cases where k-means, quote...

...unquote, fails, as he points out, these failures reflect the assumptions of the algorithm that allow it to succeed in the kinds of cases it's designed for. I'm going to quote Dave because he says it so well: "Assumptions are where your power comes from. When Netflix recommends movies to you, it's assuming that if you like one movie, you'll like similar ones, and vice versa. You can't make a recommendation algorithm without making some assumptions about users' tastes, just like you can't make a clustering algorithm without making some assumptions about the nature of those clusters." Thanks, Cara. If you enjoyed that, listener, be sure to check in for the next episode of DataFramed. Our guest will be none other than Dave Robinson himself, here to talk about citizen data science. Thanks, Cara, once again, for reading us a page from your Stack Overflow diary. Of course, Hugo. Chat next time. Let's now jump back into our interview with Maëlle Salmon. Something we've mentioned in passing is the idea of community in data science, and you're part of the R stats community, which seems like a really wonderfully welcoming, open community for the most part. Can you speak to the idea of community and how you've felt your work has developed? Yeah, so I've always felt very welcome in the R community. I think I got on Twitter roughly at the same time that I started doing data management, reading things on Twitter, asking questions. Before, when I was working on my PhD, I was on my own, but then I started discovering this whole community. So it's welcoming, but I think we could do better. Right now the R community is not as diverse as it could be. We're probably missing talents from some parts of the world. We're probably missing talents from women.
So there are currently some initiatives in the R community helping to increase diversity, to not miss these talents, and also to make the community welcoming for everyone, even the people that had bad experiences before, so that they feel welcome.

They feel they can take part in this community. What types of initiatives are you talking about? Yeah, so there is one from the R Foundation called Forwards. It started out as a task force of the R Foundation for women, and then it branched out and it's now an initiative for diversity: for women, but also, for instance, for people from parts of the world that are often not represented in the R community. And they have a lot of actions, one of them being surveying people after useR! conferences to get a better idea of the experience and of the characteristics of the people taking part. They also gave scholarships for useR! this year, so some people that probably couldn't have come to Brussels could come. And when they did come, thanks to their scholarship, they could also take advantage of a conference buddy system. Things like... you can be at a conference and not meet anyone, because it's very hard to start speaking to people if you don't know any person. So the idea of conference buddies was to have a first event for the scholarship holders to meet the conference buddies, and then, during the rest of the conference, they had someone to talk to and could meet many other people, otherwise the conference experience can be less cool if you feel shy and isolated the whole time. So that's the kind of thing Forwards does. This is one initiative. Another one is R-Ladies, which is a grassroots organization, so it doesn't come from the R Foundation. It was created locally in San Francisco by Gabriela de Queiroz a few years ago, and then there was another chapter in London, and these chapters got together and created R-Ladies Global, which has helped create many local chapters since, I think, two thousand and sixteen. For instance, I co-created the Barcelona chapter in Spain, and these chapters...

...organize meetups and provide a safe and friendly environment for women and, actually, any person that does not identify as a man. So they provide a good experience for all these people. Yeah, that's fantastic. So that's diversity of data scientists. What about diversity of users? When you're a data scientist, your user can be... I mean, it depends on your perspective. Say, even if you're not a data scientist, you're an epidemiologist and you create a map of incidence of salmonella in a region, and you decide that the counties with the most incidence of salmonella will be red and the ones with the least incidence will be green. You have what you think is a very intuitive color scale, because red is bad and green is good. But actually this is the kind of visualization that cannot be read by colorblind people. One might think that, because the majority of people are not colorblind, it's not an issue. But first, this is quite unfair: you should think of all your users and try to accommodate everyone that might want to use your visualization. And also, it's interesting to see that there are tools currently, for instance in R, that help making colorblind-friendly graphics without much effort. So you have a package for seeing how your visualization would look if you were colorblind. It's called colorblindr, and if you use it, you already see that red-green is not a good idea. And then you have other R packages providing color scales that are colorblind friendly: viridis, or also colorspace, for instance. So this means it is very easy for you to make your graphic friendly to colorblind people; you just need to be aware of the issue. And this is one issue that I know of, but I guess that sometimes, when I produce something, I'm not thinking of all users, and I'd like to know more about how...

...to serve the diversity of users of data science products. That's cool, and I actually appreciate that, as I myself am red-green colorblind, so I definitely appreciate that. Are there any other types of users? Yeah, so you might have noticed by now that I'm not a native English speaker, and I guess that's also the case for a lot of people in, say, the R community, and most of the materials in data science and R are in English. We need efforts to accommodate all kinds of users, even the non-native English speakers. So, right now I'm an R user, I have experience with R, so I'm not afraid of reading R error messages. But say, when you start using R, R is a foreign language for you, and if English is a foreign language for you too, it's even more difficult. So R error messages are translated, which is great. And if you start looking at videos to learn about R, say on DataCamp, it's great to see that DataCamp now has subtitles to help people really be able to grasp all the content without too much effort. So we absolutely should not forget that not everyone speaks English to begin with. And we're very fortunate at DataCamp that we've actually had members from different communities reach out to us and translate some of our courses into their native language. I think in the early days even our first Introduction to R course and Intermediate R were translated into Mandarin, so that existed in the community, which was beautiful, to have that type of outreach. It's fantastic that error messages can be translated into different languages. How many languages, do you know, are supported in R, for example? I actually have no idea... I would say a dozen. I actually looked at it recently because I was at an ecology hackathon where a group was working on translating error messages...

...not into another language but into simple English, like, you know, the Wikipedia simple English, so making error messages understandable for their students. And that's why I looked at this page that R has of all the translation teams for error messages, and I think a good dozen languages were supported. Yeah, and I was saying on Twitter that most of the R conversation happens in English, on the rstats hashtag, but recently two new hashtags have emerged, one of them in Spanish, rstatses (the same hashtag with "es" at the end), and the other one is rstatsfr, for French. So I really hope that these conversations will happen in as many languages as R error messages, for instance. So a lot of people, I think... there's a perceived barrier to entry for data science, thinking that it's something that only experts can do, which we're trying to break apart in a number of ways. My question to you is: what are the roles of education and outreach in making data science and statistics as accessible as possible for everyone? When you start in data science, you will need to learn things, so I guess that's why education is so important. But what's important is to really empower people with the belief that they are able to learn these things. So not tell them, hey, go and educate yourself, but help them to get educated in a really friendly way. And one example I have of this: I submitted my first package to rOpenSci two years ago, and really, there were things I didn't know anything about. I think I didn't even know continuous integration, which is part of the guidelines for packages. And the reason why I stayed in this community and learned so much is because people were so friendly and helpful, and they really had no trouble helping me with something. We keep doing this at rOpenSci: if you're not too sure about some things in package development, we...

...can help you with that before you submit your package. So this means people feel they can learn about package development and, of course, that's how they learn about it. And what about the role of community in general in helping people get started? I guess, for instance, if I think of the R-Ladies community, we tend to serve a lot of beginners. For the Barcelona local chapter, we have these meetups where we talk about R, but we also added a local Slack to discuss things, and we had an R help channel where people could ask questions. And I guess the community part of it is that the people in this community are sort of friends, so you're not afraid to ask them questions. So you all learn together in this safe environment. So it's really about being as open and welcoming and, I suppose, having as much empathy as possible for beginners. Yeah, and this idea of being empathetic with beginners is something that was made popular by Jenny Bryan, for instance; she really pushed this idea. I think it's an interesting idea: not to think of beginners as stupid people, for instance, or as people who don't make enough effort. I'm sure a lot of people think that anyway, but really think of R from this perspective and how you can document your tools, for instance, so beginners can understand them, because we were all beginners to start with. So we've discussed a lot of different aspects of data science: techniques, methodologies, applications. What's one of your favorite things, techniques or methodologies, for data science, something you enjoy doing the best? Something you've mentioned previously: you said, right, that search engines are our best friends, that we're standing on the shoulders of giants. So that's my favorite technique. When I have an issue, I will first look at the scientific literature, and I will google the issue to find what exists, not because I like plagiarism or something, but...

...because I'm quite sure that I can at least be inspired by what already exists, and I don't like reinventing the wheel. So I like being creative, but I like being usefully creative. So I think that's my favorite methodology: looking at what already exists. I love that answer because, you know, a lot of people come in and say extreme gradient boosting, or random forests, or whatever, or data visualization, but thinking about this practice of using search engines as an actual data science technique, I love it, and I totally agree. So we've discussed a lot about modern data science, about what it looks like on the ground today. What does the future of data science look like to you? So I guess my answer will be influenced by what I see as problems nowadays in data science. Because I am interested in diversity, maybe even more so with my involvement in R-Ladies, I really hope that future data science will be more diverse, with a real diversity of data scientists and also more awareness of the diversity of the users. And I hope, for instance, that more code will be open, and more code will have been reviewed by someone, like by a colleague or by a formal process such as rOpenSci's. So I hope it will be more diverse and more reviewed. As one last question, do you have a final call to action for people who want to break into data science in general? Yeah, so if you want to get involved in data science, or even if you're already in data science, your role as a data scientist will be to bring data science to your team. So that's what you should learn about in priority, and I think it's important to get involved in the community, and that's also what will make it fun and enjoyable for you. And there are a lot of ways to get involved in your data science community. I will speak about the R community, because that's the one I know the best.
So you can, for instance, submit interesting things that you've read to the R Weekly newsletter, so that other people can read about them. You can...

...blog about what you learn, even as a beginner, and especially as a beginner, and even as a non-native English speaker, say, if you want to write in Spanish. So you should blog and report what you've learned and how you've learned it, to help people earlier on in their learning journey as well. You should also meet people, not only online but also in the real world, by getting involved in your local R user group or R-Ladies chapter, which you could start. So I co-founded an R-Ladies chapter, and I promise it's a very cool experience and it's well worth all the effort you'll have to put in. And if you want to learn more about code quality, which will be important in your data science career because you will be writing a lot of code, you can volunteer to become an R package reviewer for rOpenSci or for the Journal of Open Source Software, which will make you better acquainted with best practices in writing software and writing science. Right, that's a wonderful amount of useful and practical, on-the-ground advice. Maëlle, it's been an absolute pleasure having you on the show. Thank you. Thank you for having me; it was a pleasure for me too. Thanks for joining, everybody, for this episode of DataFramed. Maëlle told us about some of the most pressing challenges of modern public health in epidemiology and air pollution, along with how data science, statistics and open source software tools can help us to solve them. Be sure to tune in for our next episode, in which I talk with Dave Robinson, data scientist at Stack Overflow, about citizen data science and a future in which data literacy is a skill possessed by everybody.
