DataFramed

Episode · 4 years ago

#8 Data Science, Astronomy and the Open Source World

ABOUT THIS EPISODE

Jake VanderPlas, a data science fellow at the University of Washington's eScience Institute, astronomer, open source beast and renowned Pythonista, joins Hugo to speak about data science, astronomy, the open source development world and the importance of interdisciplinary conversations to data science.

In this episode of DataFramed, a DataCamp podcast, I'll be speaking with Jake VanderPlas, a data scientist, astronomer, open source beast and renowned Pythonista. We'll speak about data science, astronomy, the open source development world and the importance of interdisciplinary conversations to data science. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed. Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems it can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast. Hi, Jake, and welcome to DataFramed. Hi, thanks, it's good to be here. So good to have you, and I'm really excited to have you on the show to chat about data science, astronomy, open source, Python, all the wide array of things that you're interested in. But first I'd like to find out about you. What are you known for in the data science community? Yeah, I think I'm mostly known as, I'm typecast as a Python person. Appropriately, I'm probably known for scikit-learn, because that was my intro into the community. That's where I got started with open source and with contribution. And I'm also known as an astronomer. I did my PhD in astronomy, and even though I don't do as much astronomy research these days as I used to, I still feel like I'm pretty close to that community, for sure. And these are all things that we're going to chat about in this conversation, from the Python landscape to machine learning and scikit-learn to your work in astronomy. But first I'd like to know how you originally got involved in data science. What's the backstory there? Yeah, so I...

...started in grad school, and I went to grad school for astronomy mainly because I liked physics and I liked astronomy as an application of physics. So I didn't really have data science on my mind at all when I jumped into it. But going into science these days you have to write code, you have to analyze data, pretty much no matter what kind of science you're doing. And so from my first quarter in grad school I started learning Python, and as I got deeper into it, I really found that I started to enjoy the data analysis and enjoy the software portion of the research that I was doing. So I published this paper in 2009 that was on using manifold learning, like locally linear embedding, to study galaxy spectra from the Sloan Digital Sky Survey. And I finished that paper, and I had spent a solid few months writing this code for efficient manifold learning with a Python interface, and at the end of it, you know, I put a tarball of the code on my website and I thought to myself, this is ridiculous. You know, the next person who comes along who wants to do manifold learning for astronomy data, they're going to have to totally reinvent the wheel. I don't know if I could go back and download my tarball and use the code the way I had written it at the time. So I emailed the SciPy list and asked, I think I actually asked about the ball tree code that was a component of that manifold learning code, but I asked if it was something that folks would want in the SciPy package, and there were crickets for a while. And then I think it was Gaël Varoquaux, who's been leading the scikit-learn package, who emailed me back and said, hey, you know, we could use this code in scikit-learn, and I said sure, that sounds great. And so I got involved from there and caught the open source bug and...

I've been doing this open source Python, and open source what became known more widely as data science, ever since then. Yeah, and you've spoken to several really interesting issues there, from the role of statistics in data science, the role of programming in data science, to essentially what you ended up doing with respect to putting this tarball up and rethinking that, which is the idea of reproducibility in science and data science as well. But with respect to astronomy: in a previous incarnation I worked in applied math in the biological sciences, and on campus, whenever I needed to figure out how to analyze my data, how to do serious robust statistics, I always went to the astronomers, because they always seemed the most adept at knowing and using the techniques that I needed. Yeah, that makes sense. I think that's probably because astronomy was kind of an early adopter of these large-scale data things. As history goes, I would peg it back to the Sloan Digital Sky Survey, which at the time was a pretty groundbreaking survey. You know, previous to the Sloan Digital Sky Survey, how people did astronomy was they decided, I want to look at this particular star or this particular nebula or this particular galaxy, and they'd write up a proposal and send it to a telescope allocation committee. In the field we call them a TAC. So you'd send your proposal to the TAC and you'd try to get allocated time to look at this particular object or class of object, and there were hundreds of astronomers vying for time on the biggest telescopes. So you needed a really good proposal to say, you know, this is why I should use this very expensive instrument instead...

...of anyone else. And then, as an astronomer, you'd go up and you'd gather that data, you'd spend a few months processing it and writing about it, and then write your paper. Right, and so this is how astronomy basically worked up until, say, the '90s and early 2000s. Starting around then, maybe a little bit earlier (the big player in this was the Sloan Digital Sky Survey, which started in the '90s), the idea was that instead of having individual astronomers use the telescope every night to look at specific things they wanted to, why don't we just make a telescope that scans the night sky and looks for everything? Right. So the Sloan Digital Sky Survey did this photometric scan of the entire sky, basically taking images of the whole sky to find out where interesting objects are, and then went back, over the course of about ten years for the first survey, and did a spectroscopic scan of about a million of those objects. And spectroscopic observations are basically taking the light from an individual object and splitting it into its different wavelengths. So you get a graph of brightness versus wavelength for the object, and that can tell you huge amounts of information about what's going on in that galaxy, in that star, in that quasar, whatever you're looking at. And so the real opportunity there was, all of a sudden there was this huge, vast swath of data that was open to anybody who could analyze it. You know, the data was posted in this public database online, and literally anybody, you didn't have to be an astronomer, you didn't have to be affiliated with a university, you could go and download that data and try to learn things from it. So it really, in a lot of ways, changed how astronomy was done. You know, all of a sudden you didn't have to think about specific objects and apply for telescope time. You...

...were thinking about data mining, right, and trying to sift through large amounts of data to figure out what's interesting, to answer the questions you want to answer. And that sort of survey astronomy idea has carried forward, and a lot of the big projects now are operating in the survey mode. There are definitely still individual telescopes that you can use to look at individual objects in the classic sense, but surveys are a big part of modern astronomy. So what you speak to there is really a revolution in the way data is collected, and then the ability for everybody to access that data. Yeah, exactly, and we're seeing a little bit of that in other fields too, you know, all sorts of fields. We have oceanographic surveys where you can go and download this ocean data from ships that have been driving around. There's genomic data and other biological data. So lots of fields are seeing this sort of revolution, but I think in many ways it was astronomers, and probably before that the particle physicists, who really led the way in kind of large, data-intensive surveys in science. Right. So my question is, once we have this explosion in the amount of data available, well, in any discipline, but let's speak about astronomy, is there a need for techniques and technological infrastructures, such as databases, to actually catch up with this? Yeah, absolutely, and if you look at the early Sloan Digital Sky Survey, the real innovation on that front was done by Jim Gray, who's well known in the computer science database community, and you'll find quotes from him all over the place because he was so prolific. But he teamed up with a guy named Alex Szalay, who was at, I think, Johns Hopkins at the time. He's at Johns Hopkins now. But they basically were the ones who came together and said, this is a new way of doing astronomy, we need new tools, and they said...

...let's develop these database tools for use in astronomy and let's start to train astronomers in how to do this. Incidentally, a small little side note: my PhD advisor, Andy Connolly, did his postdoc with Alex Szalay working on these database things. So I was brought up in that tradition in my grad school career. So I'm thinking as well of the types of algorithms you'll use to do your analysis. I'm not sure what types of algorithms you've used historically, but there are a whole bunch of machine learning algorithms, k-nearest neighbors, or even, you know, finding the mean, finding the average of a data set, which don't necessarily scale so well when you get huge data sets. Right. So has this been a challenge for the field? Yeah, absolutely, and it's a challenge because the mode that most astronomers have been used to working with data in, and still are to a big extent, is you download the data onto your computer and you visualize it in, you know, IDL or MATLAB or Python or something like that. Hopefully Python. Yeah, it's more Python these days, but ten years ago it was definitely IDL and other languages that were favored. And so it's just a different mode of working with data and a different way of thinking about data. So some of the work that I've done, for example: one thing that you often want to do in astronomy is look for things that are varying in the sky, stars that are brighter tonight than they were yesterday. And there's a certain class of variable stars that are periodic variables, which means that if you plot their brightness versus time, it has this regular pattern that's kind of like a sine wave, but slightly different, you know, not exactly a sine wave, but it's periodic. And these are important because they can be used.
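As a concrete sketch of the period-finding task described here, the Lomb-Scargle periodogram (which comes up again later in the conversation) can recover a period from an irregularly sampled light curve. This is a minimal, hedged illustration with simulated data, not the production pipelines surveys actually use:

```python
import numpy as np
from scipy.signal import lombscargle

# Simulate an irregularly sampled light curve with a true period of 2.5 days.
rng = np.random.default_rng(0)
t = np.sort(100 * rng.random(200))          # 200 random observation times over 100 days
mag = 0.5 * np.sin(2 * np.pi * t / 2.5) + 0.05 * rng.normal(size=200)

# Lomb-Scargle handles the irregular sampling that a plain FFT cannot.
periods = np.linspace(0.5, 10, 5000)        # candidate periods in days
ang_freqs = 2 * np.pi / periods             # scipy expects angular frequencies
power = lombscargle(t, mag - mag.mean(), ang_freqs)
best_period = periods[np.argmax(power)]
print(round(best_period, 2))                # ~2.5
```

The mean is subtracted before computing the periodogram because scipy's implementation assumes zero-centered data; astropy's `LombScargle` offers a more full-featured interface for real survey work.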
I'm getting really, really deep in here, but one reason these are important is...

...there's this type of variable star called Cepheid variables, named after Delta Cephei, the fourth-brightest star in the Cepheus constellation, and it was discovered early in the twentieth century by a woman named Henrietta Leavitt that if you look at the variability, the period of these Cepheid variables, you can relate that to the intrinsic brightness of these stars. And if you compare the intrinsic brightness, like the amount of energy coming out of the star, to the apparent brightness, that gives you a way to determine the distance to those stars. And that's fundamentally what allowed Hubble to discover the expansion of the universe and kind of confirm this Big Bang idea, and it basically gave birth to modern cosmology, where we're talking about the expanding universe, the Big Bang, dark energy, dark matter. That all comes from being able to locate and observe these variable stars with a particular periodic signal. And if you're given an individual star, it's pretty easy to do that. You know, you maybe compute a periodogram and you plot the star folded over a couple of the candidate periods, and you sort of look at it by eye and see which one is best and decide whether you think it's really a periodic variable, and then you compare it to what a Cepheid light curve should look like and decide if it's a Cepheid star, right? And this is the mode that, up until the 2000s, people worked in. They looked at each individual object. But in the era of survey astronomy, we're looking at surveys like the Large Synoptic Survey Telescope that's coming online in a couple of years, and this is going to have light curves for a billion stars. I hope I'm getting that right. I think it's like a billion. It'll be a billion candidate stars with ten years of data, and you just can't scale those kinds of individual inspection methods up to that many objects. And when you start trying to do...

...things like the Lomb-Scargle periodogram and some of these periodicity searches at scale, like telling a computer to just loop over the billion stars and do it for each of them, you end up with lots of issues of false positives, false negatives. You're working with noisy data that the algorithm might not handle very well. So a real big thing in astronomy these days, I feel, is taking these tried and true methods from ten, twenty, thirty, fifty years ago and trying to figure out how you can apply them at scale, or apply them with the kind of data we have. It's at scale, but it's also with heterogeneous data, with noisy data, with the various constraints of the survey, because you're no longer using the telescope in exactly the way that you want for your particular data question. You have this data stream that's coming at you, and in many ways you can't control exactly what sort of observations you get. Yeah, so you have to modify your methods to work with this sort of noisy and heterogeneous data. I get the impression that this is where the term data science kind of comes into play, as opposed to it being statistics or programming: that when we have, you know, these huge streams of data, or, as you said, data at scale, noisy data, heterogeneous data, as in a lot of other research disciplines and a lot of industries, that's when all these things come into play together, and that's kind of how data science is formed, essentially. Yeah, definitely. I mean, I think that's true. For me, in astronomy, data science is fundamentally about using these computational and statistical methods together to answer questions with data. And essentially it has to do with the scale of the data as well. Yeah, and data these days is basically synonymous with scale. Absolutely. But you...

...said you don't do so much astronomy these days. Yeah, sort of a transition. So I finished my PhD in 2012 and I did a short postdoc with the Large Synoptic Survey Telescope group at the University of Washington, and then I did a couple-year NSF fellowship, where I was actually in a computer science department and I was working on sort of astronomy applications of database research for a while, and that led me into my current position, which is at the eScience Institute. And our goal in the eScience Institute here at the University of Washington is explicitly interdisciplinary. So we're working on connecting domain researchers, like physicists and astronomers and oceanographers and geologists, connecting them with methods researchers, like statisticians and computer scientists and data scientists, and also connecting them with each other so we can get that kind of dialogue forming. So because of that, I've sort of backed off on doing a hundred percent astronomy research, and I'm doing a lot of collaborations with people from different fields where, because of my background, I end up being kind of the methodology expert. So I can talk to them about what computing tools or what statistical tools or what machine learning algorithms might work well for their data, and it's been a lot of fun because, as much as I love astronomy, I'm finding that I really love the software and I really love the methods and I really love the breadth of what you can do with them. And it sounds like you can speak to a lot of people working on different research questions in different disciplines and contribute to what they do as well. Yeah, and that part's been really fun. So we have people on staff here who are data scientists and research scientists from, you know, a dozen different disciplines,...

...and then the people I interact with on a daily basis, whether they're students or research scientists or postdocs or faculty, are coming from around campus and all different backgrounds. And so, yeah, I get to contribute and have my hands in a lot of what goes on. Now let's dive into a segment called Stack Overflow Diaries, with DataCamp curriculum lead Kara Woo. Hi, Kara. Hi, Hugo. Things are going to get a little bit meta on Stack Overflow Diaries today. We've got two complementary questions from the R and Python sections of Stack Overflow, both on the question of how to make a good, reproducible example. Links to both of these will be in the show notes. So, Kara, tell us a bit about the importance of reproducible examples and their relevance to Stack Overflow. If you've never asked a question on Stack Overflow before, it can be a little bit daunting. One of the most important things to do is create a reproducible example of the problem so that other people can try it out and see what's happening. Fortunately, there are some Stack Overflow questions that tell you how to ask a good Stack Overflow question. That sounds like Inception. That's exactly what it's like. On the R side, there are a few components listed that are key to a reproducible example: a minimal data set, the minimal runnable code needed to reproduce the error, and information on the system the code is running on. You can construct sample data using functions like matrix() or data.frame(), depending on the type of data in your problem. Another option is to use a data set that's built into R, or you can use the dput() function to paste part of your real data. But the key here is providing data that readers can use immediately by copying and pasting the R code that you provide. But you don't want to include too much code, do you? That's right, Hugo. For the code, you really want it to be the minimal code necessary to reproduce the error.
The more steps you include, the harder it is to identify what's relevant to solving the problem. It can sometimes be helpful to also include information on the system your code is running on, which you can do by pasting the output of the sessionInfo() function.
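The same principles carry over to Python, which the conversation turns to next. As a sketch, a minimal reproducible example there might look like the following; the data and column names are invented purely for illustration:

```python
import pandas as pd

# Inline sample data: small enough to paste into a question, runnable by anyone.
df = pd.DataFrame({
    "star_id": [1, 2, 3, 4, 5],
    "magnitude": [14.2, 15.1, 13.8, 16.0, 14.9],
    "variable": [True, False, True, False, True],
})

# The question then states the goal, e.g.: "mean magnitude of the variable stars".
result = df.loc[df["variable"], "magnitude"].mean()
print(round(result, 1))  # 14.3
```

The point is that a reader can copy, paste, and run this immediately, with no access to the asker's real data needed.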

Lastly, there's an R package called reprex, written by Jenny Bryan, which will help turn your R code into a reproducible example that you can post on Stack Overflow. And how about on the Python side? On the Python side, the principles are basically the same. The community recommends providing a data set definition as runnable code, or as something that can be copied and pasted using pandas' read_clipboard function. Don't reference data that readers don't have access to, though. Instead, provide some sample data that they can use. The data set should be as small as possible, as most questions can usually be solved with five or six rows of data, and too much data just distracts from the problem. After providing the data, you should describe your desired output and the code you've tried, as well as information on what other sources you've consulted. It's good practice to do your own research before you ask a question, in case your issue has been addressed elsewhere. That's really helpful, Kara. As you say, it can be daunting to ask questions on Stack Overflow, so it's cool that there are such resources to help. It sure is. Thanks, Kara, once again, for reading us a page from your Stack Overflow diary. Always a pleasure, Hugo. After that interlude, it's time to jump back into our chat with Jake. Now, we've chatted previously about the fact that you have an interdisciplinary data science seminar as well, and I looked at the recent list of talks, and you have talks from social sciences, biology, astronomy, statistics, urban planning. That seems like a very vibrant culture you have there. I'm wondering what type of value you see in, or what the role is of, these discussions across disciplines in data science. Yeah, so the main thing, kind of historically, where this idea developed, is the idea of data science as a cross-disciplinary glue in some ways.
You know, I talked to some of the professors who've been around UW for thirty or forty years, and they often mention this thing that...

...used to happen, you know, back when computers weren't on your desks, when computers were in the basement of the computer science building, right, and the way that you used a computer was you signed up for time and you went over there with your stack of punch cards or whatever, and you hung out until you could run your algorithm. And one thing that apparently happened back then, I'm told, is that that brought researchers from different places and different departments together. You know, there was this water cooler effect where you'd chat with people, and people would end up realizing that, you know, if you're an astronomer or an astrophysicist, or you're an atmospheric scientist, you might be solving the same set of differential equations and be able to compare notes on the best numerical approaches to solving those, that sort of thing. So when this eScience Institute, in its current guise, was formed in 2014, one of the explicit goals was to kind of bring back that water cooler culture, and that's one of the goals of this interdisciplinary seminar we have, and we've seen that come out. It's been really incredible. You know, in one of my first projects in the eScience Institute, I was working with a geophysicist who was studying earthquakes under Mount St. Helens and had this huge array of sensors measuring these time series, basically the shaking of the ground underneath Mount St. Helens, and wanted to cluster them, figure out a way to cluster them in a way that was computationally feasible. And we started digging around, and one of the methodologies that ended up being successful there was something that was published in an astronomy journal to cluster time series of variable stars or variable...

...galaxies. So, you know, fundamentally what was going on is you're taking time series data and trying to find similar things in it, and it doesn't matter whether that time series is coming from a telescope or from an earthquake monitor, right, it's still data. And so we've found those sorts of connections can be really fruitful. And I love that, because that essentially is working towards solving the problem of research occurring in silos, and people constantly reinventing the wheel while colleagues or collaborators are doing similar things. Yeah, it's really funny: the real barrier when you get people talking sometimes is vocabulary. You know, like one person's Gaussian process is another person's kriging, and one person's principal component analysis is another person's Karhunen-Loève decomposition, or whatever, right? So as soon as you start figuring out that people are talking about the same thing, then you can really compare notes and learn from each other in a meaningful way. Absolutely. You've got to get people together first and start to understand that everyone really is doing some version of the same thing applied to different fields. Yeah. And do you see these kinds of silos happening between academia and industry as well? Could there be more of a conversation between academic research in data science and what happens in industry? I think there definitely could. There are ways we can learn from each other, and that's another of the explicit goals of eScience, to build those academic-industry connections and try to make it, instead of a one-way door, more of a revolving door. So we have a couple of people on staff, a couple of our data scientists, who have come from industry roles.
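That shared-vocabulary point can be made concrete in code: the decomposition one field calls PCA and another calls Karhunen-Loève is the same few lines either way. A minimal sketch with synthetic data, using scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 500 samples in 10 dimensions, but with only 2 underlying factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# PCA / Karhunen-Loeve decomposition: same math, different field's name for it.
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(reduced.shape)  # (500, 2)
```

Because the data really is two-dimensional up to a little noise, the first two components capture nearly all the variance (`pca.explained_variance_ratio_` sums to almost 1 here).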
You know, one who has worked in finance and worked in machine learning at Amazon, and now she's here at eScience advising students and working on interpretability of machine learning models. And we have another guy who has more of a software engineering background, with a background at Google, Microsoft Research...

...places like this, and he's come in, and it's been really great to have him around, just in terms of talking about software design and how a researcher who wants to solve a problem in a way that other people can use can approach designing their software and designing their tools. So I think absolutely we have things to learn in academia from industry, and I hope that's true vice versa as well. Something you spoke to: you discussed data science in terms of being an interdisciplinary glue, and I know you're a huge fan of glue in general, and you know you're very well known in the Python landscape, and I thought maybe you could tell us why you love Python so much and speak to it as a glue, as you have before. You know, I love it for the reasons a lot of people say. I love its expressibility, the ease of working in it and trying things out. I've done some work in C and C++ and dabbled in Fortran a little bit, and it always feels so stifling. You know, to get from idea to implementation is a really long road, but in Python you just write the idea down. It's executable pseudocode, and you have an implementation there, and you may need to go back and figure out how to optimize it and things like that. But in my experience, the savings in development time are much more important than any savings in execution time that you might have if you start writing in C++ from the beginning. So that's one reason I like Python. But the thing that has really kept me going in Python is the community, to be honest. I love the PyCon community, the SciPy community and the PyData community, and just the fact that I can go online and read people's code and use...

...it and contribute to it, and I can post my own code and then people use it. And you know, the first time I saw someone giving a talk based on code that I had written and put out there, it was a real high. It felt like I was doing something meaningful and doing something impactful. And to be honest, I haven't always gotten that sense with my academic work. You know, you publish a paper and maybe a dozen people in your little subfield might read it, and a few people might cite it the next year in their own work. But you know, you submit an algorithm to SciPy or scikit-learn and all of a sudden the world is using it. Absolutely. And I actually recall you tweeted once, and I'm probably going to misquote you entirely, but you tweeted something like: I'm trying to figure out what to do today. Am I going to go and do what I'm paid to do and write something that several people will read, or am I going to go and contribute to something open source and reach thousands of people in the next couple of weeks? Yeah, I vaguely remember that. You might be right. So Python and all the packages you've contributed to are open source, and you've contributed to a huge range, from scikit-learn to SciPy, Astropy, and now Altair as well. What place do you think open source software and the open source community at large have in what is now modern data science? Well, from the academic perspective, I would say that open source plays a vital role. You know, when we're talking about academic data science, what we're really pushing for is people to use software tools and statistical tools in a very rigorous way, and the best way to do that is to not have to reinvent the wheel every time you're doing something.
You know, if someone is going, like where I got started in this, if someone wants to analyze astronomy...

...data and use a manifold learning package, they shouldn't have to reimplement manifold learning. You shouldn't be reading someone's paper and thinking, ooh, I wonder if they got the details of this algorithm right. We should free researchers to focus on the bigger questions, and those software engineering and kind of specific algorithmic details should be taken care of for them. So in that sense open source is really important. Like, if I go and I see that someone is performing manifold learning and they're using the algorithms in scikit-learn, I know exactly what code they're running, right? Whereas if someone says, you know, I wrote this C++ code and you can find it on my website, I'm not as immediately confident that what they're showing me in their plots is actually the algorithm as I understand it. I'd have to go to their website and dig through their code and see if I can find a bug and see if I'm comfortable with the way they approached things. So for open science, for science to progress in the way that we want it to, you know, people building on each other's research and evaluating it and figuring out what the next step is, I think open source is vital. And that's been one of the really nice things, one of the places where data science has really helped academic research: this open ethos as applied to science. And I gave a talk at PyCon this past summer, and one of the big takeaways that I hope the audience got from my talk is that the academic community has learned a lot from the Python open source community, and adopting the open source practices used by the Python community has really helped academic communities, and in particular has helped the astronomy community.
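That reuse is concrete today: the manifold learning discussed at the start of the conversation lives in scikit-learn, so an analysis can name the exact algorithm it ran. A minimal sketch, with random data standing in for real spectra:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic stand-in for galaxy spectra: 200 "spectra", 500 wavelength bins each.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(200, 500))

# Locally linear embedding: project high-dimensional spectra down to 2 dimensions.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
embedding = lle.fit_transform(spectra)
print(embedding.shape)  # (200, 2)
```

Anyone reading a paper that says "LLE via scikit-learn with these parameters" can reproduce exactly this computation, which is the point being made about open implementations.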
And, as you say, open source software is totally open, totally transparent, reproducible and version controlled, which...

...hopefully all science will be at some point. Right. Yeah, hopefully. I think back to some of the early code that I wrote without version control, and I have no idea how I managed to do it. Emailing yourself a tarball at the end of the day so that when you change something the next day you can get back to it was just absolutely ludicrous. So is there a challenge in educating research scientists with respect to being computational, being able to use programming languages and this type of software? Yeah, definitely. It's a huge challenge, and it's something that, when I was a grad student, was not really being seriously tackled by the universities, and it's still, to some extent, not being tackled. At eScience, that's another one of our goals, along with the other ones that we've chatted about: we want to educate the campus on these data science tools. That means everything from offering Software Carpentry seminars to offering one-off tutorial days where we tell people about AWS or about other specific tools, things like that. One thing that I've been involved with is an interdisciplinary graduate course in software engineering. We offer it through the computer science department, but computer science grad students are not allowed to take it; it's designed for people from the sciences to come in and learn version control and documentation and unit testing and all these things that are really needed for high-quality, reproducible, open code, but that none of the departments really teach very well. And the departments are in a difficult place, right, because if you have a bunch of grad students in astronomy and you want to teach them how...

...to do all these software engineering things, where do you fit it in? There's already a full schedule of classes, and that list of subjects has been honed and fine-tuned over the course of decades. If you want to teach someone about software and machine learning, do you drop stellar structure? Do you drop planetary science? Do you drop the interstellar medium? All those things maybe sound like niche fields, but if you're in an astronomy program you need to know about that stuff to be able to talk to other people in the field. And even more than that, each topic has a professor for whom that's their area, their life's work. So if you say, hey, Professor X, we're going to drop planetary science because we want to teach machine learning, you've got this political problem too. Exactly, and I think that social and political question, that every subject that's taught has its advocates and its staff, is, for multiple reasons, very difficult to navigate when trying to insert new things into a curriculum. So you mentioned version control and unit testing as being two incredibly important things. What other tools and techniques would you encourage budding research scientists and data scientists to learn and play around with? Well, I think the software engineering pieces are key. Definitely the version control and unit tests are big, because that's a way to make sure your code is the same as it was yesterday, and make sure it works the same as it did yesterday when things change. But also just general software design, you know, like learning about object-oriented programming and whether that fits the particular problem that you're looking at.
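A hedged sketch of what "make sure it works the same as it did yesterday" can look like in practice: a tiny pytest-style unit test checked in alongside the code. The function and numbers here are purely illustrative, not from the episode:

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in radians."""
    cos_angle = (math.sin(dec1) * math.sin(dec2)
                 + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    # Clamp to guard against floating-point round-off pushing past [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))

def test_zero_separation():
    # A point is at zero separation from itself.
    assert math.isclose(angular_separation(1.0, 0.5, 1.0, 0.5), 0.0, abs_tol=1e-9)

def test_quarter_turn():
    # Two points on the equator a quarter turn apart are pi/2 radians apart.
    assert math.isclose(angular_separation(0.0, 0.0, math.pi / 2, 0.0), math.pi / 2)
```

With tests like these under version control, a change that alters the function's behavior shows up as a failing test rather than as a silently different plot.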
Learning how to write code that's readable, and practicing that: don't name all your variables x, y,...

z, a, b, c all the time. And then things like code review. That's another thing we've experimented with here: trying to get researchers to review each other's code and give feedback, especially across disciplines. You often find some really interesting things there, because it turns out that what someone's done is a homespun version of principal component analysis or something, and you can point them to a package that does it much more efficiently at scale. So I'd say the general software engineering practices are the biggest thing. There's also the statistics and algorithms side of things, but in my experience, while people have things to learn in both areas, definitely, the biggest weakness in academia is on the software engineering side more than the statistics and methods. Okay, and how much statistics and math do you think people really need to know to get up and running? Yeah, it's a really good question. It's hard to say, because all of these algorithms are somewhat mathematical. Linear algebra is way more important to me right now than I ever thought it would be when I was in college. Yeah. And it's hard to know how best to treat all that stuff, because it's a whole area of study in itself. So I guess I don't have a good answer for that. I agree, and I think linear algebra is definitely super important. A bit of calculus goes a long way; whether people need to know multivariate calculus or not is a different question. But these are things you can learn on the fly as well, right? You don't need to have sat down and taken two years of linear algebra courses doing matrix row reduction and all of these things. It's kind of on a need-to-know basis.
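As an illustration of the kind of need-to-know linear algebra that comes up constantly in machine learning, here is a short NumPy sketch (the data is made up) of the covariance matrix and eigendecomposition that sit underneath methods like principal component analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 observations, 3 features

# Covariance matrix of the features and its eigendecomposition:
# the core linear algebra behind principal component analysis.
cov = np.cov(X, rowvar=False)                  # 3x3 symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Eigenvectors of a symmetric matrix form an orthonormal basis.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(3)))  # prints True
```

None of this requires two years of coursework up front; a working vocabulary of matrices, eigenvalues and eigenvectors is enough to start reading what these methods do.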

Yeah, I think you can learn it on the fly, but it does help to have that background. I did sort of learn linear algebra on the fly when I started down the machine learning rabbit hole, but at the same time I'd had some of it in the past, because you touch on linear operators in quantum mechanics, and I think in an early math class we actually talked about what a matrix and matrix multiplication are. Yeah, so I think, contained in that, it's really a language to become accustomed to. Yeah, I think so. You need some sort of background, some sort of vocabulary to jump into it, but I don't know what the best resources are for someone to learn that if they're coming from a background where they haven't studied it explicitly. Let's now jump into a segment called Rich, Famous and Popular, with Greg Wilson, who wrangles instructor training at DataCamp. Hi, Greg. What do you have for us today, Greg? I'd like to talk about diffing and merging spreadsheets, and about how a little bit of engineering could help make data science more accessible to the ninety-nine percent of humanity who aren't using version control today. Now, programmers often say mean things about spreadsheets and the people who use them, but a lot of their criticisms are misplaced. For example, you'll hear programmers say that spreadsheets are more error-prone than code, but I've never seen any data showing that. Yes, lots of people make mistakes with spreadsheets, but lots of people make mistakes with Python and R as well, and so far as I know, no one has ever actually done a quantitative comparison of relative error rates. And yes, most people who use spreadsheets don't write rigorous tests and checks, but that's not a stone most programmers ought to cast.

And when it comes to reproducibility, I'd argue that using a spreadsheet actually makes work more reproducible, since it guarantees that the data is actually there. So why aren't we all using spreadsheets then? Well, one thing that programs do have in their favor is that they play nicely with version control. In particular, if you and I work in parallel on the same code, we can put our changes in Subversion or Git or Mercurial, use diff to compare them and then, crucially, merge them in a structured way. We can't do that with spreadsheets. The closest we can get is to dump the content as a CSV file and then treat it as text, but then we lose the formulas, the formatting, the charts and everything else. So what are you proposing instead, Greg? Well, Microsoft and LibreOffice both store spreadsheets and other documents as compressed XML. We know how to diff XML trees, and those tools know how to parse and render the content. All we need to do is weld the pieces together so that when version control detects conflicting changes in an Excel or Calc file, it launches a three-pane spreadsheet merge tool instead of telling us that binary files differ. That sounds like a lot of work. I don't think so. The best estimate I've had is that three developers working for eight months could do a credible first version. And even if it takes twice that effort, it would allow literally millions of people (grant administrators, finance officers and, yes, data scientists) to start using version control without throwing away the tools they're already familiar with. We'd be giving them a ramp to drive up instead of a cliff to climb, and honestly, I can't think of very many things that would help get more people started working in a more reproducible and more collaborative way. Thank you very much, Greg. If anyone in the audience is interested in giving this a try, please get in touch. We'd love to hear from you. On top of this, if...

...somebody out there does an analysis comparing the average number of errors in scientific papers produced with spreadsheets versus those produced with programming, send it to us, and if it checks out, we'll have you on the podcast to talk about it. Thanks, Greg, and looking forward to speaking with you again. Thank you, Hugo. Time to get straight back into our conversation with Jake. Something we've been revolving around is this idea of community: a community of data scientists, a community of open source developers. I actually saw that you tweeted last week; I've got it in front of me, so I'm going to read it out (I'm not going to do an accent). You wrote: okay, just want to put this out there, I do my best to say yes to any request to grab coffee and chat about science, careers, code, life, etc. It's an explicit part of my role at UW and one that I deeply enjoy. So my question around that is: what is the role of community in data science in general? Well, I think community is important in science more broadly. The way that you learn things is by networking with other people. This is why we go to conferences, even though we could all just watch YouTube videos and read each other's papers; we still, at great expense, fly large swaths of people across the country to meet together. So it's super important, especially for someone who's just getting started, to be able to chat with someone who's been there before and faced the specific challenges and the specific existential crises that come with being in a particular field. I've found that to be incredibly important in my career and my life more generally, so I hope that I can be in that position to help people where I am right now. And when...

I say it's explicitly part of my role at UW: I mentor students and I mentor postdocs and I'm trying to be that for them. We also have these open office hours at UW, which are kind of fun. I sit at a desk in our common area for two hours a week on Tuesday mornings and it's just wide open; it's advertised on the website and anyone can come talk to me and ask about anything. And because, like I said, I'm typecast as a Python person, I generally get people coming in and asking me Python questions, which is always fun. But it's a cool part of my role here, and the reason I put that tweet out there is that I'd like to be able to do that more broadly than UW as well. Yeah, and I think it's really important, particularly to bridge that gap between established, practicing data scientists and beginners, because if you Google data science and look at the image search, you see layers of Venn diagrams embedded in each other, and I think it can be tough and scary. So I think efforts like this to reach out to beginners and let them know that reaching out is okay, and encouraged, are incredibly important. Yeah. On that note, actually, Maëlle Salmon, who is a data scientist (she's worked in public health, and she's big in the R stats community), told me recently about a conference, I can't remember which one, that had a buddy system where people coming to the conference for the first time would be paired up with established people, who would hang out and walk around with them and introduce them to other people, that type of stuff. Yeah, that's a great idea. It is, isn't it? Because if you go to a conference for the first time, it can be pretty intense. Right. It reminds me of my first conference as a grad student.
You know, I started in the fall of two thousand and six, and in January two thousand and seven was the big American Astronomical Society meeting, and it happened to be in Seattle, so I just had to go downtown. I remember going there and just feeling like...

...a complete outcast. I didn't know anybody, and it started out with this social event the first night, and everyone was there talking with people they knew from all over the place. I knew like one professor who was there, so I just followed her around like a lost puppy for a while, until it was clear that she was kind of annoyed to have me. But yeah, I remember that, and it's good to be reminded of it, because I'm definitely in the place now where I go to a conference and I'm catching up with everybody that I've known for years. So that idea of an explicit buddy system to address it, because it can be, yeah, anxiety-inducing. I think it can, rather than standing in the corner by yourself eating spring rolls. I think the birds-of-a-feather events help as well. For those who don't know, these are tables where people will sit and chat about machine learning, or some type of modelling, or something like that, and newbies are very much encouraged to come and chat with experts there as well. So we've talked around a lot of ideas about data science in academic research, and I'm wondering, in your mind, what are the biggest challenges facing the discipline of data science as we move forward? Well, we've touched on some of them, because there are challenges in just getting people the technical chops to use these data science skills within academia and use them effectively. Absolutely. But something I've noticed as I get further outside of my astronomy world is that it really seems like we need to spend some time thinking about data ethics as well, because there are a lot of issues when you start collecting data...

...about everything in the world. How do you keep people's data private? How do you make sure you're using the data fairly? How do you train your algorithms in a way that isn't going to be biased? There are tons of stories about this in the corporate tech world, and it's interesting coming from astronomy, because we have a very specific set of patterns built around complete openness. We publish all our data, often as soon as the photons are gathered from the telescope. But in other fields you can't really do that. When I work with people from the biological sciences, or particularly the health sciences, there are issues with open data. So how do you do open science in a world where there are privacy concerns with the data? One of the things that was really fun about my postdoc (I did the postdoc in the database group at the computer science department here) is that at the time there were some folks there, and I think they're still working on it, who were working a lot on the idea of differential privacy: how you expose data about the world in an aggregated way that comes with some guarantees about the privacy of individuals. So I think some of those issues are going to be increasingly important as these kinds of data sets grow, and as data scientists, no matter what we're doing, we've got to start thinking about how the data that we're taking can be used and misused, things like that. One interesting little tidbit: I said that astronomy doesn't deal with these data privacy issues, but when you start doing surveys across the entire sky at a high cadence and publicizing that...

...data, those surveys can be used to track objects that happen to be orbiting the earth. And if you find all the objects orbiting the earth and compare them with a database of things that are publicly known to be orbiting the earth, then all of a sudden you've found secret government spy satellites. Wow, that's right. So I know there are people in some of these surveys who have been talking to the relevant government officials about how you screen the data so that it's scientifically useful but doesn't let foreign agents sniff out military secrets. Yeah, right. So the amazing thing is that even in astronomy we're getting to the point where big data is somewhat ethically challenged. So we've discussed a lot of different applications of data science and different methodologies. What's one of your favorite data science things, a technique or methodology or tool, anything along those lines? Yeah, so my all-time favorite machine learning method is principal component analysis. I just think it's like a Swiss army knife; you can do anything with it. And I'm probably in some way influenced in this by my PhD advisor. In the mid-nineties he and a grad student, I think, wrote a paper that was really influential in astronomy, basically applying principal component analysis to these Sloan spectra; it was one of the first really at-scale applications of principal component analysis in astronomy. So when I was a grad student, I quickly found that whenever I went to a meeting with my thesis advisor with a new data set or something to look at, the first question he was going to ask me was, well, did you do PCA? Right? So I adopted that as the first thing that I do with any data...

...set. And the amazing thing about it is that you can use it for dimensionality reduction; you can use it for finding the most important contributors to any phenomenon (you can look at the actual eigenvectors themselves for that); you can use it to reduce noise in data; and you can use it to create a low-dimensional representation of a high-dimensional data set to get a broad idea of the relationships. It's really this incredible tool that can do so many things. And with scikit-learn, of course, it's really easy to do PCA, right? Yeah, it's straightforward. You just plug in the data and go. Although there are some important variants of PCA that are not within scikit-learn, like being able to use noisy data or weighted data, things like that. So we've discussed, moving forward, some of the biggest challenges facing data science as a discipline. In general, though, what does the future of data science look like to you? The future? My hope is that it just keeps getting more and more open. We've seen, with Python and R and Julia and some of these other tools, the power of open source software. It's really snowballed, especially in the last ten years, and the thing that gives me optimism is that we're even seeing that in these academic communities around campus that are used to more closed tools. Like I mentioned IDL earlier: when I started as a grad student in two thousand and six, basically everybody in our department was using IDL, which is this proprietary language you need a site license to run, things like that, and at this point probably ninety percent or more of the people in the astronomy department are using Python. So there's this gradient towards openness in the astronomy...

...astronomy community, and we're seeing that in other academic communities as well. So I have a real hope that this trend towards openness will eventually asymptote to a hundred percent, and everybody's code will be out there and ready to use and reproduce. That's a great vision of the future. And in fact, you mentioned your PyCon talk earlier, and you had a great figure of the incredible growth of Python, the Python hockey stick, I think you called it. We'll link to that talk in the show notes, so listeners can check it out. Yeah, sounds good. So do you have a final call to action or a message you'd like to send to the data science community, aspiring data scientists and well-seasoned data scientists alike? Yeah, one thing I've been thinking about lately is that it's never too late to learn something new. I would say, if you want to learn something, just jump into it. I think about my own path: back when I was twenty, I actually thought my path in life was that I was going to go to a seminary and become a church pastor. That's really what I thought at twenty. Then at twenty-five I took my first astronomy course, and I started learning that and learning coding, and it wasn't until I was thirty that I gave my first Python talk, and I think I'm best known for Python talks now. And even now I keep learning things about software engineering and about these new machine learning techniques, deep learning and stuff like that. You don't have to be worried that you didn't study this stuff in college; just jump in and learn it. There are some incredible resources out there for doing that, whether they're books or videos or tutorials from conferences or just grabbing coffee with some data...

...science expert. That's great advice. And that leads me to one more question: is there anything new you're learning at the moment, or anything in two thousand and eighteen that you really want to learn? Yeah, so at the moment I'm in kind of a fun spot. Just last week I started fifty percent time as a visiting researcher at Google. So one of the huge things that I'm learning right now is what it looks like to work with a large team that's all focused on the same software in the same place. I've done a little bit of that in kind of distributed open source land, working with scikit-learn developers around the world, but I've never been in the situation where I go to the office each day and sit down with the people that I'm collaborating with on a particular software project, and I'm so excited to learn what software development looks like in that context and to see what sort of lessons I can bring back to the academic community and to the open source community. Wow, that sounds really exciting, and that once again speaks to this interplay we discussed earlier between industry and academic data science research. Yeah, and hopefully those connections will continue growing. Right now, if you're an academic considering the future and thinking about industry, it really feels like a one-way door: once you're out, it's hard to break back in. But I'd love it if we could have more of a revolving door, more opportunities for people who have expertise on the industry side to come in and help scientists, and vice versa. Absolutely, and we'll have to have you back on the show to talk about life at Google once you've been there for a while. Yeah, that'll be fun. I'm only about two weeks into it now, so it's still young and there's still a lot to learn. Well, Jake, it was an absolute pleasure having you on the show. Yeah, thanks so much. Thanks for joining our...

...conversation with Jake VanderPlas about the role of data science in astronomy and academic research at large. We saw the challenges posed not only by the amount of data collected and streaming in, but also by the different types of data collected these days. We discussed the importance of the open source development world, the role it plays in scientific research, the need for community in science and the need for a more serious conversation about data ethics. Make sure to check out our next episode, a conversation with Emily Robinson, a data analyst at Etsy. Emily and I will be talking about online experimentation at Etsy, an e-commerce website focused on handmade and vintage items and supplies, and how data analysis and data science are essential to their business. We'll also chat about much more, but you'll have to wait till next week to find out. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @datacamp. You can find all our episodes and show notes online at datacamp.com/community/podcast.
