Rohin Shah

This interview is part of the “A Peek behind the Curtain” interview series.

Rohin Shah is a Research Scientist at DeepMind studying methods that allow us to build AI systems that pursue the objectives their users intend them to pursue, rather than the objectives that were literally specified. Rohin completed his PhD at the Center for Human-Compatible AI at UC Berkeley and publishes the Alignment Newsletter to summarize work relevant to AI alignment.

In this interview, Rohin and I discuss his advice for careers in AI safety, as well as his productivity style, experience with research, and personal career path. 

Note: This interview is from 2020 and some parts may no longer accurately represent the guest's views. The transcript has been edited for clarity and readability.

——————————————————————————————

Lynette: Context for the interview today: I'm trying to do interviews with a number of people at different levels in their careers, to put together collections of tips for people who would like to go down these career paths: how did you get where you are, how might they do similar things, and what weird things should they be looking out for, all that stuff.

Rohin: Okay, sounds good.

Lynette: Cool. For starters, I know that backchaining is near and dear to your heart, and I'd love to hear your rant on that.

Rohin: Oh my God. I’ve got to load up the rant. EA is essentially a community that involves a bunch of people who are trying to improve the world. A lot of them try to do it by research or thinking maybe more generally, just like, "We want to think, figure out some things to do, and then if we do those things, the world will be better." An important part of that, very obviously when you say this abstractly, is making sure that the things you think about matter for the outcomes you want to cause to happen. When you put it that way, it seems pretty obvious that you should have this belief, maybe we could call it a theory of change. 

In practice, it seems to me that what happens is people get into an area, look around and look at what other people are doing, spend a few minutes, possibly hours, thinking about, “Okay, why would they be doing this?” and then they have at least some tenuous connection to outcomes that I wouldn't buy and I don't think they should buy. That's enough for them to get started thinking about the actual narrower question that they've chiseled off for themselves. 

This seems great as a way to get started in a field. It's what I did. But then they just continue and stay on this path, basically, for years, as far as I can tell, and they don't really update their models of "Okay, and this is how the work that I'm doing actually leads to the outcome." They don't try to get a better sense or try to robustify that argument or look for flaws in it, see whether they're missing something else, figure out what other things need to be true in order for this argument to go through and maybe try making those other things also be correct or looking there as another piece of the puzzle to do work in. 

Most of the time when I look at what a person is doing, I don't really see that. I just expect this is going to make a lot of their work orders of magnitude less useful than it could be.

Lynette: I'm curious if you can share an example when you did this, maybe your last career decision, maybe something else. Walk us through how you went about that process?

Rohin: I don't think I did it during my career decision, I think I did it more after. Not my most recent career decision, but the pivotal one for me was when I decided to switch into AI safety. I knew very little about AI safety when I made this decision. Basically, the impetus for it was that I thought about population ethics a bunch, became more convinced that the total view of population ethics was something that I should put a decent amount of weight on, which in turn implies putting a lot of weight on longtermism, which in turn implies reducing existential risk is very important. 

I got to there and I was like, "Given my particular capabilities, I would be a really, really good fit for AI safety. I don't really buy the AI risk arguments, but I definitely can't rule them out, and I haven't spent that much time engaging with them. I'll just start working in the area and learn more." So I started at CHAI. 

I was lucky enough to be able to get into CHAI basically immediately. There was maybe a month gap between "Okay, I want to do something longtermist" and starting work at CHAI. It was an extremely fast decision, at least by my standards of how fast career decisions should be made. When I started, I had no clue, basically, how my work was going to lead to outcomes that I care about, no idea of what I should be thinking about or should be researching. 

I think what I ended up doing is I spent maybe 50% of my time just reading existing things that people had written in AI safety. I used Victoria Krakovna's list of AI safety resources as a starting point. 

By the end of the first week, I had an idea for a thing to work on. It later turned into a paper about reward inference when you've got a biased demonstrator. I won't go into the details, but I had some very tenuous idea that the goal in AI safety was "Learn the human utility function." A challenge to that was that humans are biased, so you can't just assume that they're optimal and then infer the reward function from that. So, how do you deal with the biases? I had one particular technical idea for that, so I started researching it. 
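
To make the underlying issue concrete, here is a minimal, hypothetical toy sketch (illustrative only, not from the paper mentioned above; the reward functions, the bias, and all numbers are made up): a learner that assumes the demonstrator noisily maximizes their reward will confidently infer the wrong reward function when the demonstrator is systematically biased.

```python
# Hypothetical toy illustration: Bayesian reward inference that assumes the
# demonstrator is Boltzmann-rational (noisily optimal), applied to a
# demonstrator whose bias makes them act against their true reward.
import numpy as np

actions = [0, 1]
rewards = {"R_A": np.array([1.0, 0.0]),   # R_A prefers action 0
           "R_B": np.array([0.0, 1.0])}   # R_B prefers action 1

def boltzmann_likelihood(action: int, reward: np.ndarray, beta: float = 2.0) -> float:
    # P(action | reward) under the "noisily optimal demonstrator" assumption.
    logits = beta * reward
    return float(np.exp(logits[action]) / np.exp(logits).sum())

# Suppose the demonstrator's true reward is R_A, but a systematic bias
# (myopia, a wrong belief about the world, etc.) makes them mostly pick
# action 1 anyway.
observed_actions = [1, 1, 1, 0, 1]

posterior = {name: 0.5 for name in rewards}          # uniform prior
for a in observed_actions:
    for name, r in rewards.items():
        posterior[name] *= boltzmann_likelihood(a, r)
total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}

# The optimality assumption confidently concludes R_B, the wrong reward.
print(posterior)
```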

Crucially, I did this for maybe 20% to 50% of my time. It's a little hard to remember, it was three years ago. I continued spending 50% of my time just reading or doing some other form of learning about AI safety for the next, I want to say, six-ish months. I think by three months in, I had decided that actually, this approach I was thinking of was not likely to work, so probably it wasn't that important. I ended up finishing it anyway, out of a principle of: it's good to finish projects because, as you go through them, you often think that they will be bad or not that useful, but sometimes they'll be impactful in a way that surprises you. At the very least, you want to tell everyone else, "This is what I did, here's why I think it's not that good, and you shouldn't be thinking about it." 

I think the biggest thing here was, I was just reading a bunch, learning a bunch from reading all of these papers. It was that that allowed me to realize that the idea that I was working on was not actually very important. It wasn't the working on the idea that told me that.

Lynette: Were you doing anything besides reading papers? Reading other books, textbooks, anything like that, or doing any writing, even if it's just, "Here are my ideas, here are notes on stuff"?

Rohin: I talked a lot with CHAI grad students. I did more reading by time spent, but I did still spend a lot of time chatting with CHAI grad students, most of whom at the time knew more about AI safety than me. For that paper in particular, I especially benefited from a conversation with Dylan Hadfield-Menell, who was a senior grad student at CHAI. I was probably spending maybe 5% to 10% of my time just talking with CHAI grad students and trying to figure things out. I would have spent more time talking with mentors, had there been any mentors to talk to. I talked to my advisors, but Stuart wasn't that available at that time, and my other advisors, while they're great for AI research, don't think about x-risk as much.

Lynette: How did you come out of this with a theory of change? What was formulating in your mind?

Rohin: That's hard. I think I didn't really get to that point until maybe two-ish years after my start at CHAI. As to how, I think most of that time was spent building (I wouldn't say I was explicitly doing this, but implicitly that's what was happening) a model of how the future progresses. What are the key points? What are the key problems? What are the key decision points? Where is there leverage to affect the future? 

That was a small part of it, really. Then there was also, okay, how does AI work? What does the future of AI progress look like? It's sort of the same thing but specifically for AI, what are the different problems people are concerned about? How do they relate to each other? Take two very different people who both say things about AI safety and try to figure out how both of the things that they say can be true at the same time or conclude that one of them is just wrong; I did that operation quite a lot. 

Really, most of this looks like building an internal model for what powerful AI systems will look like and also, to a lesser extent, how, if we have powerful AI systems, they will be used. Once I had that model—I mean, I wouldn't say I have that model now, but as that model got more refined, it became relatively obvious where the intervention points were and what the theory of change for different interventions would be. So, I don't think the theory of change was ever very hard. It feels like the difficult part was having the model in the first place.

Lynette: I'm wondering if you could just give a step-by-step, on-the-ground detail for one node of that model, like what papers you read, what conversations?

Rohin: Okay, let me think of a good example. A pretty central idea that I think about is what we might call inner alignment. Let me even not explain what that is yet and just go through the process that ended up in having this concept. I read some very esoteric posts on AI safety. Esoteric in the sense of crazy-sounding and weird and quite hard to understand. I think one is called The Universal Prior is Malign by Paul Christiano, and then there was Optimization Daemons, which is an article on Arbital, presumably by Eliezer although Arbital often doesn't say who wrote the article. 

I read it. I did not understand it. I was very confused by it. Notably, I was confused by it; I was not like, "This is wrong." Usually, I read a thing and I'm like, "Probably right" or "probably wrong," and then sometimes there is utter confusion. There can be some amount of confusion with the "probably right" and "probably wrong," but usually, it's not huge. Whereas here, I was just very confused. "Should I believe this? Should I not believe this? Is it likely?" That was my original state. 

I then spent some time trying to resolve the confusion by thinking about how I could instantiate an example of an optimization daemon in a neural net, because neural nets are things I understand. I can write code for neural nets. With neural nets, you can just ask, mechanistically, "What happens after you follow each line of code?" It's fully mathematically determined. Give me enough time, and I can understand anything about a neural net. Enough time might be a long time, but hopefully, for a small neural net, it's a reasonable amount of time. I tried doing that, did not succeed. Went to a workshop, talked about this with other people at that workshop, tried doing it with other people at the workshop, still didn't succeed. I think there I was like, "Okay, it seems like, in the abstract, some of these arguments make some amount of sense, and it feels like it could maybe happen, but I can't really figure out the concrete mechanism by which it happens." 

I started a project. Well, "project" is maybe stretching it. I started working with Owain Evans, who at the time was at the Future of Humanity Institute. This probably went on for, I don't know, 10 to 20% of my time for four to five months. The idea was: "Okay, making a whole daemon seems pretty difficult. Can we make a part of one?" We were thinking about the MNIST task, which is a digit classification task. Can we make a neural net that consistently classifies threes as eights, and even when you train it on good data, still classifies threes as eights? So basically, something that persists? It makes a mistake, and it keeps making that mistake, even in the face of selection pressures that are trying to eliminate that mistake, which is a key feature of optimization daemons. Worked on that for a while, didn't get that to work either, but I think at that point, I gave up on the "make an optimization daemon via neural nets" path and instead was like, "Okay, this is not actually a good way to make progress on this project of understanding what the hell daemons are and whether or not they can arise." I shelved it and worked on other things. 
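
For concreteness, here is a minimal, hypothetical PyTorch sketch of the kind of experiment being described (this is not the actual project code; the architecture, hyperparameters, and two-phase setup are my own guesses at a simple version): instill a "threes are eights" mistake, retrain on clean labels, and measure whether the mistake persists. As noted above, the naive version of this did not yield a persistent mistake.

```python
# Hypothetical sketch, not the original project code: does a trained-in
# "3 -> 8" mistake survive retraining on clean data?
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_loader(poison: bool) -> DataLoader:
    ds = datasets.MNIST("data", train=True, download=True,
                        transform=transforms.ToTensor())
    if poison:
        # Phase 1 data: relabel every 3 as an 8 to instill the mistake.
        ds.targets = ds.targets.clone()
        ds.targets[ds.targets == 3] = 8
    return DataLoader(ds, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train(loader: DataLoader, epochs: int) -> None:
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

train(make_loader(poison=True), epochs=1)   # phase 1: learn the mistake
train(make_loader(poison=False), epochs=1)  # phase 2: clean, corrective data

# Check whether true 3s are still classified as 8s after the corrective phase.
test = datasets.MNIST("data", train=False, download=True,
                      transform=transforms.ToTensor())
threes = test.data[test.targets == 3].float().div(255).unsqueeze(1)
with torch.no_grad():
    preds = model(threes).argmax(dim=1)
print("fraction of 3s still classified as 8:",
      (preds == 8).float().mean().item())
```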

I was continuing to read stuff in the meantime and talking to people, so I talked to a few people at MIRI, and I often talked to them about optimization daemons. At some point, I got convinced that this was basically just the robustness-to-distributional-shift problem in machine learning. We discussed this during a CHAI meeting at some point and got some good feedback there. I think what eventually changed things was that, on a separate thread, I was learning more about how I expect AI systems to work. I mostly still agree with the point that daemons are just a special case of robustness to distributional shift. As I learned more about neural nets, that model got fleshed out more and more, and I got to the point where it was fleshed out enough that I actually believed it. 

I still have not constructed an example in a neural net. Now, I think this is mostly because neural nets are not sufficiently capable, at least at academic levels of compute, for me to make one, but something analogous could be shown in the future, and I hope to show it at some point. There was also a MIRI paper on this topic, which was also helpful, though I think I had gotten most of the way there before that paper, just from talking to people and thinking about it myself. I feel like I haven't actually analyzed what the steps were.

Lynette: I have a model of what you're doing, and I'm going to run it by you and see if that sounds right. You started out with a year or two of just intense learning about the field in general, trying to understand it and build your models, and then after you'd gotten to the point where a lot of it made sense, when something just sparked utter confusion, you spent more time going deep into that. You started with a pretty small exploration, you were just thinking about it yourself, then escalated a bit, talked to more people, and kept going until you'd escalated it all the way to doing a project on it with 10% to 20% of your time for a good few months. At that point, you still weren't getting anywhere, so you shelved it and have been thinking about it off and on since, as you get other information, but you're not putting most of your time into it. Does that sound like the right arc?

Rohin: Yes. I should note I did not do two years of learning before this, I did two months. I think I started doing this in, I want to say, November or December of 2017, which was two or three months after I started at CHAI. Two years after I started at CHAI, I had a good model of this, I was no longer confused by the concept.

Lynette: I'm curious, these models you're making, how are they connecting to your day-to-day activities? How much of your time is directly chosen by activities that you think connect to one of these nodes versus being in more reactive mode or just following what's exciting?

Rohin: I think for this specific example, it was almost always following what's exciting. Exciting might be the wrong word, following what's confusing might be more accurate. I think when I hear people talking about following what's exciting, it feels like it matches what I'm calling following what's confusing. The reading was definitely a deliberate choice. It was also very exciting. I very much enjoy learning giant amounts of stuff. That was the first time I'd gotten to do it since undergrad. A new field is just so much knowledge to be ingested, and it's less true now because now I have good models of the field, and each new paper is a fairly incremental piece on top of it. Arguably, all of it was following what's exciting, but it definitely was a deliberate strategy to learn a bunch of stuff. 

The part where I connect these models to things, like “And now I want to learn about inner alignment or optimization daemons”, that part was much more like, “Notice confusion or notice opportunity or something like that and then do a dive into that.” I feel like usually, what I'm doing is doing a bunch of generically useful stuff like reading papers, executing on some projects I've already chosen, and waiting for inspiration to strike. Inspiration can come in the form of me just thinking and coming up with some idea, or it can come up by me reading a paper and being like, "Oh, man, that's cool, but they should have just done it this way. It would have been way better. What do you know, I'm going to do that." It could be reading someone saying confused things about some topic and being like, "Wow, that is confusing. Let me not be confused about that anymore" or talking to someone-- Lots of ways that inspiration can strike. I think a lot of my strategy is, make sure you always have free time to do things when inspiration strikes.

Lynette: Hearing you, it sounds like you're very excited about most of your work. I'm curious if there are counterpoints to that: is there work that you have to force yourself to do or that feels like drudgery?

Rohin: Yes, definitely. I think it's definitely become less exciting after-- I've been describing those initial years at CHAI, and those were, I think, the most exciting just because there was so much learning. I think as I became more experienced and more senior, it's become a bit less exciting. The act of writing papers, it's fine, I wouldn't call it exciting. I do write a lot but all of that writing, it's not painful, but it's not exciting either. That one is definitely much more of a "this is the way you have impact" type of motivation. 

What else? I feel a bit bad about saying this, but when people reach out to me being like, "I want to get into AI alignment." The first few times, this was super-exciting when they asked for my advice. Now at this point, I'm like, "Oh, man." It's not actively bad or anything, but it's not actively good either. I could tap into excitement about this thing like, "Oh man, there's another person who is helping save the world,” but the actual motivation that gets me to do this is more a habit of actually responding to these and giving people advice. It's more just execution of that habit than feeling excited by it and doing it because of that.

Lynette: Do you think that being good at what you're trying to do for you, personally, is strongly correlated with being excited about it? These parts of this that feel like drudgery, do you feel like you can do those as well as the parts that you're more excited about?

Rohin: That's a good question. I would guess, just a priori knowing me, that it would be pretty correlated. I think I am reasonably good at analytic writing. I don't think I am very good at giving people advice, which is maybe not the right thing to be saying in an interview that is meant to be advice for people. [laughter]

Lynette: Maybe we can let you give some generic advice here and then you don't have to give it to all the individual people.

Rohin: Amazing. I would guess that it’s correlated. Writing feels like a weak counterexample. I do think I'm reasonably good at the sort of writing that is needed for papers and intellectual progress style blog posts. I also enjoy teaching. Writing is like a specific kind of teaching. If you want to explain yourself well, that's a pretty similar skill to teaching. I do think I am reasonably good at teaching. The correlation could very well be quite high.

Lynette: If you're looking at yourself and other people that you know well enough to answer this, do you think that not being intrinsically excited about your work or the field is a huge handicap? For people who are motivated by wanting to do good but don't find it intrinsically exciting, how big of a handicap do you think this is?

Rohin: My intuitive answer is "man, that sounds like a huge handicap if you're trying to do anything that involves research-style thinking," though I think that's probably an overreaction. Let me think about specific other people. I do have trouble thinking of examples of people where I know that they are not very intellectually engaged by their work, yet I think they are doing excellent work. But also, there's not that many people where I would know one way or the other if they fit in that bucket. I think overall, I endorse the reaction, but not the strength. It seems like a pretty big handicap. I wouldn't be surprised if it just basically forced you to be below the 80th percentile. Like, just that fact alone tells me you're below the 80th percentile with very, very high probability before even updating on, I don't know, intelligence or some other factors. I could also see it being "it's a handicap, but it depends on the person," where on average it's a handicap, but there are definitely some people where it just doesn't make any difference at all.

Lynette: Cool. Getting into the weeds of advice here, for people trying to get into AI. Let's say they're just starting, they've heard about this and are like, "Okay, this sounds important," how would you recommend they start out?

Rohin: I feel like despite the fact that I said I started out doing a lot of learning, if I were to go back and do it again, I would allocate more of my time to learning, not less.

Lynette: You started at 50%, you said or similar?

Rohin: Yes.

Lynette: Where would you try and start at?

Rohin: In most cases, if you have a job, you're not going to get the slack to even do 50%, but if we ignore that outside factor and somehow assume that you have infinite slack, I don't know, 80% or 90% seems pretty reasonable to me. I think so. Especially if you're an incoming PhD student, I would also count “learning how to do research” as learning, and that's an important skill. If you're thinking about, "Okay, how much time do I spend on research versus just figuring out what the hell AI safety is?" I'd be more like, "Okay, when you do your research project, you're learning how to do research." That's a pretty important skill. Personally, I already had most of that skill in that I'd done like three years of a PhD in a different area before that, so for me, I think 80% to 90% on just learning AI safety was about reasonable, probably better. For an incoming PhD student, I think a 50/50 split would probably be about right.

Lynette: How much of learning should just be reading versus taking notes, trying to build little toy models like, "Well, maybe this causes this, so I should read these things," or anything like that?

Rohin: I see. I definitely took notes, that is included in my 50%. I wrote toy models but more when inspiration struck, so it was not a significant portion of my time. Inspiration does not strike very often.

Lynette: Did you deliberately set aside any time for thinking about the big picture and trying to grok how things fit together?

Rohin: I don't remember doing that, but I might have. I think if I were to go back and do it again, that would probably be a good idea to do.

Lynette: Got it. Do you have any particular reading lists or places where you recommend people start, normally?

Rohin: Oh, man. This really is a thing that should exist but does not exist. I used Vika's AI safety resources. It still exists. It has become bigger. It still is a reasonable place to start, but it's big and it's not obvious how to prioritize within it. There are a few sequences on the Alignment Forum that are recommended. Those are pretty good. Those seem like reasonable places to start, but it's still not really an overview; they are just specific sub-parts that I particularly like. The book Human Compatible is pretty good, but again, it's one specific perspective within AI safety.

Lynette: Cool. The other things that I've heard are the curated sequences?

Rohin: Yes. That's the one that I mean.

Lynette: Of course, the AI newsletter, as ongoing reading?

Rohin: Oh, yes, right. [laughter]

Rohin: I think I actually don't recommend people look at the Alignment Newsletter as much as those other things, though I do think it's worth subscribing to. The newsletter is good for getting a sense of what people are currently doing and also, to some extent, a sense of what I think is interesting and care about, and what I have expertise in. There's lots of stuff I think is interesting that doesn't make it into the newsletter because I just can't feasibly follow that area of research, summarize it, and talk about it in the newsletter. There's a bunch of selection effects. It's like if someone wanted to understand geopolitics: reading the news is important, but before you go and read the news every day, you probably want to start with some textbooks on geopolitics. That's how I feel about the Alignment Newsletter.

Lynette: Good to know. Let's say someone, they've started doing this learning. They've been reading a lot. At what point should they be trying to replicate papers themselves or do some project or otherwise get directly into work, and how do they tell when they should be trying to start doing something?

Rohin: I think you should probably start trying to do something nearly immediately. I started after one week. I got lucky, inspiration happened to strike in the first week. That does not happen at that frequency anymore. I think that was just pure random luck, so I wouldn't anchor on one week. As soon as you have an idea that's worth thinking about, that feels exciting to you, I would probably pursue it, just because in any sufficiently complicated enterprise, there's an amount you can learn by peering in from the outside and there's an amount you can learn by being on the inside and actually chipping away at the thing. To take a recent example that has been on my mind: it's very, very possible to notice all the problems with academia from the outside and to write lots of good things about all the outcomes of academia that are bad. But if you want to try to change academia or make recommendations about how the system should be, I just do not expect those recommendations to be very good unless you have spent some time in academia. Similarly, if you're just learning about AI safety, you get a good sense of the space of problems that people are thinking about. But if you don't spend some time attacking one of the problems yourself, my guess is that you don't get some important felt sense or intuitions about what's challenging within a problem, or what's the best way to go about solving one. Something like that.

Lynette: When you're saying "attack a problem," are you saying, try and do novel research or pick an interesting paper and try and replicate it as a starting point?

Rohin: Try and do novel research. I think the reason to replicate a paper is to understand the techniques in that paper, which can be important, especially for incoming PhD students or people who are starting in AI alignment. One thing I want them to know is just, how does deep learning work? This is not novel research.

Lynette: A lot of the people that I've spoken to about this are also often pretty new to ML, figuring out how that works.

Rohin: Yes. I do think they should be replicating papers almost immediately. Probably, they should be learning a little bit about ML, they should know what a training set is, they should know what overfitting is, they should probably understand the concept of cross-validation. First off, there is replicate a paper and there's do course projects from an actual machine learning course. Probably, they should start with course projects from a machine learning course. There are several that exist online. I don't know which ones are good, but I'm sure some of them are. That's like a clear first step. 

The reason to replicate a paper is that course projects are designed to be easy to complete pretty quickly and don't give you a felt sense of all the things that you need to do in research-style machine learning. Replicating a paper is the closest you can get to learning how to do research-style machine learning without needing to have a novel idea. The skill I think you're practicing there is "how do you do machine learning in a context that no one else has done machine learning in," except one other person has done machine learning in that context, and they didn't give you good instructions on how to do it, because papers never include that.

Lynette: Got it. This is a hard question, but I know a lot of people are trying to get into AI safety. They're exploring this. I don't know how long they should keep trying, how much they should be learning before they're either applying to jobs or deciding they should do something else. How would you make those decisions? How would you say, "If you don't pass these filters, let yourself off the hook and go find your comparative advantage"?

Rohin: It's hard to provide general advice for this. I'm going to give two caveats. The first caveat: luckily, I have never been in this particular position, which is great for me but means that I am very much not the best person to give advice on this, having never experienced it. I am generally skeptical of advice that comes from people who have never experienced the thing they're giving advice about. The second caveat is, for any given person, if they find themselves not having a good fit for AI safety, there are two updates they can make. You can make an update that AI safety, in particular, is not a good fit; that's a comparative-advantage-style update. Or you could make an update on your baseline level of intelligence or something, which is an absolute-advantage-type update. I'm going to assume, because the framing of the question assumes it, that you're making a comparative-advantage-style update, but you should definitely be thinking about whether it's actually comparative advantage or absolute advantage that you should be updating on.

Lynette: If you have different criteria for how to tell them apart, that would be really interesting too.

Rohin: Yes. I'm really just reasoning from first principles on the spot, so take this all with massive grains of salt.

Lynette: Teaspoons of salt being added.

Rohin: Is teaspoons enough? Maybe teaspoons is enough. The first thing is, "Is it exciting?" I would be pretty surprised if people could not find things that were exciting to them. At least for the distribution of people in your audience, I would expect them to be able to find something that is intellectually exciting to them. If you're not excited by AI safety, I would expect that to be a pretty large handicap that you wouldn't have elsewhere. Oh, an earlier point I should have made is that even an absolute advantage update of the form, "Maybe I'm not that intelligent or something," is still a comparative advantage update toward some strategy like earning to give, which does not really have a minimum intelligence bar, or at least it scales much more nicely with intelligence than research does. The impact of research tends to be pretty heavy-tailed with researcher ability, which probably but not certainly correlates strongly with intelligence. Anyway, I'm super-disjointed.

Lynette: The question is, how would they tell that they should make this update and move on to something else?

Rohin: Excitement seems like a good one. I think using earning to give as a benchmark to compare against is a pretty good method. I don't know, maybe you value being in the 90th percentile. If you were in the top 10% of AI safety researchers, just choose some number for it. I don't really know what number you could choose; it depends so wildly on the strength of your beliefs, but call it a million dollars per year, which is a reasonable number you could have. Then you can make a distribution of how much money it is worth to be at the Xth percentile of AI safety researchers. That gives you a quantitative way to compare to earning to give. Hopefully, people have a pretty good sense of what their earning potential is. That's one way, if you're particularly impact-focused and you are willing to take an earning-to-give job.
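
As a purely illustrative sketch of the kind of back-of-the-envelope comparison being described here (every number below except the million-dollar figure is made up, and the percentile buckets are arbitrary):

```python
# Illustrative only: compare the expected value of doing AI safety research
# against earning to give, using your own subjective numbers.

# How much a year of your research would be "worth" (in dollars of donations
# you would trade it for) if you landed at each percentile of researchers.
value_if_at_percentile = {"90th": 1_000_000, "50th": 100_000, "10th": 10_000}

# Your honest probability of ending up at each of those levels.
p_at_percentile = {"90th": 0.05, "50th": 0.40, "10th": 0.55}

expected_research_value = sum(
    p_at_percentile[level] * value_if_at_percentile[level]
    for level in value_if_at_percentile
)

expected_donations = 150_000  # what you think you could donate per year instead

print(f"research: ~${expected_research_value:,.0f}/yr vs "
      f"earning to give: ~${expected_donations:,.0f}/yr")
```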

Lynette: How would they benchmark the percentile they would be in apart from applying to places and getting jobs? Most people are shooting in the dark. One is, maybe could they replicate papers within some amount of time, a week, a month? I don't know whether you benchmark that as a good sign or bad sign. 

Rohin: Here's the thing. I've never actually replicated a paper myself, so I don't actually know how long it takes. [laughter] More accurately, I have sometimes tried to replicate papers and gave up. I gave up too quickly because I had a specific use for replicating the paper, and I tried for a week and then was like, "Well, it doesn't seem like I'm getting anywhere. I'll just give up." I don't really know. 100 hours seems like probably enough, assuming you have the right amount of compute. It also depends on the paper, on how much starter code you had access to, and on whether you allowed yourself to look at the author's code. I don't think I can give you a number for that one. But I agree that's a reasonable test of whether you should be going down the ML track of AI safety. There are other tracks in AI safety; there are conceptual, philosophical contributions. The career prospects for that are not as good. You have to build your own career. I am hesitant to recommend this as a path, but it is a path that exists. For that one, I think the test would be: try clarifying some currently conceptually confusing things in AI safety, write a post about it, post it to LessWrong or the Alignment Forum, and see what its reception is.

Lynette: People go through this, they do some reading on the field, and then they're applying to jobs. Let's say they're either currently not skilled enough, maybe they're too new to ML, or they're currently not qualified enough to get into the canonical orgs that come up whenever EAs talk about AI safety. Are there other places that are good for skill-building they could go?

Rohin: PhDs are an obvious one, and I think being able to get into a reasonably prestigious ML PhD is a pretty good indicator of skill. Well, it's a pretty good indicator of something. Do I believe that it's skill? I think it's definitely a good sign; I don't know exactly how good a sign it is. I think if you get into a place like Berkeley or MIT or Stanford or CMU, that is, in fact, a good indicator of skill. I don't know as much about application processes for not-top-five universities.

Lynette: What about industry jobs? All the big tech companies often have an AI lab.

Rohin: Yes. If you can become a research engineer at one of those places, that seems pretty reasonable for skill-building. You presumably can't become a research scientist, or one of the other paths would have been open to you, probably. But a research scientist would also be great. If you can become an application engineer of some form, I think it depends.

Lynette: What about government jobs?

Rohin: That seems good for AI policy or AI governance. I would be pretty surprised if it was relevant for technical AI safety. There are some government-affiliated think tanks that do AI research. I think these exist; I don't know how good they are. I would neither recommend for nor against those.

Lynette: Is it valuable to produce something public-facing, like "Here's a replication of a paper I did," if you're trying to explore this and then applying for jobs?

Rohin: Yes, that is definitely valuable, just in that it's good for your CV. I think one example is Matthew Rahtz, who was a CHAI intern. I think when I was evaluating CHAI intern applicants, I looked and just saw his name and was like, "Ah, I remember reading a replication of deep RL from human preferences that I think was by this guy." I checked and it was, or possibly he mentioned this on his application, one or the other, and like, "Man, I feel like I'm willing to admit him just on that basis." This was not just the fact that he had replicated it, the blog post was also quite good. We still did do the rest of the application process as a sanity check, making sure that nothing else came up, but I felt more confident about Matthew than about basically any other candidate during that application process.

Lynette: For a bit spicier question, what things do people in AI safety not say aloud? What's not talked about?

Lynette: I'm curious, what tracks are there in policy that people could be working on? I have a sense this is super important but almost no sense of what a person could practically do.

Rohin: Sure. One second. Let me make sure I didn't forget to say something on the previous question. I think I should note that this is pretty conditional on my belief that AI safety is not the end-all, be-all cause, which some people do believe it is. I think that is not the majority opinion; I think mine is probably the majority opinion, that AI safety is important, but there are other causes that could plausibly be similar in importance. There is your uncertainty caveat for that question.

Lynette: Sure.

Rohin: Cool. I watch people doing AI governance and policy and think about it somewhat, but you should totally ask an actual researcher in the area. Some thoughts, though. If I were going into AI strategy or governance or policy or whatever you want to call it, AI governance I think, I would be interested in the question of, "To what extent is the goal to give decision-makers accurate information about risk, versus, assuming they had accurate information about risk, to develop mechanisms that allow people to coordinate given that accurate information about risk?" That's a decomposition you could do of the governance question. A lot of work is done on the "Assuming people want to take actions to reduce risk, what should they do?" side. So, there's work on how you can do verification of claims about software in a way that's compatible with governments using it. How do you monitor all of the hardware that exists such that you can enforce a law of the form, no one can use more than X amount of compute without first going through this review process? I had another example and completely forgot it. There's been a lot of work on that. That's also something that people could work on. 

I'm more interested in the question of: all of that assumes that people want to take these actions. You're not going to get anywhere if people don't think that there's any risk to be mitigated. People currently don't think there is risk to be mitigated, and it's not obvious that they will come to think there's risk to be mitigated. Maybe if we get the information to them such that they have accurate beliefs about the level of risk, however high or low it may be, coordination would just happen. I don't know. I don't think you can be quite that optimistic that coordination will totally happen if they have the right information, but I could see pretty high levels of optimism. In which case, nearly all of the impact is in getting the right information to the right people and producing that information in the first place. It seems like an important research prioritization question within AI governance. 

I think the best questions in AI governance, at least as of a year ago, were of a similar form, which is: what research should we even be doing? How do we break apart the problem into questions that can then be individually answered by a person going and thinking about it for a long time and reading a bunch of stuff? It's possible that enough progress has now been made that this is no longer the best thing to do on the margin.

Lynette: Changing tracks a bit here. One of the things that I'm getting with these interviews is a lot of examples of how people have done things successfully. I'm curious if you've done a similar process in your own life of looking at successful people, either asking them or copying things that they do to try and narrow down the paths that you could follow to success.

Rohin: Not in the traditional sense, I don't think. I have had a very weird path and also was not super focused on optimizing my career until fairly recently, like four or five years ago. I think an example of this that's a little different is when I was trying to build my models of how AI will work. One of the major things I did was ask, "Can I take some person who's already working in AI safety and has written a decent amount about their views? Can I inhabit that person's perspective and explain why they have these views?" Frequently, I couldn't, but sometimes I could. I think even when I couldn't, the act of trying gave me a lot of information. Imagine there's a giant haystack of possible things you could think about AI safety, and then there's this one needle that's Eliezer Yudkowsky. You're like, "Okay, well, I don't really know exactly where Eliezer's needle is in the entire haystack, but I know it's in this tiny little portion over here based on things that he's written." That already narrows it down a lot, and then I can think about, "Okay, well, it can't be this particular view because that makes no sense. Maybe it's this other thing." It narrows down your search space a lot. Rather than sifting through the whole haystack, you're like, here are these 10 different spots where it could be out of millions of possible spots, and you search only those 10.

Lynette: Got it. Thanks. Did you ever do anything to verify your guesses that you made about a person compared to future papers they publish or anything like that?

Rohin: I looked back at these a year later and was often like, "Wow, that's clearly not what they believe." I didn't actually verify it with the people in question, partly because they're busy, but partly because I didn't want to be given the answer. I think you get much worse beliefs if you are given the answers to the questions you ask. Whereas if you come up with the answer to the question you're asking, you will, along the way, build a bunch of important beliefs that you wouldn't if you were just told the answer. I think it's a pretty standard pedagogical result, but maybe not.

Lynette: Going a bit further back, if you could send a list back in time to your freshman college self, what general advice would you tell yourself and why? This isn't the deepest person type of advice; this is the generalizable.

Rohin: I think freshman me was not particularly optimizy. It wasn't so much that I wasn't an optimizing type of person as that I didn't have anything in particular I wanted to optimize for. In that sense, probably the best thing to do would have just been like, "Here is EA. This is it." Or even just, have you considered taking actions that maximize societal welfare? Or have you considered altruism, but effective?

Lynette: After you have a consequentialist freshman, what would you tell them then?

Rohin: General advice.

Lynette: Imagine you wanted to come out of undergrad as best positioned as you could to go and do useful things.

Rohin: Right. There's a bunch of information I could give, like "when you get confused about population ethics, go read Arrhenius's paper on it." That would have sped things up a lot. Actual policy-level things I could say that weren't very dependent on the particular path I ended up taking are not that obvious. I definitely spent a lot of time on being very good at coursework, and I'm not sure that was wrong. I don't know, it's a pretty common belief among EAs that you should just min-max your courses, and this is plausibly just correct for most people. But I actually did benefit from being very, very good at courses, among other things. Even to this day, I suspect I could still pass most of the CS courses I took without any studying, if I got an exam on the material that I was taught then, not the material that's being taught now, which is quite something. I definitely would not get an A, but passing, I think I can manage. This has given me a broad base of knowledge for CS that has been useful in ways I wouldn't have been able to predict. So I don't think I want to advise freshman me to min-max on courses. But also, I remember my courses, and I infer that most people don't. Most people should probably min-max their courses. 

What are other things that people would often say? Maybe I would tell freshman me to learn more about the world. I feel like a lot of the stuff I think about now is somewhat abstract modeling of the world, plus just looking at the world and predicting how it will go. I don't think I'm all that good at this because I never practiced it. Growing up, including in college, I just did not keep track of current events. I don't think current events exactly was the right thing to follow, but studying history possibly, studying evolution, psychology maybe. There are lots of other ways to understand the world. I'm currently going off of half-formed intuitions, plus reading random blog posts, plus talking to people who have actually studied it more. Presumably, I could be a lot better if I had actually learned about that stuff earlier. It's not obvious that I should have done that rather than just charge as fast as I can into AI safety research. Having the broad base of CS knowledge, I'm pretty happy with; I would recommend doing that again. Broad world understanding is a much, much larger endeavor and can only be done at a very shallow depth. I'm not sure that's worth the opportunity cost.

Lynette: An open question that a lot of people have, particularly people doing PhDs, is how do you go about doing research that will actually produce useful results, given that you're going into it not knowing what the end result will be? That's kind of the reason you're doing the research. How do you go about it to produce something that's more likely to be useful? I'd be curious if you could walk us through something you did, maybe a paper you liked. What was the process you went through? 

Rohin: A, I don't think I have some magic answer to this. Any of my research projects, maybe I'd give them a 10% chance of mattering, 20% perhaps if we're talking about specific research papers. There are other things I've done, like the Alignment Newsletter, or some conceptual clarity type stuff, where I'd give it a higher chance, but for actual papers in areas that people have already thought about, not very high.

Lynette: I have a feeling that is still maybe better than the average for academia.

Rohin: I think it is. I definitely do think my papers have, especially ex ante, a higher chance than your typical paper. Part of this is just bothering to backchain at all. I frequently read papers where they've got this introduction and motivation section, and you read the motivation and you're just like, "This cannot possibly ever work in the real world," before you even get to their method, which is going to be some much, much weaker version of the thing. Take your motivation, take the thing you're trying to accomplish, and imagine it went as well as could possibly be expected. Look at the output, and then imagine that another 10 years of research happened on the same topic with the same goal as you, and the thing you were trying to do really just worked. How does the world change? I think this can eliminate a decent number of ideas.

Lynette: Do you think it's possible to do this before you have a good grasp of the field? Like before a pretty comprehensive overview, at least.

Rohin: For at least some of the papers in AI, totally. I don't know, there are definitely just some papers where I do this and I'm like, "How could you have ever thought that this was a thing that would happen in the real world?" It feels like what happened in that situation was that they were like, "Aha, we have this dataset, we have this algorithm, we could try running the algorithm on the dataset and see what happens." Then they get some results, and then they stick on some motivation and some conclusion and call it a paper. I just don't know what model of the world could cause them to produce that paper if we assumed that the motivation came before the work. "Bother to backchain" is, I think, my first piece of advice. I do think EAs and rationalists tend to be at least good enough to bother to backchain at all; probably people in your audience will already do this. After that, I think my advice is something like, "Have good models."

Lynette: Stepping away from the advice, I'm also just curious to hear the story of how you apply this yourself which often contains insights that you wouldn't capture in advice.

Rohin: The thing is, I don't feel like I apply this in any obvious way. Concretely, here are the things that happen from "idea" to "project completed."

Lynette: Cool.

Rohin: The first thing is, I'm showering or something and an idea pops into my head. Take the paper that I think was the coolest paper I did, not necessarily the most impactful, but the coolest. That was literally one that happened in the shower, where I was thinking about why it is the case that we think doing nothing is better than a random action. As I was thinking about that, I was like, "Ah, that's the reason." Then I was like, "And you could probably turn this into an algorithm for inferring things about human preferences." I was like, "Oh, man, that sounds good." That was the idea. I don't remember exactly what the timescale was. I thought about it a bit and was like, "Yes, seems solid." Then I sat down and wrote a Google Doc where I was like, "Okay, here is a full plan from now, where all I have is an idea, to 'and now the paper is published.'" I wrote down this plan. It had things like, "Here are the experiments we're going to run, here's how we're going to develop an algorithm: we're going to take this paper, we're going to modify it in some not totally obvious way. I don't know exactly what it'll be, but it'll be something along these lines." I just thought about it.

Lynette: Did you actually use that doc?

Rohin: I used the doc throughout the entire project. I definitely did not follow the plan that was in the doc. That was never the purpose of the doc.

Lynette: [laughs]

Rohin: But, you know, it was reasonably close. It was far too ambitious in that particular case about what experiments we would run, but if you ignore the ambitious parts and only focus on the parts that did happen, I think it more or less went the way I thought it would. The initial way I was thinking of developing the algorithm, we tried it, and it was kind of bad. There were issues. We tried to tweak it a bit. Then at some point, four or five weeks into this project, I was talking to Anca, and Anca was like, "Maybe we should think about it this way and derive it using this form of math." I was like, "Oh, man, that does sound better." We did it that way. I spent two weeks on that and got a different algorithm. That algorithm was way better because it was mathematically principled; we could understand it a lot better and we knew what it was going to do. It was excellent. So we didn't really follow the plan the way it was written. It was too ambitious. It underestimated the difficulty of developing an algorithm. It had a slightly wrong mathematical formalism that was buggy, or doing the wrong thing, in ways I hadn't noticed at the time and figured out over the course of the project. But in broad strokes, it was correct-ish.

I think the main thing I got out of writing that doc was, A, being convinced that this was a reasonable intern-plus-me project, and that the idea was solid enough that there would be a path to a paper which would, in broad strokes, follow this outline, even though I was sure roadblocks would come up. I had a much more visceral sense that we would hit those roadblocks and then probably surmount them. I wrote that doc and then just shelved it until the interns came and started doing some stuff. I basically already told you the rest of the story: we tried the initial version of the algorithm, it didn't work, and eventually, in one of our meetings, Anca made a great suggestion. It took a while for us to actually figure out how to do it, but once we did, everything worked. It was beautiful. Then we wrote the paper and submitted it.

Lynette: Cool. Do you have any strategies you do for writing or editing to make that process efficient?

Rohin: No. I am a very intuitive writer. I just sit down, open a Google Doc and start writing. Sometimes I write an outline first.

Lynette: [laughter] Okay.

Rohin: I usually write, "Here are the sections. Here's the overall flow." Often, I will just straight up write a section without outlining it first. I think this has mostly come about via practice. I remember doing a lot more outlining when I was a first-year PhD student.

Lynette: To get into some of your more traditional productivity stuff, I'm curious about how many hours you tend to work in a day.

Rohin: This is easier to do by week because it's more stable. I use Toggl to track my time. My Toggl-tracked time, I think, probably started out at an average of maybe 35 hours, and then over the course of a few years it has been climbing upwards. Now it is maybe at 40, excluding weeks where I take vacations. 40 to, not 45, that can't be the average; 40 to 43, let's say.

Lynette: Day to day, what causes it to fluctuate?

Rohin: I have this very weird productivity style where the default thing that I do is "work until there is a reason to not work." Then I allow lots and lots and lots of reasons to not work. It is a pretty common experience for me for someone to be like, "Hey, Rohin, do you want to play a board game?" in the middle of working, and I'm like, "Yes, give me five minutes to wrap up this task and then I'll play," even though I had no plans to do this at the beginning of the day. That's an example of a thing that can cause a fluctuation. Sometimes I'm just tired and don't get as much done. That's not that rare; it happens reasonably often, but we're talking on average once a week, though it's more like there are some weeks where it happens twice or thrice and then many weeks where it doesn't happen at all. Other ways it can fluctuate: sometimes I have to cook dinner for my house, which will knock off two hours, usually.

Lynette: I'm curious, how is this sustainable for you? Just making it the default that you're working all the time can lead to long days.

Rohin: I feel like in some sense the question doesn't make sense, where I'm like, "I don't know, man. I just do it. It's sustainable." Maybe I want to ask for a more specific question, but here are some objections I usually hear, or expect people to raise, to this. One objection I can imagine is that people would find it very stressful, because anytime they wanted to do anything that was not work, they'd have to weigh up the consequences, like, "Oh, man. What if I did one more hour of work versus watching this TV show?" That would be stressful. I would really not like the system if that's what I was doing. In my case it's more like I ask myself the question, "Hm, do I feel like doing this?" Then if the answer is yes, I do it.

One thing that happened when COVID hit: we started quarantining, and paradoxically I started having more social opportunities, because our house is fairly big and my housemates weren't having social things with their friends outside the house anymore. There was just a lot more socialization, but in the house. I think it did temporarily lead to a dip in my hours. Then in my weekly review, I was like, "Okay, I've done quite a bit of socializing. I feel like I've wanted to do a bit more work." I just slightly raised the internal level, the bar that needs to be met in order for me to do something. It's still usually a question of, do I feel like doing this? Which is a nice, easy question. Other benefits of this... sorry, forget the benefits. You were asking, how is this sustainable?

Lynette: I want to hear the benefits as well.

Rohin: One really nice benefit of this: you just never find yourself scrolling aimlessly through your phone, because you'd have to make an active decision to do that; by default, you're working. That's just not a thing you do anymore unless you actually want to do it. Plausibly some people would want to scroll through Reddit for half an hour. You can choose to do that. I just personally never choose to do it, or very rarely choose to do it.

Lynette: How did you come to have this mindset?

Rohin: Accident. I don't know, man. I got to college. I kept wanting to do more things in college because there were lots of things that seemed cool. I did more things, and somewhere along the line work became the default. I struggle to remember what exactly I used to do with my free time before that. I didn't scroll through my phone. I was lucky in that respect; I never developed the habit of scrolling through my phone.

Lynette: Do you think this would work if you didn't like your work?

Rohin: Oh, yes. That's another one. I'm not sure. It does seem to depend. I think it probably depends on not hating your work. I don't know if it depends on actively liking your work. I could imagine if you're like, “Eh, it's work. I don't mind it but it's not the thing that sparks joy,” that might be enough.

Lynette: At any point, do you notice that you get diminishing returns from continuing to work more time either in days or weeks?

Rohin: Yes, not in weeks, but in days definitely. In days, the ninth hour is sometimes fine, but almost always it's quite clearly worse. The sixth hour, I think, is usually fine, about as good as the first hour. Somewhere in between there's either a threshold or it's just continuously degrading.

Lynette: Is your energy just a downward slope over the day or are there--

Rohin: It's basically just a downward slope over the day, with maybe an upward slope for the first half hour while I'm waking up.

Lynette: If you don't mind sharing, I'd be interested in hearing a story of one of the biggest struggles that you've had to deal with?

Rohin: I think probably the biggest one was that for at least one year, and probably more like two, I was in AI safety and I felt like I was floundering around not knowing what the hell I was doing. That's a long time to be floundering around and not knowing what you're doing. Had it not been the case that I could see everyone else also floundering around and not knowing what they were doing, I might've given up. I remember also, for a good portion of that time, the first eight months or so, I'd start doing a thing, and then by the time I was midway through it, I'd be like, "Oh, man, this doesn't seem like an interesting thing anymore. It seems unimportant. I guess I'll finish it because I should probably finish it, and also a paper is good for my career, but maybe I should just be dumping it. Maybe I just have sunk cost fallacy." I still think it is plausible I just had sunk cost fallacy. There was definitely a sense of "man, am I accomplishing anything?" for at least that first year. It wasn't persistent or always there, but it came up every now and then.

Lynette: As you were going through that year, were you doing anything to try and test that feeling, or gather more evidence about whether this would be useful?

Rohin: I think the feeling was just very clearly correct. The entire point of my learning strategy was to have good models and then have beliefs that I actually believe in. I could tell that I did not have good models. I could point to specific things. I'm like, "Yes, wire-heading, I'm clearly confused about this concept and I cannot talk about it intelligently," or in a way that isn't spouting nonsense. I could point to it and say, "Look there, that's evidence of nonsense." I couldn't tell you how to fix the nonsense, but I could show you that it was clearly nonsense. I think I never really bothered testing the feeling or anything like that, just because it was very evident to me that it was true.

Lynette: It feels like the struggle here isn't that it's currently true. It's wondering if it will change.

Rohin: Yes. I don't think I explicitly did that part, or even implicitly did it. I had, and still have, a lot of self-esteem, and I was convinced that I could do it. Or rather, I was convinced that if any not-incredibly-tiny number of people could do it, then I could do it, and conditional on that being true, it was worth it for me to continue doing it.

Lynette: What is the single most important thing that you do or deliberately started for your productivity, some deliberate choice?

Rohin: Weekly plans. Every Monday morning, usually, but sometimes afternoon, I write: "Here are all the projects I have. Here are the next steps on them. Here are the ones I intend to do this week. Here's the amount of time I'm budgeting for them. Here's the amount of discretionary or free time that I'm leaving alone." Then the subsequent Monday I will go and check Toggl, tally up all my predictions, see how much time I actually spent on each of them, aggregate them into a few categories, and then look at how much time I'm spending across categories. It's been very useful both for calibrating how long things take and for giving me information about what I'm spending my time on, which I don't think I could've told you properly beforehand.
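(To make the review step concrete, here is a minimal sketch of the predicted-versus-actual comparison Rohin describes. The category names and hours are invented for illustration, and this is not his actual tooling; in practice the "actual" numbers would come from a Toggl report.)

```python
# Minimal sketch of a weekly review: compare budgeted vs. actual hours per
# category. Categories and numbers are made up for illustration; the "actual"
# figures would normally come from a Toggl report export.

predicted = {"research": 20.0, "newsletter": 8.0, "meetings": 5.0, "free time": 10.0}
actual    = {"research": 16.5, "newsletter": 9.0, "meetings": 6.0, "free time": 12.0}

for category, planned in predicted.items():
    spent = actual.get(category, 0.0)
    # Positive diff means more time spent than budgeted.
    print(f"{category:>10}: planned {planned:5.1f}h, spent {spent:5.1f}h ({spent - planned:+.1f}h)")

print(f"{'total':>10}: planned {sum(predicted.values()):5.1f}h, "
      f"spent {sum(actual.values()):5.1f}h")
```

The point is simply the side-by-side comparison: seeing predicted versus actual time per category each week is what feeds the calibration he mentions.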

Lynette: A little bit of a different track: I hear the term "good judgment," or sometimes just "thinking well," tossed around a lot within this community. You were saying some things about just having good models, which feels like one of the things people point to when they're gesturing vaguely at good judgment. What other skills, or what else, goes into this?

Rohin: Cool. I want to say good models feels fairly well defined to me, where you can judge it based on your predictions, essentially. Are your predictions calibrated, and, what's the other word for giving information? Do they have high resolution? I think that's the word. I think that one is reasonably easy to cash out.

Good judgment, on the other hand, I feel like really what's needed there is for someone to just say, "Here are all the things that could possibly go into good judgment," and then different people will have different subsets of skills that they need. Some things that could plausibly be in it: being able to quickly and efficiently find the crucial points for any given question, and being able to start with the right question. These are in some sense the same skill, in that if you start with one very big question, "How do I help the world?", then you want to find all the crucial points for it, and those become your subquestions. You don't necessarily have to structure your reasoning that way; you can say "I want to improve the world" and then go straight to a question like "What would be the best way for me to do that within AI safety?", which is several levels down. Those seem important to me.

I think another one would be balancing well between the inside view and the outside view, or gearsy models versus established practice. You can make this dichotomy in a bunch of ways; I think inside view versus outside view is a good overarching class. One example that will be familiar to a bunch of people is the unilateralist's curse. It's usually phrased as an outside view thing, but I think it's really a combination of the two. To briefly explain the unilateralist's curse in case a listener doesn't know it: suppose there are n people, all of whom can take some irreversible action. Let's say publishing information on how to make a nuke, which is irreversible in the sense that once the information is out there, you can't take it back and hide it away again. Assume that all n of these people are perfectly altruistic, or perfectly value-aligned: they're only going to take the action if they think it's net positive for the world. You could imagine a naive algorithm where each person consults their inside view and says, "If I think this is net positive, then I will do it." The problem is that even if people are only making mistakes at random, as long as just 1 person thinks it's net positive, even when the other n-1 think it's net negative, the action still happens and can't be reversed by those n-1 people. That seems bad: if n-1 people think it's bad and 1 thinks it's good, then probably it's bad and shouldn't be done. That's the unilateralist's curse.
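(To make the "mistakes at random" point concrete, here is a minimal simulation sketch; it is an illustration, not something from the interview. Each of n well-intentioned people independently misjudges a genuinely net-negative action as net positive with some small probability, and the action happens as soon as any one of them says yes. The 5% error rate is an assumed, made-up parameter.)

```python
import random

def chance_action_happens(n_people, error_rate, trials=100_000):
    """Estimate how often a net-negative, irreversible action gets taken when
    each of n_people independently (and mistakenly, with probability
    error_rate) judges it to be net positive, and any single 'yes' triggers
    the action."""
    count = 0
    for _ in range(trials):
        if any(random.random() < error_rate for _ in range(n_people)):
            count += 1
    return count / trials

# With a 5% individual error rate, the chance that *someone* acts grows
# quickly with the number of independent actors (analytically, 1 - 0.95**n).
for n in (1, 5, 10, 20):
    print(n, round(chance_action_happens(n, 0.05), 3))
```

With these assumed numbers, the chance of the irreversible action being taken rises from 5% with one actor to roughly 64% with twenty, which is the core of the curse: adding more independent decision-makers only increases the odds that someone errs in the "act" direction.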

The way I explained it, it sounds like an outside view argument, but in practice, when you're actually applying this to make decisions, you also have to decide who the set of n people is. That is a very inside-view-style judgment. Should I make my AI research public? If you include everyone in the AI research community and the AI safety community, the answer is just guaranteed to come out as "yes, you should make your research public." Well, maybe not anymore, but two or three years ago that was basically guaranteed. Maybe we should be excluding the broader AI research community because they're not that value-aligned, they're not thinking about extinction risk very much. Or maybe they are value-aligned, but because they're not thinking about extinction risk very much, they're not well placed to evaluate this, and we should defer to the people who have actually thought about it a bunch. Maybe that argues more for deferring to MIRI than to ML researchers who haven't thought about this as much. Then maybe you think, as I do, that MIRI puts too much weight on some considerations and not enough on others, and maybe that argues for down-weighting MIRI and including other people.

In practice, there's quite a lot of inside view that goes into this, and you need to decide at what point you stop applying your inside view and start applying the outside view. I don't know how to do this well, or rather, I don't know principled algorithms for how to do it, so I can't explain how to do it to anyone, but it's something I often find myself doing. I don't know if I'm doing it well, but it's a thing I am doing, and I think it is important.
