NYU Center for Data Science, Meet the Researcher: Anton Strezhnev
Meet Anton Strezhnev. Prior to joining CDS as a Moore-Sloan Data Science Fellow, Anton obtained his PhD from the Department of Government at Harvard University in 2018. His current work focuses on applied statistics, causal inference, and global economic governance. We caught up with Anton to discuss his research, what kickstarted his path to data science, and his overall experience at CDS.
This interview has been lightly edited for clarity.
I see you received your PhD in 2018 from the Department of Government at Harvard University. What brought you to CDS?
My background is in political science. Most of my work is at the intersection of applied statistics and, substantively, the study of legal systems, in particular international law. A lot of my recent work has been on international arbitration and doing empirical work trying to disentangle certain elements of that system. But recently, it’s expanded outwards towards empirical legal studies more broadly. Prior to being at CDS, I was a post-doc at the University of Pennsylvania Law School, where I worked on a number of applied legal projects, quite a few of which are continuing to the present day.
Towards the end of that post-doc, I ended up in touch with Arthur Spirling, who recommended that I apply for the CDS faculty position. He was actually one of my old professors back at Harvard; I took one of the later graduate methods courses with him. Arthur is a wonderful colleague and from what I heard has been really great at promoting the social science elements of CDS and the combination of data science and social sciences. There was sort of this nice confluence of interest on the part of CDS for someone who does empirical legal analysis and empirical legal work, and someone who also does work in the causal inference area. As part of the Moore-Sloan Fellowship, I’ve been teaching the undergraduate causal inference course. This semester has not gone as expected with the lockdown, but enrollment is actually at the cap for next semester so I guess people are excited about it.
Lately, a lot of what data science and machine learning is about is solving various predictive problems on high dimensional data. Conversely, a lot of my work tends to be more towards how we identify causal effects from data. In the experimental world, this tends to be easier, though certainly experiments raise a lot of design issues. But what I also really care about is observational designs for causal inference — essentially asking the question of how to say that something is causal in a world where we don’t get to directly randomize the treatment of interest. There, I think there’s growing demand for the kinds of tools being developed in data science.
When did you first become interested in data science?
I think it dates back to when I was an undergraduate at Georgetown University and my first research assistant role. As part of my funding package, I was eligible for work study and I was looking for a position on campus.
One of the faculty members in the Department of Government was looking for someone who could code in Python. And as a kid I had coded in Python as a hobby. I think I still have a couple of old games that I wrote back in the early 2000s. I probably have a clone of Tetris that I wrote lying around somewhere on an old CD. So, I thought that I might as well put some of this to use.
I got tasked with a project porting over and modifying some code that was doing a topic/text analysis of the Enron email corpus. This project in particular was interested in quantifying and measuring the extent to which business executives within Enron discussed politics and in what ways. This professor, Dan Hopkins (who is now at the University of Pennsylvania), and I formed a great collaborative relationship and worked on a whole bunch of projects. My exposure to statistics at the time was pretty minimal, but by working on these projects I eventually started to pick up the basics of probability, estimation and statistical inference along with coding in statistical computing languages, especially R. And that led to another collaboration with a faculty member, Erik Voeten, who was working on projects that were more in my area on international courts and international organizations. We eventually wound up working together on this really big project building a model to estimate state preferences on international issues over time using voting in the UN General Assembly. A lot of existing empirical work used voting agreement as a measure of how aligned the preferences of two states are. This project was trying to improve the quality of this measure by accounting for the fact that the issues being voted on change every single year. So it was essentially trying to build a model of preference change and agenda change simultaneously, leveraging the fact that some resolutions get voted on repeatedly year after year. It was a really cool project and is still probably one of my best-cited papers. I found that I really enjoyed doing social science research and so grad school was something that I was interested in doing. I found in grad school that a lot of my interests were being drawn towards data science, applied statistics and developing new tools.
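For readers unfamiliar with the simple voting-agreement measure mentioned above, here is a minimal Python sketch. It is not the model built in the UN General Assembly project; it only shows the naive baseline that the project aimed to improve on. The resolution IDs and votes are hypothetical, made up for illustration.

```python
def agreement_score(votes_a, votes_b):
    """Share of jointly voted resolutions on which two states cast the same vote.

    votes_a and votes_b map a resolution ID to a vote label.
    Returns None when the two states share no resolutions.
    """
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return None
    same = sum(votes_a[r] == votes_b[r] for r in shared)
    return same / len(shared)


# Hypothetical voting records for two states in one session.
state_a = {"R1": "yes", "R2": "no", "R3": "abstain", "R4": "yes"}
state_b = {"R1": "yes", "R2": "yes", "R3": "abstain"}

score = agreement_score(state_a, state_b)  # agreement on R1 and R3, not R2
```

Because the set of resolutions changes from year to year, a score like this can move even when neither state's preferences do, which is the issue the interview describes.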
So tell me a little bit about what you’ve been working on at CDS and describe your overall experience here so far.
CDS has been wonderful, both the community of folks in the department and the diversity of projects that everyone is working on. Like I know very, very little about computer vision, but there are a lot of people who I would (at least before the quarantine) sit next to who were working on projects dealing with interesting dimensionality reduction problems. As it turns out, we encounter similar challenges in causal inference when trying to adjust for a large number of covariates. I think the Faculty Fellows are a really great diverse bunch; we’re working on a lot of very, very different projects and I think CDS encourages us to talk about the common, similar, methodological problems that we encounter.
A lot of what I’ve been trying to do at CDS is to further develop some of my projects on methodology for causal inference. Some of my dissertation work was looking at a problem that arises in what are called difference-in-differences designs. A lot of these studies are looking at, for example, states or cities implementing policies at different times. To give a recent example: state adoption of stay-at-home orders. I would expect that there are certainly plenty of papers coming in the near future using this sort of design. Some states implement a policy earlier than others and some don’t implement any policy at all. We know that these groups probably differ in other ways, so looking at the difference in whatever the outcome of interest is between them isn’t going to reflect the causal effect of the policy we care about. The difference-in-differences strategy tries to adjust for the bias by starting with that naive difference between states that implemented the policy and those that didn’t and then subtracting off the difference in the observed outcome between those two groups from past periods when neither had adopted the policy yet. The problem that a lot of researchers are dealing with is that this strategy can be tough to generalize to the “staggered adoption” setting, where instead of having one group of units that all adopt a policy at the same time and another control group that never does, most units adopt the policy eventually but at different times. Some of what I’ve been working on here is figuring out how to do covariate adjustment properly, especially for variables that are changing over time, without introducing additional biases.
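The basic two-group, two-period version of the difference-in-differences calculation described above can be sketched in a few lines of Python. The group means below are made-up numbers for illustration, not results from any study, and the sketch does not cover the staggered adoption or covariate adjustment problems. The estimate has a causal interpretation only under the usual assumption that the two groups would have followed parallel trends without the policy.

```python
# Hypothetical average outcomes by group and period (illustrative numbers only).
means = {
    ("adopted", "before"): 10.0,
    ("adopted", "after"): 14.0,
    ("never_adopted", "before"): 8.0,
    ("never_adopted", "after"): 9.0,
}


def naive_difference(m):
    """Gap between the groups after the policy is in place."""
    return m[("adopted", "after")] - m[("never_adopted", "after")]


def pre_period_difference(m):
    """Gap between the same groups before either had adopted the policy."""
    return m[("adopted", "before")] - m[("never_adopted", "before")]


def difference_in_differences(m):
    """Naive post-period gap minus the pre-existing gap."""
    return naive_difference(m) - pre_period_difference(m)


naive = naive_difference(means)           # 14.0 - 9.0 = 5.0
baseline_gap = pre_period_difference(means)  # 10.0 - 8.0 = 2.0
did = difference_in_differences(means)    # 5.0 - 2.0 = 3.0
```

In this toy example the naive comparison overstates the effect because the adopting group already had higher outcomes before the policy.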
One thing we want to avoid in causal inference is adjusting for variables that are affected by the treatment we care about, otherwise that risks creating bias. It can be tough to do this in settings where you have time series data and multiple treatments assigned in a sequence over time — you can get feedback between the treatment and the confounders. A challenge in this area is figuring out how to do covariate adjustment without introducing too many additional modeling assumptions. This is a common theme in a lot of causal inference research and is one of the motivations for another interesting paper that I’ve been working on with a colleague at Harvard — Matt Blackwell. We’ve been working on extending “matching” methods, which can help address problems of model dependence, to observational designs with multi-shot treatment regimes. These settings are particularly difficult since we can have some variables that are a consequence of the first treatment affect the second treatment and the outcome — so both post-treatment *and* a confounder.
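The point about not adjusting for variables affected by the treatment can be illustrated with a small simulation. This is a minimal single-treatment sketch under assumptions chosen for illustration (the data-generating process, variable names, and sample size are all hypothetical), not the multi-period setting or the matching method discussed above. Treatment is randomized and has no effect on the outcome, yet conditioning on a post-treatment variable produces a spurious effect.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Simulate (treatment, post-treatment variable, outcome) triples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        d = rng.random() < 0.5          # randomized treatment
        u = rng.random() < 0.5          # unobserved cause of the outcome
        m = d or u                      # post-treatment variable, affected by d and u
        y = float(u) + rng.gauss(0, 1)  # outcome; the true effect of d is zero
        rows.append((d, m, y))
    return rows


def diff_in_means(rows):
    """Treated-minus-control difference in average outcomes."""
    treated = [y for d, _, y in rows if d]
    control = [y for d, _, y in rows if not d]
    return mean(treated) - mean(control)


rows = simulate()
unadjusted = diff_in_means(rows)                     # close to the true effect of 0
adjusted = diff_in_means([r for r in rows if r[1]])  # restricted to m == True
```

Among units with `m` true, every control unit must have `u` true while only about half of treated units do, so the conditioned comparison is pulled toward roughly -0.5 even though the treatment does nothing.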
Another very fun project that began when I was a post-doc at the University of Pennsylvania is a study of evictions and leases in Philadelphia with my colleague David Hoffman at the law school. For about a year, we’ve been analyzing the dataset of about a couple hundred thousand eviction cases from the Philadelphia Landlord-Tenant court. And we’ve been trying to look at whether there is variation in how successful tenants are at defending against eviction that can be explained by various features of the leases. So we have, in addition to the court records, many leases that are attached as exhibits.
A large portion of this research has actually been data cleaning — these leases are often low-quality scans with a lot of handwritten components. We’ve worked on developing some search algorithms for the digitized text to determine whether a lease has certain types of provisions. One provision that we’re particularly interested in is a waiver of notice — basically waiving a tenant’s right to receive a written notice from the landlord in advance of starting court proceedings (typically 10–15 days in advance). We’ve found that one of the most common outcomes in landlord-tenant court is a default judgment in favor of the landlord — essentially when the tenant fails to appear in court and the landlord wins simply by showing up. We think one way leases might matter is by affecting how capable tenants are in defending themselves; by not receiving advance notice tenants may be less prepared to contest the eviction in court. Essentially the judgment goes to the person who showed up in court. About one third of the battle in Philadelphia is to appear so we’re looking at what predicts whether or not people are able to show up. One of our other recent discoveries was using Google Maps data. Even after controlling for demographic characteristics at the census tract level, the longer it takes to commute to the Philadelphia Municipal Court from the address of the rental property, the more likely it is the tenant will lose by default.
This relationship was especially strong among households in public housing or with subsidized leases. So just being more able to get to the courthouse on time makes it more likely that a tenant won’t lose automatically. I think Philadelphia is currently in a position to revise and rethink a lot of its landlord/tenant laws. There have been activists looking into things such as expanding access to representation for tenants and we hope our research can help inform some of those decisions.
What role do you see your work playing in the future of data science?
I would hope my research contributes to making it easier for applied researchers to do causal inference in observational settings where you don’t necessarily have the advantage of being able to directly control the assignment of treatment and do so randomly. Sometimes you’re stuck in this world where you have a lot of data that’s very easy to collect, but it is still difficult or impossible to actually run an experiment. But you still want to say something causal. The question of how to do adjustment for potential confounders correctly and in an efficient manner is a subject of research that I’m really interested in.
I’ll actually be finishing out my Moore-Sloan Faculty Fellowship next year. After that, in Fall of 2021, I will be starting an assistant professor position at the University of Chicago in the political science department as a political methodologist. I think nowadays applied statistics researchers in political science overlap a lot with folks in economics, sociology and other social science disciplines in terms of basic agreement on how to think about causation and make causal claims. These disciplines are increasingly all on the same page and can talk to each other relatively well. My substantive interests are in political science but hopefully, a lot of what we’re developing has applications to the social sciences more broadly. I think one of the things I hope to push the frontier forward on is the question of how we do causal inference better when we’re stuck with observational designs.
Congrats on the role at the University of Chicago. That’s definitely exciting… Do you have any final thoughts or comments you’d like to share about yourself, your work, CDS or just data science in general?
Data science is kind of in this interesting place where there is a lot of demand for it but no one really knows what it means. Does it have its own unique identity that’s different from other disciplines? What I like about CDS is that it really takes interdisciplinarity to heart, and it’s sort of willing and able to pull together folks from a wide array of backgrounds. Where I think that really helps is in connecting students to jobs. The role of a “data scientist” will vary wildly from company to company and so having a well-rounded understanding of statistical modeling from a variety of perspectives is really helpful to our graduates.
I think CDS benefits from its industry connections and contacts and that experience helps inform and improve the curriculum. I also appreciate that CDS has embraced a broad view of data science as a discipline. Often, the topics that get the most attention in data science are the big, high-dimensional prediction problems and neural networks, but I’m glad that causal inference is also seen as important and is a part of the core requirements for the recently-created data science major for undergraduates. Data scientists, especially at companies that can’t afford to develop research teams focused on very narrow topics, are going to be doing a lot of different things on any given day. And I appreciate that CDS has structured its teaching with the aim of training very well-rounded data scientists.
By Ashley C. McDonald