NSF Research Experiences for Undergraduates Site:
Our REU site participants will be engaged in research that spans three themes under the umbrella of a Safe, Usable, Reliable and Fair Internet: (1) communication and cybersecurity, (2) scientific experimentation and knowledge capture, and (3) social data science.
These themes are emerging today because of the increased use of the Internet in daily life. The abundance of applications and the ever-increasing traffic volume cannot be sustained by the current infrastructure. Today's businesses and critical infrastructure rely more and more on the Internet, which makes them attractive targets for cyberattacks. Young people also increasingly use the Internet for learning and communication, putting themselves at risk of cyberbullying, cyber-predators, and misinformation. The communication and cybersecurity theme explores solutions to these problems. Further, much information is automatically collected today by digital devices, and much is published online by various sources. This volume creates challenges for effectively making sense of information and detecting important pieces that can be presented to humans in a meaningful way. The scientific experimentation and knowledge capture theme explores these issues, and also focuses on broadening participation in science through education. Finally, the increasing use of social networks allows us to study how people connect to each other, how information propagates in social systems, and how it is acquired and processed by humans. The social data science theme comprises such projects.
Below are the projects we expect to offer during summer 2022.
Communication and cybersecurity
Anycast Visualization (Supervisor: John Heidemann)
USC operates B-Root, one of the 13 DNS servers that operate at the root of the Internet Domain Name System (above, e.g., .com and .edu). B-Root has several sites around the world, and we are adding more. B-Root uses anycast to associate users with a nearby site, and Verfploeter (research done at USC/ISI and U. Twente) to map who goes where. Data from this mapping is currently rendered only as static plots. We would like to put it into a dynamic website with zooming. Working with Professor Heidemann and Yuri Pradkin, the student will add anycast data to the current website by writing conversion programs, and analyze the data to extract information about traffic and client distribution over B-Root sites. The visualization we will build off is at our outage site.
Finding and Fixing Persistent IPv6 Problems (Supervisor: John Heidemann)
IPv6 is the next big thing, and it now accounts for about a third of Google's traffic, but it still has some rough edges: network data shows loss rates are about 2x higher for IPv6 in some cases, perhaps due to persistent IPv6 routing problems. In this project the student will analyze existing IPv4 and IPv6 data taken from thousands of sites to see whether elevated IPv6 loss rates are random or persist at specific sites, and, if they are persistent, to diagnose the problems so they can be fixed.
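One way to make the random-vs-persistent distinction concrete is a simple per-site comparison across measurement rounds. The sketch below is purely illustrative: the data format, threshold, and function name are assumptions, not the project's actual dataset or method.

```python
# Hypothetical sketch: for each site, compare IPv4 and IPv6 loss rates across
# measurement rounds and flag sites where IPv6 loss is consistently higher,
# suggesting a persistent problem rather than random noise.

def persistent_v6_problems(measurements, min_rounds=3, ratio=2.0):
    """measurements: {site: [(v4_loss, v6_loss), ...]} per measurement round."""
    flagged = {}
    for site, rounds in measurements.items():
        worse = [v6 > ratio * v4 for v4, v6 in rounds]
        # Persistent = IPv6 loss exceeds the threshold in every observed round.
        if len(worse) >= min_rounds and all(worse):
            flagged[site] = sum(v6 for _, v6 in rounds) / len(rounds)
    return flagged

data = {
    "site-a": [(0.01, 0.05), (0.02, 0.06), (0.01, 0.04)],  # consistently worse v6
    "site-b": [(0.01, 0.01), (0.02, 0.05), (0.01, 0.01)],  # transient spike only
}
print(persistent_v6_problems(data))  # only site-a is flagged
```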
Understanding DNS Resolvers (Supervisor: Jelena Mirkovic)
Today's DNS resolvers exhibit a wide range of unexpected behaviors, including malformed queries, overly aggressive querying, or very sporadic querying. We would like to enumerate these patterns of misbehavior and understand their prevalence and possible causes. Students will analyze DNS traffic, identify unexpected behaviors using machine learning, and quantify the dominant behaviors.
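As a minimal first pass before any ML clustering, one might bucket resolvers by simple rate features computed from a query log. The field names, thresholds, and classes below are illustrative assumptions, not the project's actual pipeline.

```python
# Hypothetical sketch: compute per-resolver features from a DNS query log and
# assign coarse behavior classes (malformed / aggressive / sporadic / normal).

from collections import defaultdict

def classify_resolvers(log, aggressive_qps=50, sporadic_qps=0.01):
    """log: list of (timestamp_s, resolver_ip, is_malformed) tuples."""
    stats = defaultdict(lambda: {"n": 0, "bad": 0, "t0": None, "t1": None})
    for ts, ip, malformed in log:
        s = stats[ip]
        s["n"] += 1
        s["bad"] += malformed
        s["t0"] = ts if s["t0"] is None else min(s["t0"], ts)
        s["t1"] = ts if s["t1"] is None else max(s["t1"], ts)
    classes = {}
    for ip, s in stats.items():
        span = max(s["t1"] - s["t0"], 1)  # avoid division by zero
        qps = s["n"] / span
        if s["bad"] / s["n"] > 0.1:
            classes[ip] = "malformed"
        elif qps > aggressive_qps:
            classes[ip] = "aggressive"
        elif qps < sporadic_qps:
            classes[ip] = "sporadic"
        else:
            classes[ip] = "normal"
    return classes

log = [(0, "1.1.1.1", True), (1, "1.1.1.1", True), (2, "1.1.1.1", False),
       (0, "2.2.2.2", False), (30, "2.2.2.2", False)]
print(classify_resolvers(log))  # first resolver malformed, second normal
```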
Discovering Vulnerabilities in Embedded Code (Supervisor: Christophe Hauser)
With the advent of the Internet of Things (IoT), which largely relies on low-power devices, a number of standard abstractions and algorithms are often re-implemented by manufacturers as part of their firmware images. Unfortunately, such implementations often lack best practices and present an increased attack surface due to their ad-hoc designs. In this project, we will focus in particular on the problem of unsafe implementations that may lead to security vulnerabilities, and on means to automatically detect them in real-world software. Students will work on reviewing and reproducing existing vulnerabilities in an isolated, emulated environment and on defining programmatic models to characterize such vulnerabilities.
Cyber-physical Fuzzing Framework (Supervisor: Luis Garcia)
Fuzzing is a standard automated technique for testing software for unwanted or unexpected behavior against different software inputs. When we put software on cyber-physical systems, i.e., systems that interact with the physical world through sensors and actuators, fuzzing the software alone is not sufficient for triggering unwanted behaviors. The sense-to-actuate pipeline is vulnerable to cyber-physical side channels, such as sensor spoofing or false data injection. At the same time, we cannot simply subject the physical systems to arbitrary signal generators, as the systems could be damaged and the process would be costly. If we opt for simulation, modeling these behaviors requires capturing the physical dynamics of sensors and actuators with high fidelity, and the models must also capture how sensor and actuator signals interact with the software. Students will study existing techniques for capturing sensor and actuator dynamics in simulation, and will design and implement a prototype on an existing cyber-physical system testbed that incorporates these models.
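To illustrate the baseline software-only fuzzing the paragraph starts from, the toy sketch below feeds randomized sensor readings into an invented controller routine and records inputs that trigger exceptions. Real cyber-physical fuzzing would additionally model sensor and actuator dynamics; everything here is a hypothetical example.

```python
# Toy fuzzing sketch: random inputs against a hypothetical sense-to-actuate step.

import random

def controller(reading):
    """Hypothetical controller: convert a sensor reading to a command in [0, 1]."""
    if reading < 0:
        raise ValueError("negative reading")  # latent bug a fuzzer should find
    return min(reading / 100.0, 1.0)

def fuzz(target, trials=1000, seed=42):
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        x = rng.uniform(-1000, 1000)  # random input far outside nominal range
        try:
            target(x)
        except Exception:
            crashes.append(x)
    return crashes

crashes = fuzz(controller)
print(f"{len(crashes)} crashing inputs found")
```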
Automating the Generation of Software Knowledge Bases for Reverse Engineering Embedded Systems (Supervisor: Luis Garcia)
In today's world, we increasingly incorporate untrusted, third-party embedded systems into our Internet of Things. Reverse engineering is a fundamental skill for assessing the security of third-party system software. However, the raw binary files are meaningless without understanding the underlying hardware, such as a microprocessor's architecture and memory layout. Fortunately, manufacturers often use chips from other Original Equipment Manufacturers (OEMs) whose datasheets are publicly available online. However, extracting the information from these datasheets is a tedious process that requires manually entering information into reverse engineering tools. Students will study current techniques for extracting software knowledge bases from datasheets and design and implement a prototype that directly incorporates these knowledge bases into state-of-the-art reverse engineering tools.
Modeling Human Operators in Safety-critical Industrial Control Systems (Supervisor: Luis Garcia and Dave DeAngelis)
The industrial control systems (ICS) that our society relies on are built on a complex, interconnected arrangement of hardware, software, networks, people, and data, all operating in the physical world. Existing cyber-defense measures to protect ICSes have regularly proven inadequate as threats from nation-state actors and other advanced adversaries rise, crippling safety-critical applications, endangering lives, and causing tens of billions of dollars' worth of damage. The capabilities of these systems increase as they become more interconnected, autonomous, and remotely administered. With these new capabilities, the complexity of ICSes is growing rapidly. Industries need more sophisticated automated surveillance and remediation of these complex, autonomous industrial control systems. We propose to develop a new method for generating and monitoring *cyber-physical invariants* in industrial control systems and automatically taking remedial actions in order to advance safety and security.
In this project, the students will focus on a subset of this problem: how can we model human interaction in complex industrial control systems? Recent history has shown that human operators can act either as a last line of defense against intrusions or as the source of the damage themselves, via human error or insider threats. Thus, the students will build on existing research that models human behavior relative to security protocols and cyber-physical safety models.
Detecting Bias in College Football Recruiting (Supervisor: Jeremy Abramson)
College football recruiting is big business. This project aims to determine whether there are biases in whom college football coaches recruit and how they recruit them. By creating a comprehensive data set of college recruits and integrating it with current socioeconomic markers (e.g., census data), we hope to determine whether there are patterns and biases in recruiting: who is recruited (in terms of socioeconomic and ethnographic markers) and where coaches recruit players (in terms of geography), independent of talent.
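The data-integration step described above can be sketched as a simple join of recruit records to census markers. The field names, ZIP codes, and income cutoff below are invented for illustration only.

```python
# Hypothetical sketch: join recruit records to census socioeconomic markers
# by ZIP code, then count recruits per income bracket.

recruits = [
    {"name": "A", "zip": "90001", "stars": 4},
    {"name": "B", "zip": "90001", "stars": 3},
    {"name": "C", "zip": "10001", "stars": 4},
]
census = {"90001": {"median_income": 45000}, "10001": {"median_income": 90000}}

def recruits_by_income(recruits, census, cutoff=60000):
    counts = {"below": 0, "above": 0}
    for r in recruits:
        info = census.get(r["zip"])
        if info is None:
            continue  # no census match for this ZIP
        bracket = "below" if info["median_income"] < cutoff else "above"
        counts[bracket] += 1
    return counts

print(recruits_by_income(recruits, census))  # {'below': 2, 'above': 1}
```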
Scientific Experimentation and Knowledge Capture
Open-world Game Theory (Supervisor: Mayank Kejriwal)
Open-world learning has taken on new importance in recent years as AI systems continue to be applied and transitioned to real-world settings where unexpected events (‘novelties’) can, and do, occur. The situation becomes even more complex when multiple agents are involved, each trying to selfishly maximize their gains (such as winning a game). Game theory is one way to model such agent interactions, including competition and cooperation. It can also be used to rigorously model the informational complexity of certain environments, including agent payoffs, decision matrices and incomplete information. Using game theory to model the open world is an interesting and open challenge with potential for broad impacts. The student will work with Prof. Kejriwal and his graduate students to develop both theoretical and empirical methodologies in support of open-world game theory.
Knowledge Acquisition with Indirect Supervision (Supervisor: Muhao Chen)
Knowledge acquisition (e.g., relation extraction, entity and event typing) faces challenges including extreme label spaces, few-shot/zero-shot prediction, and out-of-domain prediction. To this end, we study methods for leveraging indirect supervision signals from auxiliary tasks (e.g., natural language inference, text summarization, etc.) to foster robust and generalizable inference for knowledge acquisition. In the same context, we study methods for generating semantically rich label representations based on either gloss knowledge or structural knowledge from a well-populated lexical knowledge base, in order to better support learning with limited labels.
Event-Centric Natural Language Processing (Supervisor: Muhao Chen)
Human languages evolve to communicate about events happening in the real world. Therefore, understanding events plays a critical role in natural language understanding (NLU). A key challenge to this mission lies in the fact that events are not just simple, standalone predicates. Rather, they are often described at different granularities, temporally form event processes, and are directed by specific central goals in a context. Our research in this line helps machines understand events described in natural language. This includes understanding how events are connected, form processes, or form complex structures, and recognizing typical properties of events (e.g., space, time, salience, essentiality, implicitness, membership, etc.).
Robust Information Extraction from Human Language Text (Supervisor: Muhao Chen)
Knowledge graphs (KGs) provide both open-world and domain-specific knowledge representations that are integral to many AI systems. However, constructing KGs is usually very costly and requires extensive effort. A widely attempted solution is to learn knowledge acquisition models that automatically induce structured knowledge from unstructured text. However, such models developed through data-driven machine learning are usually fragile to noise in learning resources, and may fall short of providing reliable inference on large, heterogeneous real-world data. We are developing a general meta-learning framework that seeks to systematically improve the robustness of learning and inference for data-driven knowledge acquisition models. We seek to solve several key problems to accomplish this goal: (i) how to identify incorrect training labels and prevent overfitting on noisy labels; (ii) how to detect invalid input instances at inference time (e.g., out-of-distribution ones) and provide abstention awareness; (iii) how to automatically learn constraints that strengthen model inference with global consistency; and (iv) how to automatically augment training signals for the knowledge acquisition model or the backbone language model.
Machine Commonsense Reasoning with Minimal Supervision (Supervisor: Muhao Chen)
Various types of commonsense inference tasks challenge state-of-the-art language models. Such tasks may include inferring preconditions of facts, typical properties of entities and events (e.g., time, scales, and numerical properties), and typical relations (e.g., ordering and membership of events, topological relations of entities). While annotating data for these aspects of commonsense inference can be costly, we seek to minimize reliance on expensive annotations and instead develop linguistic pattern mining techniques to find vast amounts of cheap (though admittedly noisy) supervision data on the Web, leading towards a scalable and generalizable solution that improves commonsense inference through distant supervision.
Generation of a Sports-based Introductory Data Science Curriculum to Increase Participation of Underrepresented Groups in STEM (Supervisor: Jeremy Abramson)
As the requirements for success in the workforce become increasingly technical, there is a commensurate need for curricula that can engage and capture the imagination of students, especially those from groups traditionally underrepresented in STEM. One way to reach these groups is via curricula that appeal to contexts with which they are familiar and engaged, such as sports. To that end, this project will explore the development of a sports-based introductory data science curriculum with the goal of engaging students who might otherwise not be interested in pursuing data science as a career. Students will work on generating illustrative code examples and problem sets in Python using sports examples.
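One possible flavor of such a problem set is a beginner exercise computing basketball points per game. The player data below is invented; the point is the pedagogical framing, not the numbers.

```python
# Beginner-level sports exercise: average points per game for each player.

games = [
    {"player": "Jordan", "points": 32},
    {"player": "Jordan", "points": 28},
    {"player": "Pippen", "points": 18},
]

def points_per_game(games):
    totals, counts = {}, {}
    for g in games:
        totals[g["player"]] = totals.get(g["player"], 0) + g["points"]
        counts[g["player"]] = counts.get(g["player"], 0) + 1
    return {p: totals[p] / counts[p] for p in totals}

print(points_per_game(games))  # {'Jordan': 30.0, 'Pippen': 18.0}
```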
Long form dialogue bots (Supervisor: Genevieve Bartlett)
Most state-of-the-art dialogue engines are designed to manage chat bots and work over short-form communication, where both parties use short phrases or single sentences to continue a back-and-forth conversation. These dialogue engines are not geared towards carrying on longer-form communication such as that found in an email thread. This project seeks to piece together existing tools to extend dialogue capabilities to long-form communication, drawing from AI storytelling and existing chat-based bots combined with a rule-based system. Students will research and interact with existing tools, annotate data, and participate in creating surveys to evaluate rules and tool combinations.
Are STEM Faculty Members Diverse Enough? (Supervisor: Mohammad Rostami)
Diversifying STEM faculty has been a major concern of many universities in recent years. It is a widespread belief that affirmative action is being used to diversify STEM faculty, in particular in terms of gender. However, it is not clear whether this belief reflects reality. In this project, students will address the following questions: (1) are STEM faculty members diverse? (2) do we observe any diversification trend reflected in recent hires? (3) what policies may help make more progress towards this goal?
The ideal candidate is a strongly motivated student with a good understanding of data science methods to analyze collected data. Python coding fluency and previous experience with data collection are preferred.
Is SSIM a Fair Image Quality Assessment Metric? (Supervisor: Mohammad Rostami)
Structural similarity (SSIM) was proposed in 2004 as a method to measure image quality, and it has since become a very popular metric (over 28,000 citations), often used instead of more traditional metrics such as PSNR. The goal of this project is to study whether SSIM is fair in measuring image quality and, if its measurements are unfair, what can be done to make SSIM a fairer metric.
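To make the metric under study concrete, here is a simplified, single-window SSIM computation; the published metric applies the same formula over local sliding windows with Gaussian weighting. The constants follow the 2004 paper's defaults for 8-bit images (K1 = 0.01, K2 = 0.03, L = 255); the sample pixel values are arbitrary.

```python
# Simplified global SSIM: luminance, contrast, and structure terms combined
# into SSIM(x, y) = ((2*mx*my + C1)(2*cov + C2)) / ((mx^2 + my^2 + C1)(vx + vy + C2)).

def ssim_global(x, y, L=255, K1=0.01, K2=0.03):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2)
    )

img = [52, 55, 61, 59, 79, 61, 76, 61]
print(ssim_global(img, img))                    # 1.0: identical images
print(ssim_global(img, [v + 20 for v in img]))  # < 1.0: brightness shift
```

Note that a uniform brightness shift preserves structure perfectly yet still lowers SSIM through the luminance term, which is one example of a behavior whose fairness across image content could be examined.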
The ideal candidate is a strongly motivated student with a good understanding of image processing and machine learning. Python and MATLAB coding fluency is strongly preferred.
Record Breaking Audio Reconstruction (Supervisor: Keith Burghardt)
Significant efforts have been made to digitize historic audio records, but their fidelity is greatly limited by the quality of wax or vinyl as well as its inevitable degradation. These important recordings are therefore exceedingly difficult to appreciate or even understand. Despite the motivation to reconstruct audio, the field is under-explored compared to image and video reconstruction, which is used in everything from colorization to movie remastering. In this project, we will improve upon the state of the art with transformer-based audio reconstruction. We will then create easy-to-use, general-purpose tools for historians and even laypeople to listen to otherwise heavily degraded historic recordings. This could also be useful, for example, for automating speech-to-text for noisy signals and other broader AI tools.
Social Data Science
Social Graph Analysis and Attribution of Software Exploit Contributors Using GitHub (Supervisor: Jeremy Abramson)
Attribution of threat actors is an increasingly important and difficult problem. One potential mitigation is the early detection of potential threat actors via analysis of open-source intelligence (OSINT). This project will analyze the social graph of users who contribute to, follow, star, and otherwise interact with proof-of-concept CVE implementations and other relevant, potentially malicious (e.g., software vulnerability) repositories. These social graphs will be analyzed to see whether potential "black hat" threat actors have networks that differ from those of their "white hat" counterparts. If successful, such a project could help speed the discovery of dangerous threat actors and aid in linking threat actor personas on the internet.
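A minimal sketch of the feature-extraction step: given repo-interaction edges, build each user's co-contributor neighborhood and compute simple features (degree, fraction of interactions with exploit-tagged repos) that could later feed a comparison of the two populations. The edge data and repo tags are invented for illustration.

```python
# Hypothetical social-graph feature extraction from (user, repo) interactions.

from collections import defaultdict

def user_features(edges, exploit_repos):
    """edges: iterable of (user, repo) interaction pairs."""
    repos_by_user = defaultdict(set)
    users_by_repo = defaultdict(set)
    for user, repo in edges:
        repos_by_user[user].add(repo)
        users_by_repo[repo].add(user)
    features = {}
    for user, repos in repos_by_user.items():
        neighbors = set()
        for repo in repos:
            neighbors |= users_by_repo[repo]  # co-interactors on shared repos
        neighbors.discard(user)
        exploit_frac = len(repos & exploit_repos) / len(repos)
        features[user] = {"degree": len(neighbors), "exploit_frac": exploit_frac}
    return features

edges = [("alice", "cve-poc"), ("bob", "cve-poc"), ("bob", "webapp")]
print(user_features(edges, {"cve-poc"}))
```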
Understanding Echo Chambers and Extremism in Social Media (Supervisor: Keith Burghardt and Goran Muric)
Fringe and potentially harmful narratives pertaining to racism, conspiracy theories, or anti-science sentiment are present within a large and recently active minority of users on the web. Many theories exist to explain the rise in extremism, but past research has not fully explored the impact online echo chambers may have on this rise. In this project, we will explore the impact of users joining extremist groups on social media, with Reddit as a real-world testbed. Understanding this effect in realistic settings, however, can be a challenge because those who join extremist groups are unlike the population at large. We will therefore apply methods from the field of causality to mitigate this difficulty and study the effect on users of joining a variety of fringe groups, from hate speech to conspiracy to anti-science and anti-vaccine groups.
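One of the simplest causal-inference tools alluded to above is matching: pair each user who joined a fringe group ("treated") with the most similar non-joiner on pre-join covariates, then compare outcomes within pairs. The covariate, outcome, and data below are invented; real work would use propensity scores and many more features.

```python
# Hypothetical nearest-neighbor matching sketch for an observational comparison.

def matched_effect(treated, control, covariate="activity", outcome="toxicity"):
    diffs = []
    for t in treated:
        # Match each treated user to the closest control on the pre-treatment covariate.
        c = min(control, key=lambda u: abs(u[covariate] - t[covariate]))
        diffs.append(t[outcome] - c[outcome])
    return sum(diffs) / len(diffs)

treated = [{"activity": 10, "toxicity": 0.6}, {"activity": 50, "toxicity": 0.8}]
control = [{"activity": 12, "toxicity": 0.2}, {"activity": 48, "toxicity": 0.3}]
print(matched_effect(treated, control))  # average within-pair outcome difference
```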
Bias Awareness in your News Diet (Supervisor: Jelena Mirkovic, Fred Morstatter)
Tackling fake news, misleading news, and misinformation is of utmost importance in today's digital landscape. Unfortunately, research has shown that simply showing content from the opposing viewpoint only hardens one's preconceived views. In this project, students will explore the best ways to inform people of the types and extent of bias present in news articles, and they will build a browser plugin to automatically inform users.
Social Media Analysis (Supervisor: Emilio Ferrara)
Social media have become pervasive tools for planetary-scale communication and now play a central role in our society: from political discussion to social issues, from entertainment to business, such platforms shape the real-time worldwide conversation. But with new technologies also comes abuse: social media have been used for malicious activities, including public opinion manipulation, propaganda, and coordination of cyberattacks. The selected candidate will work on projects related to social media analysis, in particular studying the behavior of social media users and the dynamics of use and abuse of social platforms for a variety of purposes, including the spread of fake news and social bots.
The ideal candidate is a computationally-minded and strongly motivated student with a clear understanding of social networks, machine learning, and data science methods and applications. Python coding fluency and previous experience with social media mining are preferred.