NSF Research Experiences for Undergraduates Site:
Our REU site participants will be engaged in research that spans three themes under the umbrella of a Safe, Usable, Reliable and Fair Internet: (1) communication and cybersecurity, (2) scientific experimentation and knowledge capture, and (3) social data science.
These themes are emerging today because of the increased use of the Internet in daily life. The abundance of applications and the ever-increasing traffic volume cannot be sustained by the current infrastructure. Today's businesses and critical infrastructure rely more and more on the Internet, which makes them attractive targets for cyberattacks. Young people also increasingly use the Internet for learning and communication, putting themselves at risk of cyberbullying, cyber-predators, and misinformation. The communication and cybersecurity theme explores solutions to these problems. Further, much information is automatically collected today by digital devices, and much is published online by various sources. This volume creates challenges for effectively making sense of information and detecting important pieces that can be presented to humans in a meaningful way. The scientific experimentation and knowledge capture theme explores these issues, and also focuses on broadening participation in science through education. Finally, the increasing use of social networks allows us to study how people connect to each other, how information propagates in social systems, and how it is acquired and processed by humans. The social data science theme comprises such projects.
Below are the projects we expect to offer during summer 2022.
Communication and cybersecurity
Anycast Visualization (Supervisor: John Heidemann)
USC operates B-Root, one of the 13 DNS servers that operate at the root of the Internet Domain Name System (above, e.g., .com and .edu). B-Root has several sites around the world, and we are adding more. B-Root uses anycast to associate users with a nearby site, and Verfploeter (research done at USC/ISI and U. Twente) to map who goes where. Data from this mapping is currently rendered only as static plots. We would like to put it into a dynamic website with zooming. Working with Professor Heidemann and Yuri Pradkin, the student will add anycast data to the current website by writing conversion programs, and analyze the data to extract information about traffic and client distribution over B-Root sites. The visualization we will build off is at our outage site.
Finding and Fixing Persistent IPv6 Problems (Supervisor: John Heidemann)
IPv6 is the next big thing, and it now accounts for about a third of Google's traffic, but it still has some rough edges: network data shows loss rates are about 2x higher for IPv6 in some cases, perhaps due to persistent IPv6 routing problems. In this project the student will analyze existing IPv4 and IPv6 data taken from thousands of sites to see whether elevated IPv6 loss rates are random or persist at specific sites, and, if they are persistent, to diagnose the problems so they can be fixed.
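One way to make the random-vs-persistent distinction concrete is a simple per-site comparison across measurement rounds. The sketch below is purely illustrative: the data format, threshold, and function name are assumptions, not the project's actual dataset or method.

```python
# Hypothetical sketch: for each site, compare IPv4 and IPv6 loss rates across
# measurement rounds and flag sites where IPv6 loss is consistently higher,
# suggesting a persistent problem rather than random noise.

def persistent_v6_problems(measurements, min_rounds=3, ratio=2.0):
    """measurements: {site: [(v4_loss, v6_loss), ...]} per measurement round."""
    flagged = {}
    for site, rounds in measurements.items():
        worse = [v6 > ratio * v4 for v4, v6 in rounds]
        # Persistent = IPv6 loss exceeds the threshold in every observed round.
        if len(worse) >= min_rounds and all(worse):
            flagged[site] = sum(v6 for _, v6 in rounds) / len(rounds)
    return flagged

data = {
    "site-a": [(0.01, 0.05), (0.02, 0.06), (0.01, 0.04)],  # consistently worse v6
    "site-b": [(0.01, 0.01), (0.02, 0.05), (0.01, 0.01)],  # transient spike only
}
print(persistent_v6_problems(data))  # only site-a is flagged
```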
Understanding DNS Resolvers (Supervisor: Jelena Mirkovic)
Today's DNS resolvers exhibit a wide range of unexpected behaviors, including malformed queries, overly aggressive querying, or very sporadic querying. We would like to enumerate these patterns of misbehavior and understand their prevalence and possible causes. Students will analyze DNS traffic, identify unexpected behaviors using machine learning, and quantify the dominant behaviors.
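As a minimal first pass before any ML clustering, one might bucket resolvers by simple rate features computed from a query log. The field names, thresholds, and classes below are illustrative assumptions, not the project's actual pipeline.

```python
# Hypothetical sketch: compute per-resolver features from a DNS query log and
# assign coarse behavior classes (malformed / aggressive / sporadic / normal).

from collections import defaultdict

def classify_resolvers(log, aggressive_qps=50, sporadic_qps=0.01):
    """log: list of (timestamp_s, resolver_ip, is_malformed) tuples."""
    stats = defaultdict(lambda: {"n": 0, "bad": 0, "t0": None, "t1": None})
    for ts, ip, malformed in log:
        s = stats[ip]
        s["n"] += 1
        s["bad"] += malformed
        s["t0"] = ts if s["t0"] is None else min(s["t0"], ts)
        s["t1"] = ts if s["t1"] is None else max(s["t1"], ts)
    classes = {}
    for ip, s in stats.items():
        span = max(s["t1"] - s["t0"], 1)  # avoid division by zero
        qps = s["n"] / span
        if s["bad"] / s["n"] > 0.1:
            classes[ip] = "malformed"
        elif qps > aggressive_qps:
            classes[ip] = "aggressive"
        elif qps < sporadic_qps:
            classes[ip] = "sporadic"
        else:
            classes[ip] = "normal"
    return classes

log = [(0, "1.1.1.1", True), (1, "1.1.1.1", True), (2, "1.1.1.1", False),
       (0, "2.2.2.2", False), (30, "2.2.2.2", False)]
print(classify_resolvers(log))  # first resolver malformed, second normal
```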
Discovering Vulnerabilities in Embedded Code (Supervisor: Christophe Hauser)
With the advent of the Internet of Things (IoT), which largely relies on low-power devices, a number of standard abstractions and algorithms are often re-implemented by manufacturers as part of their firmware images. Unfortunately, such implementations often lack best practices and present an increased attack surface due to their ad-hoc designs. In this project, we will focus in particular on the problem of unsafe implementations that may lead to security vulnerabilities, and on means to automatically detect them in real-world software. Students will work on reviewing and reproducing existing vulnerabilities in an isolated, emulated environment and on defining programmatic models to characterize such vulnerabilities.
Cyber-physical Fuzzing Framework (Supervisor: Luis Garcia)
Fuzzing is a standard automated technique for testing software for unwanted or unexpected behavior against different software inputs. When we put software on cyber-physical systems, i.e., systems that interact with the physical world through sensors and actuators, fuzzing the software alone is not sufficient for triggering unwanted behaviors. The sense-to-actuate pipeline is vulnerable to cyber-physical side channels, such as sensor spoofing or false data injection. At the same time, we cannot simply subject the physical systems to arbitrary signal generators, as the systems could be damaged and the process would be costly. If we opt for simulation, modeling these behaviors requires capturing the physical dynamics of sensors and actuators with high fidelity, and the models must also capture how sensor and actuator signals interact with the software. Students will study existing techniques for capturing sensor and actuator dynamics in simulation, and will design and implement a prototype on an existing cyber-physical system testbed that incorporates these models.
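To illustrate the baseline software-only fuzzing the paragraph starts from, the toy sketch below feeds randomized sensor readings into an invented controller routine and records inputs that trigger exceptions. Real cyber-physical fuzzing would additionally model sensor and actuator dynamics; everything here is a hypothetical example.

```python
# Toy fuzzing sketch: random inputs against a hypothetical sense-to-actuate step.

import random

def controller(reading):
    """Hypothetical controller: convert a sensor reading to a command in [0, 1]."""
    if reading < 0:
        raise ValueError("negative reading")  # latent bug a fuzzer should find
    return min(reading / 100.0, 1.0)

def fuzz(target, trials=1000, seed=42):
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        x = rng.uniform(-1000, 1000)  # random input far outside nominal range
        try:
            target(x)
        except Exception:
            crashes.append(x)
    return crashes

crashes = fuzz(controller)
print(f"{len(crashes)} crashing inputs found")
```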
Automating the Generation of Software Knowledge Bases for Reverse Engineering Embedded Systems (Supervisor: Luis Garcia)
In today's world, we increasingly incorporate untrusted, third-party embedded systems into our Internet of Things. Reverse engineering is a fundamental skill for assessing the security of third-party system software. However, the raw binary files are meaningless without understanding the underlying hardware, such as a microprocessor's architecture and memory layout. Fortunately, manufacturers often use chips from other Original Equipment Manufacturers (OEMs) whose datasheets are publicly available online. However, extracting the information from these datasheets is a tedious process that requires manually entering information into reverse engineering tools. Students will study current techniques for extracting software knowledge bases from datasheets and design and implement a prototype that directly incorporates these knowledge bases into state-of-the-art reverse engineering tools.
Modeling Human Operators in Safety-critical Industrial Control Systems (Supervisor: Luis Garcia and Dave DeAngelis)
The industrial control systems (ICS) that our society relies on are built on a complex, interconnected arrangement of hardware, software, networks, people, and data, all operating in the physical world. Existing cyber-defense measures to protect ICSes have regularly proven inadequate as threats from nation-state actors and other advanced adversaries rise, crippling safety-critical applications, endangering lives, and causing tens of billions of dollars' worth of damage. The capabilities of these systems increase as they become more interconnected, autonomous, and remotely administered. With these new capabilities, the complexity of ICSes is growing rapidly. Industries need more sophisticated automated surveillance and remediation of these complex, autonomous industrial control systems. We propose to develop a new method for generating and monitoring *cyber-physical invariants* in industrial control systems and automatically taking remedial actions in order to advance safety and security.
In this project, the students will focus on a subset of this problem: how can we model human interaction in complex industrial control systems? Recent history has shown that human operators can act either as a last line of defense against intrusions or as the source of the damage themselves, via human error or insider threats. Thus, the students will build on existing research that models human behavior relative to security protocols and cyber-physical safety models.
Detecting Bias in College Football Recruiting (Supervisor: Jeremy Abramson)
College football recruiting is big business. This project aims to determine whether there are biases in whom college football coaches recruit and how they recruit them. By creating a comprehensive data set of college recruits and integrating it with current socioeconomic markers (e.g., census data), we hope to determine whether there are patterns and biases in recruiting: who is recruited (in terms of socioeconomic and ethnographic markers) and where coaches recruit players (in terms of geography), independent of talent.
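The data-integration step described above can be sketched as a simple join of recruit records to census markers. The field names, ZIP codes, and income cutoff below are invented for illustration only.

```python
# Hypothetical sketch: join recruit records to census socioeconomic markers
# by ZIP code, then count recruits per income bracket.

recruits = [
    {"name": "A", "zip": "90001", "stars": 4},
    {"name": "B", "zip": "90001", "stars": 3},
    {"name": "C", "zip": "10001", "stars": 4},
]
census = {"90001": {"median_income": 45000}, "10001": {"median_income": 90000}}

def recruits_by_income(recruits, census, cutoff=60000):
    counts = {"below": 0, "above": 0}
    for r in recruits:
        info = census.get(r["zip"])
        if info is None:
            continue  # no census match for this ZIP
        bracket = "below" if info["median_income"] < cutoff else "above"
        counts[bracket] += 1
    return counts

print(recruits_by_income(recruits, census))  # {'below': 2, 'above': 1}
```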
Scientific Experimentation and Knowledge Capture
Open-world Game Theory (Supervisor: Mayank Kejriwal)
Open-world learning has taken on new importance in recent years as AI systems continue to be applied and transitioned to real-world settings where unexpected events (‘novelties’) can, and do, occur. The situation becomes even more complex when multiple agents are involved, each trying to selfishly maximize their gains (such as winning a game). Game theory is one way to model such agent interactions, including competition and cooperation. It can also be used to rigorously model the informational complexity of certain environments, including agent payoffs, decision matrices and incomplete information. Using game theory to model the open world is an interesting and open challenge with potential for broad impacts. The student will work with Prof. Kejriwal and his graduate students to develop both theoretical and empirical methodologies in support of open-world game theory.
Knowledge Acquisition with Indirect Supervision (Supervisor: Muhao Chen)
Knowledge acquisition (e.g., relation extraction, entity and event typing) faces challenges including extreme label spaces, few-shot/zero-shot prediction, and out-of-domain prediction. To this end, we study methods for leveraging indirect supervision signals from auxiliary tasks (e.g., natural language inference, text summarization, etc.) to foster robust and generalizable inference for knowledge acquisition. In the same context, we study methods for generating semantically rich label representations based on either gloss knowledge or structural knowledge from a well-populated lexical knowledge base, in order to better support learning with limited labels.
Event-Centric Natural Language Processing (Supervisor: Muhao Chen)
Human languages evolve to communicate about events happening in the real world. Therefore, understanding events plays a critical role in natural language understanding (NLU). A key challenge to this mission lies in the fact that events are not just simple, standalone predicates. Rather, they are often described at different granularities, temporally form event processes, and are directed by specific central goals in a context. Our research in this line helps machines understand events described in natural language. This includes understanding how events are connected, form processes, or form complex structures, and recognizing typical properties of events (e.g., space, time, salience, essentiality, implicitness, membership, etc.).
Robust Information Extraction from Human Language Text (Supervisor: Muhao Chen)
Knowledge graphs (KGs) provide both open-world and domain-specific knowledge representations that are integral to many AI systems. However, constructing KGs is usually very costly and requires extensive effort. A widely attempted solution is to learn knowledge acquisition models that automatically induce structured knowledge from unstructured text. However, such models developed through data-driven machine learning are usually fragile to noise in learning resources, and may fall short of providing reliable inference on large, heterogeneous real-world data. We are developing a general meta-learning framework that seeks to systematically improve the robustness of learning and inference for data-driven knowledge acquisition models. We seek to solve several key problems to accomplish this goal: (i) how to identify incorrect training labels and prevent overfitting on noisy labels; (ii) how to detect invalid input instances at inference time (e.g., out-of-distribution ones) and provide abstention awareness; (iii) how to automatically learn constraints that strengthen model inference with global consistency; and (iv) how to automatically augment training signals for the knowledge acquisition model or the backbone language model.
Machine Commonsense Reasoning with Minimal Supervision (Supervisor: Muhao Chen)
Various types of commonsense inference tasks challenge state-of-the-art language models. Such tasks may include inferring preconditions of facts, typical properties of entities and events (e.g., time, scales, and numerical properties), and typical relations (e.g., ordering and membership of events, topological relations of entities). While annotating data for these aspects of commonsense inference can be costly, we seek to minimize reliance on expensive annotations and instead develop linguistic pattern mining techniques to find vast amounts of cheap (though admittedly noisy) supervision data on the Web, leading towards a scalable and generalizable solution that improves commonsense inference through distant supervision.
Generation of a Sports-based Introductory Data Science Curriculum to Increase Participation of Underrepresented Groups in STEM (Supervisor: Jeremy Abramson)
As the requirements for success in the workforce become increasingly technical, there is a commensurate need for curricula that can engage and capture the imagination of students, especially those from groups traditionally underrepresented in STEM. One way to reach these groups is via curricula that appeal to contexts with which they are familiar and engaged, such as sports. To that end, this project will explore the development of a sports-based introductory data science curriculum with the goal of engaging students who might otherwise not be interested in pursuing data science as a career. Students will work on generating illustrative code examples and problem sets in Python using sports examples.
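One possible flavor of such a problem set is a beginner exercise computing basketball points per game. The player data below is invented; the point is the pedagogical framing, not the numbers.

```python
# Beginner-level sports exercise: average points per game for each player.

games = [
    {"player": "Jordan", "points": 32},
    {"player": "Jordan", "points": 28},
    {"player": "Pippen", "points": 18},
]

def points_per_game(games):
    totals, counts = {}, {}
    for g in games:
        totals[g["player"]] = totals.get(g["player"], 0) + g["points"]
        counts[g["player"]] = counts.get(g["player"], 0) + 1
    return {p: totals[p] / counts[p] for p in totals}

print(points_per_game(games))  # {'Jordan': 30.0, 'Pippen': 18.0}
```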
Long form dialogue bots (Supervisor: Genevieve Bartlett)
Most state-of-the-art dialogue engines are designed to manage chat bots and work over short-form communication, where both parties use short phrases or single sentences to continue a back-and-forth conversation. These dialogue engines are not geared towards carrying on longer-form communication such as that found in an email thread. This project seeks to piece together existing tools to extend dialogue capabilities to long-form communication, drawing from AI storytelling and existing chat-based bots combined with a rule-based system. Students will research and interact with existing tools, annotate data, and participate in creating surveys to evaluate rules and tool combinations.
Are STEM Faculty Members Diverse Enough? (Supervisor: Mohammad Rostami)
Diversifying STEM faculty has been a major concern of many universities in recent years. It is a widespread belief that affirmative action is being used to diversify STEM faculty, in particular in terms of gender. However, it is not clear whether this belief reflects reality. In this project, students will address the following questions: (1) are STEM faculty members diverse? (2) do we observe any diversification trend reflected in recent hires? (3) what policies may help make more progress towards this goal?
The ideal candidate is a strongly motivated student with a good understanding of data science methods to analyze collected data. Python coding fluency and previous experience with data collection are preferred.
Is SSIM a Fair Image Quality Assessment Metric? (Supervisor: Mohammad Rostami)
Structural similarity (SSIM) was proposed in 2004 as a method to measure image quality, and it has since become a very popular metric (over 28,000 citations), often used instead of more traditional metrics such as PSNR. The goal of this project is to study whether SSIM is fair in measuring image quality and, if its measurements are unfair, what can be done to make SSIM a fairer metric.
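To make the metric under study concrete, here is a simplified, single-window SSIM computation; the published metric applies the same formula over local sliding windows with Gaussian weighting. The constants follow the 2004 paper's defaults for 8-bit images (K1 = 0.01, K2 = 0.03, L = 255); the sample pixel values are arbitrary.

```python
# Simplified global SSIM: luminance, contrast, and structure terms combined
# into SSIM(x, y) = ((2*mx*my + C1)(2*cov + C2)) / ((mx^2 + my^2 + C1)(vx + vy + C2)).

def ssim_global(x, y, L=255, K1=0.01, K2=0.03):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2)
    )

img = [52, 55, 61, 59, 79, 61, 76, 61]
print(ssim_global(img, img))                    # 1.0: identical images
print(ssim_global(img, [v + 20 for v in img]))  # < 1.0: brightness shift
```

Note that a uniform brightness shift preserves structure perfectly yet still lowers SSIM through the luminance term, which is one example of a behavior whose fairness across image content could be examined.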
The ideal candidate is a strongly motivated student with a good understanding of image processing and machine learning. Python and MATLAB coding fluency is strongly preferred.
Record Breaking Audio Reconstruction (Supervisor: Keith Burghardt)
Significant efforts have been made to digitize historic audio records, but their fidelity is greatly limited by the quality of wax or vinyl as well as its inevitable degradation. These important recordings are therefore exceedingly difficult to appreciate or even understand. Despite the motivation to reconstruct audio, the field is under-explored compared to image and video reconstruction, which is used in everything from colorization to movie remastering. In this project, we will improve upon the state of the art with transformer-based audio reconstruction. We will then create easy-to-use, general-purpose tools for historians and even laypeople to listen to otherwise heavily degraded historic recordings. This could also be useful, for example, for automating speech-to-text for noisy signals and other broader AI tools.
Social Data Science
Social Graph Analysis and Attribution of Software Exploit Contributors Using GitHub (Supervisor: Jeremy Abramson)
Attribution of threat actors is an increasingly important and difficult problem. One potential mitigation is the early detection of potential threat actors via analysis of open-source intelligence (OSINT). This project will analyze the social graph of users who contribute to, follow, star, and otherwise interact with proof-of-concept CVE implementations and other relevant, potentially malicious (e.g., software vulnerability) repositories. These social graphs will be analyzed to see whether potential "black hat" threat actors have networks that differ from those of their "white hat" counterparts. If successful, such a project could help speed the discovery of dangerous threat actors and aid in linking threat actor personas on the internet.
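A minimal sketch of the feature-extraction step: given repo-interaction edges, build each user's co-contributor neighborhood and compute simple features (degree, fraction of interactions with exploit-tagged repos) that could later feed a comparison of the two populations. The edge data and repo tags are invented for illustration.

```python
# Hypothetical social-graph feature extraction from (user, repo) interactions.

from collections import defaultdict

def user_features(edges, exploit_repos):
    """edges: iterable of (user, repo) interaction pairs."""
    repos_by_user = defaultdict(set)
    users_by_repo = defaultdict(set)
    for user, repo in edges:
        repos_by_user[user].add(repo)
        users_by_repo[repo].add(user)
    features = {}
    for user, repos in repos_by_user.items():
        neighbors = set()
        for repo in repos:
            neighbors |= users_by_repo[repo]  # co-interactors on shared repos
        neighbors.discard(user)
        exploit_frac = len(repos & exploit_repos) / len(repos)
        features[user] = {"degree": len(neighbors), "exploit_frac": exploit_frac}
    return features

edges = [("alice", "cve-poc"), ("bob", "cve-poc"), ("bob", "webapp")]
print(user_features(edges, {"cve-poc"}))
```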
Understanding Echo Chambers and Extremism in Social Media (Supervisor: Keith Burghardt and Goran Muric)
Fringe and potentially harmful narratives pertaining to racism, conspiracy theories, or anti-science sentiment are present within a large and recently active minority of users on the web. Many theories exist to explain the rise in extremism, but past research has not fully explored the impact online echo chambers may have on this rise. In this project, we will explore the impact of users joining extremist groups on social media, with Reddit as a real-world testbed. Understanding this effect in realistic settings, however, can be a challenge because those who join extremist groups are unlike the population at large. We will therefore apply methods from the field of causality to mitigate this difficulty and study the effect on users of joining a variety of fringe groups, from hate speech to conspiracy to anti-science and anti-vaccine groups.
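One of the simplest causal-inference tools alluded to above is matching: pair each user who joined a fringe group ("treated") with the most similar non-joiner on pre-join covariates, then compare outcomes within pairs. The covariate, outcome, and data below are invented; real work would use propensity scores and many more features.

```python
# Hypothetical nearest-neighbor matching sketch for an observational comparison.

def matched_effect(treated, control, covariate="activity", outcome="toxicity"):
    diffs = []
    for t in treated:
        # Match each treated user to the closest control on the pre-treatment covariate.
        c = min(control, key=lambda u: abs(u[covariate] - t[covariate]))
        diffs.append(t[outcome] - c[outcome])
    return sum(diffs) / len(diffs)

treated = [{"activity": 10, "toxicity": 0.6}, {"activity": 50, "toxicity": 0.8}]
control = [{"activity": 12, "toxicity": 0.2}, {"activity": 48, "toxicity": 0.3}]
print(matched_effect(treated, control))  # average within-pair outcome difference
```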
Bias Awareness in your News Diet (Supervisor: Jelena Mirkovic, Fred Morstatter)
Tackling fake news, misleading news, and misinformation is of utmost importance in today's digital landscape. Unfortunately, research has shown that simply showing content from the opposing viewpoint only hardens one's preconceived views. In this project, students will explore the best ways to inform people of the types and extent of bias present in news articles, and they will build a browser plugin to automatically inform users.
Social Media Analysis (Supervisor: Emilio Ferrara)
Social media have become pervasive tools for planetary-scale communication and now play a central role in our society: from political discussion to social issues, from entertainment to business, such platforms shape the real-time worldwide conversation. But with new technologies also comes abuse: social media have been used for malicious activities, including public opinion manipulation, propaganda, and coordination of cyberattacks. The selected candidate will work on projects related to social media analysis, in particular studying the behavior of social media users and the dynamics of use and abuse of social platforms for a variety of purposes, including the spread of fake news and social bots.
The ideal candidate is a computationally-minded and strongly motivated student with a clear understanding of social networks, machine learning, and data science methods and applications. Python coding fluency and previous experience with social media mining are preferred.