NSF Research Experiences for Undergraduates (REU) Site:
Our REU site participants will be engaged in research that spans three themes under the umbrella of Human Communication in a Connected World. These themes are: (1) communication and cybersecurity, (2) scientific experimentation and knowledge capture, and (3) social data science.
These themes are emerging today because of the increased use of the Internet in daily life. The abundance of applications and the ever-increasing traffic volume cannot be sustained by the current infrastructure. Today's businesses and critical infrastructure rely on the Internet more and more; this makes them attractive targets for cyberattacks. Young people are also increasingly using the Internet for learning and communication, putting themselves at risk of cyberbullying, cyber-predators, misinformation, etc. The communication and cybersecurity theme explores solutions to these problems. Further, much information is automatically collected today by many digital devices, and much information is published online by various sources. This volume makes it challenging to make sense of the information and to detect the important pieces that can be presented to humans in a meaningful way. The scientific experimentation and knowledge capture theme explores these issues, and also focuses on broadening participation in science through education. Finally, the increasing use of social networks allows us to study how people connect to each other, how information propagates in social systems, and how it is acquired and processed by humans. The social data science theme comprises such projects.
Below are the projects that we expect to offer during summer 2021.
Communication and cybersecurity
Anycast Visualization (Supervisor: John Heidemann)
USC operates B-Root, one of the 13 DNS servers that operate at the root of the Internet Domain Name System (above, e.g., .com and .edu). B-Root has several sites around the world, and we are adding more. B-Root uses anycast to associate users with a nearby site, and Verfploeter (research done at USC/ISI and U. Twente) to map who goes where. Data from this mapping is currently rendered only as static plots. We would like to put it into a dynamic website with zooming. Working with Professor Heidemann and Yuri Pradkin, the student will add anycast data to the current website by writing conversion programs, and will analyze it to extract information about traffic and client distribution over B-Root sites. The visualization we will build on is at our outage site.
Covid-19 Change Visualization (Supervisor: John Heidemann)
We have been visualizing Internet outages at https://outage.ant.isi.edu/ with a world map. This data reflects events like hurricanes (see https://ant.isi.edu/url/harvey2017 for what happened in Texas in 2017, for example). Recently we have begun looking for Internet changes related to Covid-19 work-from-home. For this project, you will (1) work with a graduate student and staff person to get Covid-19 data into this visualization, (2) work on "drill down", where clicking on a Covid-19 event can expand to show a report on what network changed, and (3) extend drill down to work on "regular" Internet outages.
Understanding DNS Resolvers (Supervisor: Jelena Mirkovic)
Today's DNS resolvers exhibit a wide range of unexpected behaviors, including malformed queries, overly aggressive querying, and very sporadic querying. We would like to enumerate these patterns of misbehavior and understand their prevalence and possible causes. Students will work on analyzing DNS traffic and identifying unexpected behaviors using ML, then quantifying the dominant behaviors.
Novel Insights for DDoS Detection (Supervisor: Christophe Hauser)
(Distributed) Denial of Service (DDoS) attacks occur when multiple malicious hosts coordinate to flood the resources of a target system, making it unreachable by legitimate users. A particular subclass of such attacks, known as "low-rate" attacks, exploits algorithmic weaknesses in the target software. As a result, simple inputs are enough to render a service unavailable. We will focus on developing runtime mechanisms that identify such attacks at the host level by relying on performance profiling measurements. Such measurements will be used as input to a machine learning model in order to characterize the legitimacy of complex queries.
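As a toy illustration of the "low-rate" idea, the sketch below counts the work a deliberately naive routine performs on two inputs of the same size. The routine (insertion sort), the comparison counter standing in for performance profiling, and the fixed per-item budget standing in for a learned model are all assumptions made for illustration; none of them is taken from the project itself.

```python
def insertion_sort_cost(values: list[int]) -> int:
    """Sort a copy of `values` and return the number of comparisons made."""
    items = list(values)
    comparisons = 0
    for i in range(1, len(items)):
        j = i
        while j > 0:
            comparisons += 1
            if items[j - 1] <= items[j]:
                break  # element is already in place
            items[j - 1], items[j] = items[j], items[j - 1]
            j -= 1
    return comparisons


def looks_abusive(values: list[int], budget_per_item: int = 8) -> bool:
    """Flag a request whose measured cost exceeds a per-item budget.

    The budget is an arbitrary, hypothetical threshold.
    """
    return insertion_sort_cost(values) > budget_per_item * len(values)


if __name__ == "__main__":
    benign = list(range(100))         # already sorted: cheap
    crafted = list(range(100, 0, -1)) # reversed: worst case
    print(insertion_sort_cost(benign), looks_abusive(benign))
    print(insertion_sort_cost(crafted), looks_abusive(crafted))
```

Both requests have the same size, so a volume-based detector would not tell them apart; only the measured cost does.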
Stateless Password Manager (Supervisor: Christophe Hauser, Jelena Mirkovic)
This project aims to help students familiarize themselves with notions of applied cryptography by solving a practical problem. Most password managers store an encrypted copy of the users' passwords, which is protected by a master key. This requires access to storage and some degree of trust in a third party. We propose to study an alternative approach where no storage is required, and where passwords can easily be re-computed on the fly by users in a stateless manner. The participant will leverage standard cryptographic schemes in a novel way, combining secure practices with ease of use, and will study cognitive aspects in order to identify and optimize the trade-offs between security and usability.
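One way to picture the stateless approach is a key-derivation function that maps a master secret and a site name to the same password every time. The sketch below is only an assumption about what such a scheme could look like: the function name, the use of PBKDF2, the iteration count, and the rotation counter are illustrative choices, not the design the project will adopt.

```python
import base64
import hashlib


def derive_password(master_secret: str, site: str, counter: int = 1, length: int = 16) -> str:
    """Recompute a site password from the master secret; nothing is stored."""
    # The salt binds the output to one site and one rotation counter.
    salt = f"{site}:{counter}".encode("utf-8")
    # PBKDF2-HMAC-SHA256 is deliberately slow, which raises the cost of
    # guessing the master secret from a leaked site password.
    raw = hashlib.pbkdf2_hmac("sha256", master_secret.encode("utf-8"), salt, 200_000, dklen=32)
    # URL-safe Base64 yields printable characters; truncate to the requested length.
    return base64.urlsafe_b64encode(raw).decode("ascii")[:length]


if __name__ == "__main__":
    print(derive_password("correct horse battery staple", "example.org"))
```

The trade-offs mentioned above show up directly: the user must remember the counter when a password is rotated, and site-specific password rules (required symbols, maximum length) are not handled here.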
Practical Searchable Encryption (Supervisor: Christophe Hauser)
Encryption is one of the cornerstones of secure and private communication. Yet, as of today, most email messages are exchanged in clear-text, at best over encrypted channels (e.g., OpenSSL), in a client-to-server setting. Solutions for end-to-end encryption, such as GNU Privacy Guard (GPG), allow for private and secure communications, but unfortunately suffer from usability issues which hinder their adoption. One of these issues is the lack of full-text search over encrypted emails. The participant will study and implement a full-text-search solution over encrypted emails, based on standard cryptographic schemes, and suitable for both local and remote (i.e., cloud-based) storage, while preserving strong security properties.
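To make the search problem concrete, here is a minimal sketch of one simple technique: a keyword index whose entries are keyed HMAC tokens rather than plaintext words. It is an assumption chosen for illustration, not the project's method; the function names are hypothetical, the message bodies themselves would be encrypted separately, and the scheme only supports exact word matches while revealing which messages share a word.

```python
import hashlib
import hmac
import re
from collections import defaultdict


def token(key: bytes, word: str) -> str:
    """Keyed, deterministic token for one word (case-insensitive)."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, messages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word token to the ids of the messages containing that word."""
    index: dict[str, set[str]] = defaultdict(set)
    for msg_id, plaintext in messages.items():
        for word in set(re.findall(r"[a-z0-9]+", plaintext.lower())):
            index[token(key, word)].add(msg_id)
    return dict(index)


def search(key: bytes, index: dict[str, set[str]], word: str) -> set[str]:
    """Return the ids of messages containing `word`; requires the same key."""
    return index.get(token(key, word), set())


if __name__ == "__main__":
    key = b"k" * 32  # placeholder key for the demo only
    index = build_index(key, {"m1": "Budget meeting on Friday", "m2": "Friday lunch?"})
    print(sorted(search(key, index, "friday")))
```

The index can be stored remotely because it contains no plaintext words, but whoever holds it still learns access patterns, which is one of the security properties a real design would have to weigh.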
Discovering Vulnerabilities in Embedded Code (Supervisor: Christophe Hauser)
With the advent of the Internet of Things (IoT), largely making use of low-power devices, a number of standard abstractions and algorithms are often re-implemented by manufacturers as part of their firmware images. Unfortunately, such implementations often lack best practices and present an increased attack surface due to their ad-hoc designs. In this project, we will focus in particular on the problem of unsafe implementations which may lead to security vulnerabilities, and means to automatically detect those in real-world software. Students will work on review and reproduction of existing vulnerabilities in an isolated, emulated environment and on defining programmatic models to characterize such vulnerabilities.
Software Isolation for Desktop Environments (Supervisor: Christophe Hauser)
Despite the numerous security mechanisms present in modern systems, the security of software in general remains imperfect, leaving the door open to potential attacks. When a piece of software such as a network service is compromised by an attacker, the severity of the resulting attack varies depending on the amount of control that the attacker gains over the overall system. A solution to mitigate successful attacks is software isolation: by isolating software components into "compartments" with limited privilege, it is possible to prevent attackers from accessing sensitive information or taking control over critical parts of a system. In this project, we will focus on applying this paradigm to desktop environments and applications. An example scenario in this context is to limit the attack surface of browser or email attacks in a commodity desktop environment, where all applications traditionally run as the same user. Students will study the isolation capabilities currently available in Linux, and design and implement a "lightweight" prototype that adds little performance overhead to the system.
Cyber-physical Fuzzing Framework (Supervisor: Luis Garcia)
Fuzzing is a standard automated technique for testing software for unwanted or unexpected behavior against different software inputs. When we put software on cyber-physical systems, i.e., systems that interact with the physical world through sensors and actuators, fuzzing the software is not sufficient for triggering unwanted behaviors. The sense-to-actuate pipeline is vulnerable to cyber-physical side-channels, such as sensor spoofing or false data injection. Conversely, we cannot just subject the physical systems to various signal generators as the systems would be subject to damage and could be costly. If we opt for simulation, modeling these behaviors would require capturing the sensor and actuator physical dynamics with high fidelity. The models would also need to understand how the sensor and actuator signals interact with the software. Students will study existing techniques to capture sensor and actuator dynamics in simulation and design and implement a prototype on an existing cyber-physical system testbed that can incorporate these models.
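For readers new to the topic, the following sketch shows the software-only fuzzing that the description above takes as its starting point: a mutation fuzzer that perturbs a valid input and records which variants make the target raise an exception. The target `parse_reading` is a hypothetical, deliberately buggy parser invented for this example; it does not model any sensor or actuator dynamics.

```python
import random


def parse_reading(data: bytes) -> int:
    """Hypothetical parser: first byte is the payload length, the rest is payload."""
    length = data[0]
    payload = data[1:1 + length]
    # Bug: trusts the declared length instead of checking the actual payload size.
    return payload[length - 1]


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    data = bytearray(seed)
    data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)


def fuzz(target, seed: bytes, iterations: int = 500, rng_seed: int = 0) -> list[bytes]:
    """Return the mutated inputs that made `target` raise an exception."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target(candidate)
        except Exception:
            crashes.append(candidate)
    return crashes


if __name__ == "__main__":
    found = fuzz(parse_reading, b"\x03abc")
    print(f"{len(found)} crashing inputs found")
```

Every crash here comes from corrupting the length byte. Nothing in this loop knows how a spoofed sensor signal would reach the parser, which is the gap the project addresses.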
Automating the Generation of Software Knowledge Bases for Reverse Engineering Embedded Systems (Supervisor: Luis Garcia)
In today's world, we increasingly incorporate untrusted, third-party embedded systems into our Internet of Things. Reverse engineering is a fundamental skill for assessing the security of third-party system software. However, the raw binary files are meaningless without an understanding of the underlying hardware, such as a microprocessor's architecture and memory layout. Fortunately, manufacturers often use chips from other Original Equipment Manufacturers (OEMs) whose datasheets are publicly available online. However, extracting the information from these datasheets is mostly a tedious process of manually entering information into reverse engineering tools. Students will study current techniques for extracting software knowledge bases from datasheets and design and implement a prototype that directly incorporates these knowledge bases into state-of-the-art reverse engineering tools.
Detecting Bias in College Football Recruiting (Supervisor: Jeremy Abramson)
College football recruiting is big business. This project aims to determine if there are biases in whom college football coaches recruit and how. By creating a comprehensive data set of college recruits and integrating relevant data with current socioeconomic markers (e.g., census data), we hope to determine if there are patterns and biases in whom (in terms of socio/ethnographic markers) and where (in terms of geography) football coaches recruit their players, regardless of talent.
Textual, Structural and Semantic Analysis of Phishing Datasets (Supervisor: Jeremy Abramson)
Phishing attacks – both specifically and broadly targeted – are an increasingly dangerous vector for malice. Because of the textual and semantic similarities between potentially malicious and benign emails, detection of subtle phishing attacks can be difficult. This project aims to provide a high-level textual and structural analysis of different phishing datasets to determine what features in a conversational chain may be useful in increasing detection of phishing attacks. Students will work on textual extraction of features (intent, sentiment, tone, etc.) and analysis of externally verifiable content (company affiliation, etc.).
Scientific Experimentation and Knowledge Capture
Generation of a Sports-based Introductory Data Science Curriculum to Increase Participation of Underrepresented Groups in STEM (Supervisor: Jeremy Abramson)
As the requirements for success in the workforce become increasingly technical, there is a commensurate need for curricula that can engage and capture the imagination of students, especially those from traditionally underrepresented groups in STEM. One way to reach these groups is via curricula that appeal to contexts with which they are familiar and engaged, such as sports. To that end, this project will explore the development of a sports-based introductory data science curriculum with the goal of engaging students who might otherwise not be interested in pursuing data science as a career. Students will work on generating illustrative code examples and problem sets in Python using sports examples.
Disparities in Educational Achievement (Supervisor: Kristina Lerman)
The project will analyze socio-economic data collected from the US Census together with data on US colleges and universities to identify correlates of positive educational outcomes. Of specific interest will be assessing how economic inequalities and racial disparities affect educational achievement in different regions of the US. The project will use tools developed at ISI (S3D, causal inference, fairness methods) for a deep dive into the underlying relationships in the data.
Integration of Frame Semantics to Cyber Ontologies (Supervisor: Jeremy Abramson)
Cyber ontologies such as STIX and ATT&CK can represent complex relationships between cyber threat actors, attacks and infrastructure. While such representations are conducive to interoperability between systems, they are often unwieldy for human cyber analysts to deal with directly. Conversely, Natural language generation (NLG) frameworks like FrameNet represent language in a structured manner, but frame specifications are often not specific enough for specialized domains (such as cyber security). Leveraging and combining the semantic structure of both forms can create a tool that can translate cyber threat data in standard interoperable formats (such as STIX) to human-readable reports, via existing NLG frameworks. Working on a project such as this provides an opportunity for significant impact, as the fusion of these two structures could greatly increase both the adoption and the utility of cyber threat ontologies.
Long form dialogue bots (Supervisor: Genevieve Bartlett)
Most state-of-the-art dialogue engines are designed to manage chat bots and work over short-form communication where both parties use short phrases or single sentences to continue a back-and-forth conversation. These dialogue engines are not geared towards carrying on longer-form communication such as found in an email thread. This project seeks to piece together existing tools to extend dialogue capabilities to long-form by drawing from AI story telling and existing chat-based bots combined with a rule-based system. Students will research and interact with existing tools, annotate data and participate in creating surveys to evaluate rules and tool combinations.
Building a Simulation-Driven Pedagogic Activity for Cyberinfrastructure Learning (Supervisor: Rafael Ferreira da Silva)
Search Engine for Scientific Software (Supervisor: Daniel Garijo)
An increasing number of publications use and produce scientific software to perform experiments, clean data or visualize results. However, despite recent efforts for standardizing scientific software metadata, finding a software repository that is appropriate for one's needs is still challenging. In this project, the student will develop a semantic search engine that will harvest existing software metadata and will index it in a knowledge graph.
Towards Automated Understanding of Scientific Software (Supervisor: Daniel Garijo)
Scientific software is crucial for reproducibility of scientific results. Nowadays it's common for authors to link code repositories in their scientific publications, but the effort and time needed to understand, set up and reuse those code repositories by others is significant. In this project, the student will explore different approaches to reduce the time needed to understand and set up existing software by using different AI techniques.
What Does Your Program Do? Using Non-Supervised Methods for Scientific Software Classification (Supervisor: Daniel Garijo)
The number of scientific products, including scientific software, has been growing steadily in recent years. This growth makes it difficult for researchers to understand all the latest code and publications available. A large body of research has attempted to classify similar papers and literature. However, to date there are no good approaches for finding similar or related code. In this project, the student will analyze different unsupervised methods to find scientific software similarities.
Are STEM Faculty Members Diverse Enough? (Supervisor: Mohammad Rostami)
Diversifying STEM faculty seems to have been a major concern of many universities in recent years. It is a widespread belief that affirmative action is being used to diversify STEM faculty members, in particular in terms of gender. However, it is not clear whether this belief reflects reality. In this project, students will address the following questions: 1) are STEM faculty members diverse? 2) do we observe any diversification trend reflected in recent hires? 3) what policies may help to make more progress towards this goal?
The ideal candidate is a strongly motivated student with a good understanding of data science methods to analyze collected data. Python coding fluency and previous experience with data collection are preferred.
Is SSIM a Fair Image Quality Assessment Metric? (Supervisor: Mohammad Rostami)
Structural similarity (SSIM) was proposed in 2004 as a method to measure image quality, and since then it has become a very popular metric (28,000+ citations), often used instead of more traditional metrics such as PSNR. The goal of this project is to study whether SSIM is fair in measuring image quality and, if it is not, what can be done to make it a fairer metric.
The ideal candidate is a strongly motivated student with a good understanding of image processing and machine learning. Python and MATLAB coding fluency is strongly preferred.
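For orientation on the SSIM project above, the sketch below computes the SSIM index for a single pair of patches from the standard definition, with the usual stabilizing constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L. It is a simplified illustration under stated assumptions: patches are flat lists of pixel values, population (divide-by-n) statistics are used, and the whole patch is one window, whereas the metric is normally averaged over many local, weighted windows.

```python
def ssim_patch(x: list[float], y: list[float], dynamic_range: float = 255.0) -> float:
    """SSIM of two equally sized patches given as flat lists of pixel values."""
    if len(x) != len(y) or not x:
        raise ValueError("patches must be non-empty and the same size")
    n = len(x)
    # Small constants that keep the ratio stable when means or variances are near zero.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Luminance term times combined contrast/structure term.
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


if __name__ == "__main__":
    patch = [0.0, 255.0, 0.0, 255.0]
    inverted = [255.0, 0.0, 255.0, 0.0]
    print(ssim_patch(patch, patch), ssim_patch(patch, inverted))
```

An identical pair scores 1 (up to rounding), while the inverted patch scores below zero because the covariance term is negative.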
Social Data Science
Social Network Expansion: Construction of a human-subject spearphishing experiment (Supervisor: Jeremy Abramson)
Social Network Expansion (SNE) aims to explore the relationship between various factors of “cost” in creating social networking personas, and these personas’ efficacy in connecting and interacting with a target populace. A more complete understanding of this relationship between required adversarial complexity/resources and connection/interaction efficacy will enhance our ability to detect and mitigate a number of threats, including (but not limited to) spearphishing, persona hijacking and the spread of fake news.
Social Graph Analysis and Attribution of Software Exploit Contributors Using GitHub (Supervisor: Jeremy Abramson)
Attribution of threat actors is an increasingly important and difficult problem. One potential mitigation is the early detection of potential threat actors via analysis of open-source intelligence (OSINT). This project will analyze the social graph of users who contribute to, follow, star, and otherwise interact with proof-of-concept CVE implementations and other relevant potentially malicious (e.g. software vulnerability) repositories. These social graphs will be analyzed to see if potential “black hat” threat actors have networks that differ from their “white hat” counterparts. If successful, such a project could help speed the discovery of dangerous threat actors, as well as aiding in linking threat actor personas on the internet.
Studying Anti-Science Vulnerabilities (Supervisor: Keith Burghardt and Goran Muric)
Anti-science attitudes are present within a large and recently active minority. From anti-climate to anti-vaccination activists, anti-science poses a significant threat to our nation's environment and health. The goal of this project is to monitor who is susceptible to anti-science sentiment online and build an anti-science vulnerability score for users in social media. Namely, this score aims to explore heterogeneous data and determine users who may develop anti-science sentiment in the future. We will use this model to explore (a) causal models based on propensity score matching, and (b) potential policies to improve pro-science sentiment.
Identifying and Characterizing Online Stereotypes (Supervisor: Fred Morstatter)
Stereotypes are widely held views about a type of person or thing. Can we infer a person's stereotypes from their online behavior? Students will be involved in analyzing public posts in online forums and social media and designing algorithms to infer the poster's stereotypes.
Preventing Gaming of Machine Models (Supervisor: Fred Morstatter)
Some features provide superior predictive performance for machine learning problems. When selecting these features, it is important that we also consider their ability to be "gamed" by unscrupulous people wishing to bypass detection by the model. Can we automatically identify features that are susceptible to gaming?
Characterizing Malicious YouTube Videos (Supervisor: Jelena Mirkovic, Fred Morstatter)
YouTube lacks granular rating for content, which makes it difficult to set meaningful parental controls and automatically categorize these videos. In this work, we aim to build a tool that can generate parental controls from the content of the video, with special care towards videos that contain content inappropriate or disturbing for children.
Bias Awareness in your News Diet (Supervisor: Jelena Mirkovic, Fred Morstatter)
Tackling fake news, misleading news, and misinformation is of utmost importance in today's digital landscape. Unfortunately, research has shown that simply showing content from the opposing viewpoint only hardens one's preconceived views. In this project, students will explore the best ways to inform people of the types and extent of bias present in news articles, and they will build a browser plugin to automatically inform users.
Clickbait Summarizer (Supervisor: Genevieve Bartlett)
Clickbait is "content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page." Often clickbait titles are used to get web users to click on articles in search of a summary or answer. Several web plugins offer the ability to identify and block clickbait links or modify clickbait titles to be less hyperbolic. In this project, we plan to extend this technology with ML and NLP article-summarizing technology to offer cached summaries of clickbait articles.
Social Media Analysis (Supervisor: Emilio Ferrara)
Social media have become pervasive tools for planetary-scale communication and now play a central role in our society: from political discussion to social issues, from entertainment to business, such platforms shape the real-time worldwide conversation. But with new technologies also comes abuse: social media have been used for malicious activities including public opinion manipulation, propaganda, and coordination of cyber-attacks. The selected candidate will work on projects related to social media analysis, in particular studying the behaviour of social media users, and the dynamics of use and abuse of social platforms for a variety of purposes, including the spread of fake news and social bots.
The ideal candidate is a computationally-minded and strongly motivated student with a clear understanding of social networks, machine learning, and data science methods and applications. Python coding fluency and previous experience with social media mining are preferred.