Deploying nEmesis: Preventing Foodborne Illness by Data Mining Social Media

How to use Twitter to spot bad restaurants spreading food diseases


Foodborne illness afflicts 48 million people annually in the U.S. alone. Over 128,000 are hospitalized and 3,000 die from the infection. While preventable with proper food safety practices, the traditional restaurant inspection process has limited impact given the predictability and low frequency of inspections, and the dynamic nature of the kitchen environment.

Adam Sadilek∗1, Henry Kautz1, Lauren DiPrete2, Brian Labus2, Eric Portman1, Jack Teitel1, and Vincent Silenzio3
1- Department of Computer Science, University of Rochester, Rochester, NY
2- Southern Nevada Health District, Las Vegas, NV
3- School of Medicine & Dentistry, University of Rochester, Rochester, NY

Despite this reality, the inspection process has remained largely unchanged for decades. We apply machine learning to Twitter data and develop a system that automatically detects venues likely to pose a public health hazard. Health professionals subsequently inspect individual flagged venues in a double-blind experiment spanning the entire Las Vegas metropolitan area over three months.

By contrast, previous research in this domain has been limited to indirect correlative validation using only aggregate statistics. We show that the adaptive inspection process is 64% more effective at identifying problematic venues than the current state of the art. The live deployment shows that if every inspection in Las Vegas became adaptive, we could prevent over 9,000 cases of foodborne illness and 557 hospitalizations annually. Additionally, adaptive inspections yield unexpected benefits, including the identification of venues lacking permits, the detection of contagious kitchen staff, and fewer customer complaints filed with the Las Vegas health department.

The fight against foodborne illness is complicated by the fact that many cases are not diagnosed or traced back to specific sources of contaminated food. In a typical U.S. city, if a food establishment passes its routine inspection, it may not see the health department again for up to a year. Food establishments can roughly predict the timing of their next inspection and prepare for it. Furthermore, the kitchen environment is dynamic, and ordinary inspections merely provide a snapshot view. For example, the day after an inspection, a contagious cook or server could come to work or a refrigerator could break, either of which can lead to food poisoning. Unless the outbreak is massive, the illness is unlikely to be traced back to the venue.

We present a novel method for detecting problematic venues quickly, before many people fall ill. We use the phrase adaptive inspections for prioritizing venues for inspection based on evidence mined from social media. Our system, called nEmesis, applies machine learning to real-time data from Twitter, a popular micro-blogging service where people post message updates (tweets) that are at most 140 characters long.

A tweet sent from a smartphone is usually tagged with the user’s precise GPS location. We infer the food venues each user visited by “snapping” his or her tweets to nearby establishments (Fig. 1). We develop and apply an automated language model that identifies Twitter users who indicate they suffer from foodborne illness in the text of their public online communication.

As a result, for each venue, we can estimate the number of patrons who fell ill shortly after eating there. We build on our prior work, where we showed a correlation between the number of “sick tweets” attributable to a restaurant and its historical health inspection score (Sadilek et al. 2013). In this paper, however, we deploy an improved version of the model and validate its predictions in a controlled experiment. The Southern Nevada Health District started a controlled experiment with nEmesis on January 2, 2015. Venues with the highest predicted risk on any given day are flagged and subsequently verified by a thorough inspection by an environmental health specialist.
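The per-venue estimate and daily flagging described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say how nEmesis turns per-patron evidence into a venue risk score, so the sketch simply sums one illness probability per patron (an expected count) and flags the top k venues. The function names, the input tuple layout, and the choice of k are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One record per observed visit: (venue, user, probability the user is sick).
# The probability is assumed to come from some upstream text classifier.
Visit = Tuple[str, str, float]


def expected_sick_patrons(visits: Iterable[Visit]) -> Dict[str, float]:
    """Sum illness probabilities per venue, counting each user once per venue.

    If a user appears several times for the same venue, only the highest
    probability is kept (an assumption made for this sketch).
    """
    per_venue: Dict[str, Dict[str, float]] = defaultdict(dict)
    for venue, user, p_sick in visits:
        previous = per_venue[venue].get(user, 0.0)
        per_venue[venue][user] = max(previous, p_sick)
    return {venue: sum(users.values()) for venue, users in per_venue.items()}


def flag_top_venues(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k venues with the highest score (ties broken by name)."""
    return sorted(scores, key=lambda venue: (-scores[venue], venue))[:k]
```

Summing probabilities is just one simple way to estimate a count of sick patrons; a deployed system could weight or calibrate the evidence quite differently.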

For each adaptive inspection, we perform a paired control inspection independent of the online data to ensure full annual coverage required by law and to compensate for the geographic bias of Twitter data. During the first 3 months, the environmental health specialists inspected 142 venues, half using nEmesis and half following the standard protocol.

The latter set of inspections constitutes our control group. The inspectors were not told whether the venue came from nEmesis or control. nEmesis downloads and analyzes all tweets that originate from Las Vegas in real time. To estimate visits to restaurants, each tweet that is within 50 meters of a food venue is automatically “snapped” to the nearest one as determined by the Google Places API.
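The snapping step can be illustrated with a minimal sketch, assuming the venue coordinates have already been fetched (for example from Google Places) into a plain list. The haversine distance and the linear scan are choices made for this illustration, not necessarily what nEmesis does, and the names `haversine_m` and `snap_to_venue` are hypothetical. Only the 50-meter threshold comes from the text.

```python
import math
from typing import Optional, Sequence, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters
SNAP_RADIUS_M = 50.0          # snapping threshold stated in the text

Venue = Tuple[str, float, float]  # (name, latitude, longitude)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def snap_to_venue(lat: float, lon: float, venues: Sequence[Venue],
                  radius_m: float = SNAP_RADIUS_M) -> Optional[str]:
    """Return the nearest venue within radius_m of the tweet, or None."""
    best_name, best_dist = None, radius_m
    for name, vlat, vlon in venues:
        dist = haversine_m(lat, lon, vlat, vlon)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name
```

A linear scan is fine for a handful of venues; for a whole metropolitan area, a spatial index would likely be needed to keep up with a live tweet stream.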

We used Google Places to determine the locations of establishments because it includes latitude/longitude data that is more precise than the street address of licensed food venues. As we will see, this decision allowed nEmesis to find problems at unlicensed venues.

For this snapping process, we only consider tweets that include GPS coordinates. Cell phones determine their location through a combination of satellite GPS, WiFi access point fingerprinting, and cell-tower triangulation (Lane et al. 2010). Location accuracy typically ranges from 9 meters to 50 meters and is highest in areas with many cell towers and WiFi access points.

In such cases, even indoor localization (e.g., within a mall) is accurate. Once nEmesis snaps a user to a restaurant, it collects all of his or her tweets for the next five days, including tweets with no geo-tag and tweets sent from outside of Las Vegas. This is important because most restaurant patrons in Las Vegas are tourists, who may not show symptoms of illness until after they leave the city. nEmesis then analyzes the text of these tweets to estimate the probability that the user is suffering from foodborne illness.
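The five-day follow-up can be sketched as a simple time-window filter. This is a minimal illustration, assuming tweets are available as in-memory records. The `Tweet` type and both function names are hypothetical, the window is taken to start at the visit itself, and the `score` callable stands in for the language model, whose details are not given here. Summarizing a user by the maximum per-tweet score is likewise an assumption.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, NamedTuple

FOLLOW_UP = timedelta(days=5)  # follow-up period stated in the text


class Tweet(NamedTuple):
    user: str
    time: datetime
    text: str


def follow_up_tweets(tweets: Iterable[Tweet], user: str, visit_time: datetime,
                     window: timedelta = FOLLOW_UP) -> List[Tweet]:
    """Keep the user's tweets from the visit up to `window` later.

    Location is deliberately ignored, so tweets without a geo-tag or sent
    from another city are still collected.
    """
    end = visit_time + window
    return [t for t in tweets
            if t.user == user and visit_time <= t.time <= end]


def max_sick_probability(tweets: Iterable[Tweet],
                         score: Callable[[str], float]) -> float:
    """Summarize a user by the highest per-tweet illness score (0.0 if none)."""
    return max((score(t.text) for t in tweets), default=0.0)
```

Filtering by user and time only, never by location, is what lets the sketch follow tourists after they leave Las Vegas.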

nemesis data mining


Read the entire paper