## ESTIMATING PHENYLALANINE OF COMMERCIAL FOODS: A COMPARISON BETWEEN A MATHEMATICAL APPROACH AND A MACHINE LEARNING APPROACH

Phenylketonuria (PKU) is an inherited metabolic disorder that affects 1 in every 10,000 to 15,000 newborns in the United States. Caused by a genetic mutation, PKU results in an excessive buildup of the amino acid phenylalanine (Phe) in the body, leading to symptoms that include intellectual disability, hyperactivity, psychiatric disorders and seizures. Most PKU patients must follow a strict diet limited in Phe. The aim of this research study is to formulate, implement and compare techniques for estimating Phe in commercial foods using the information on the food label (the Nutrition Facts label and the ordered ingredient list). Ideally, the techniques should be both accurate and amenable to a user-friendly implementation as a Phe calculator that would help PKU patients monitor their dietary Phe intake.

The first approach is a mathematical one that comprises three steps. Each step was originally proposed as a separate method by Jieun Kim in her dissertation. The third method, which is more computationally expensive than the other two, was assumed to be the most accurate. However, by performing the three methods in sequence as three steps and combining their results, we obtained better results than by using the third method alone.

The first step uses the protein content of the food together with Phe:protein multipliers. The second step enumerates all the ingredients in the food and uses the minimum and maximum Phe:protein multipliers of those ingredients along with the protein content. The third step uses the fact that the ingredients are listed in decreasing order of weight, which gives rise to inequality constraints on the ingredient amounts. These constraints hold assuming that there is no loss in the preparation process. The constrained problem is solved numerically in two phases. The first phase estimates the nutrient content by approximating the ingredient amounts. The second phase refines these estimates using the Simplex algorithm. The final Phe range is obtained by intersecting the intervals produced by the three steps. We implemented all three steps as web applications. Our proposed three-step method estimates Phe with high accuracy (error within ±13.04 mg Phe per serving for 90% of foods).
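The way the three ranges combine can be illustrated with a minimal Python sketch. It is not the implementation from the dissertation: the function names `phe_range` and `intersect`, the unit of the multipliers (mg Phe per g protein), and all numeric values are assumptions chosen for illustration, and the range from the third step is simply supplied as a given rather than computed with the Simplex algorithm.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound on Phe, in mg


def phe_range(protein_g: float, multipliers: Sequence[float]) -> Interval:
    """Bound Phe as protein (g) times the smallest and largest multiplier.

    Multipliers are assumed to be in mg Phe per g protein.
    """
    return (protein_g * min(multipliers), protein_g * max(multipliers))


def intersect(*intervals: Interval) -> Optional[Interval]:
    """Intersect closed intervals; return None if they do not overlap."""
    lower = max(lo for lo, _ in intervals)
    upper = min(hi for _, hi in intervals)
    return (lower, upper) if lower <= upper else None


# Hypothetical food with 2 g protein per serving.
step1 = phe_range(2.0, [20.0, 60.0])   # generic multipliers (made up)
step2 = phe_range(2.0, [30.0, 50.0])   # per-ingredient multipliers (made up)
step3 = (70.0, 110.0)                  # stand-in for the optimization result

final_range = intersect(step1, step2, step3)
print(final_range)  # (70.0, 100.0)
```

Because every step yields an interval that should contain the true value, the intersection can only be as wide as the narrowest step and is often narrower, which is consistent with the combined method outperforming the third step alone. An empty intersection (returned here as `None`) would signal that at least one step's assumptions do not hold for that food.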

This mathematical procedure is contrasted with a machine learning approach that uses an existing database as training data to infer the Phe content of any given food. Specifically, we use the K-Nearest Neighbors (K-NN) method with a feature vector containing the (rounded) nutrient data. In other words, the Phe content of the test food is a weighted average of the Phe values of its nearest neighbors, where closeness is measured on the nutrient values. Four-fold cross-validation is used to determine the hyperparameters, and training is performed on the United States Department of Agriculture (USDA) food nutrient database. Our tests indicate that this approach is not very accurate for general foods (error within ±50 mg Phe per 100 g for only about 38% of the foods tested). However, for low-protein foods, which are typically consumed by PKU patients, the accuracy increases significantly (error within ±50 mg Phe per 100 g for over 77% of foods).
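A weighted K-NN estimate of this kind can be sketched in a few lines of Python. The sketch makes several assumptions that the text does not specify: Euclidean distance, inverse-distance weights, interleaved fold assignment, and mean absolute error as the selection criterion. The names `knn_phe` and `choose_k` and the three-food data set are hypothetical; the actual study trains on the USDA database.

```python
import math
from typing import List, Optional, Sequence, Tuple

Food = Tuple[Sequence[float], float]  # (nutrient features, Phe in mg per 100 g)


def knn_phe(train: Sequence[Food], query: Sequence[float], k: int = 3) -> float:
    """Estimate Phe as an inverse-distance weighted mean of the k nearest foods."""
    nearest = sorted((math.dist(x, query), phe) for x, phe in train)[:k]
    # If a neighbor matches the query exactly, average the exact matches
    # instead of dividing by a zero distance.
    exact = [phe for d, phe in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * phe for w, (_, phe) in zip(weights, nearest)) / sum(weights)


def choose_k(data: List[Food], candidates: Sequence[int], folds: int = 4) -> Optional[int]:
    """Pick the k with the lowest total absolute error over interleaved folds."""
    best_k, best_err = None, math.inf
    for k in candidates:
        err = 0.0
        for f in range(folds):
            held_out = data[f::folds]
            train = [row for i, row in enumerate(data) if i % folds != f]
            err += sum(abs(knn_phe(train, x, k) - phe) for x, phe in held_out)
        if err < best_err:
            best_k, best_err = k, err
    return best_k


# Hypothetical foods: (protein, fat, carbohydrate) in g per 100 g, and Phe in mg.
foods: List[Food] = [
    ((1.0, 0.0, 10.0), 40.0),
    ((3.0, 0.0, 10.0), 120.0),
    ((10.0, 5.0, 20.0), 450.0),
]

print(knn_phe(foods, (2.0, 0.0, 10.0), k=2))  # 80.0
print(choose_k(foods, candidates=[1, 2]))
```

In the first call the query lies at distance 1 from both low-protein foods, so they receive equal weight and the estimate is the midpoint of 40 and 120. With real label data the features would typically be scaled before computing distances, since nutrients reported in grams and milligrams would otherwise contribute very unequally.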

The machine learning approach is more user-friendly than the mathematical approach. It is convenient, fast and easy to use because it takes only the nutrient information as input. In contrast, the mathematical method additionally requires a detailed ingredient list, which is cumbersome to locate in a food database and to enter as input. However, the mathematical method has the added advantage of providing error bounds for the Phe estimate. It is also more accurate than the machine learning method. This may be because the nutrition facts alone are not sufficient to estimate Phe, and additional information such as the ingredient list is required.