Arshavir Blackwell, PhD

Projects

This is a summary of machine-learning-related projects I have worked on over the past dozen years; it is by no means complete. More in-depth write-ups of some of these projects are available on request.

2010 – present. Senior Scientist/Engineer, CitizenNet, Los Angeles, CA

My primary work at CitizenNet revolves around two machine learning technologies: clustering and classification. One of CitizenNet's main value-adds is that it takes the complex API of Facebook advertising and targeting and hides it from our users under a layer of sophisticated artificial intelligence; all the user sees is a simple, easy-to-use interface. Based upon our Big Data harvesting of previous responses to Facebook campaigns, at the multi-terabyte level, we are able to develop a profile of which keywords, interests, and other demographic features tend to be correlated and are the most desirable for a given campaign.

Clustering. The first step is clustering, which allows us to reduce a heterogeneous group of users to a few relevant characteristics: for example, a) women aged 16–25 on the East Coast, or b) Midwestern fans of One Direction who also like Electric Daisy Carnival. Clustering finds regularities in large data sets and lets us characterize and generalize about our target populations without having to identify and characterize every one of thousands or millions of users individually. Thus, when a group of, say, 5,000 users maps to a particular cluster, all further targeting can be characterized with respect to that cluster rather than with respect to each individual user.

Classification. The second component of our back-end process is classification: given a particular cluster, how likely is this group to be interested in our advertising message? Every time an ad is presented to a user who does not click on it (or show some other overt sign of interest), the advertiser is wasting money. Hence, a primary goal of targeting (presenting ad messages only to those users interested in them) is greater efficiency and better use of ad dollars.
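To illustrate the clustering step, here is a minimal pure-Python K-means sketch (K-means being one of the techniques I used at CitizenNet). This is a toy illustration, not production code, and the two features (age and an interest-affinity score) are hypothetical:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its members."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, clusters

# Two obvious groups of "users", each described by two hypothetical
# features: (age, interest-affinity score for some topic).
points = [(16, 0.9), (18, 0.8), (20, 0.95), (45, 0.1), (50, 0.2), (48, 0.15)]
centroids, clusters = kmeans(points, k=2)
```

On data this clearly separated, the algorithm converges in a couple of iterations to one "young, high-affinity" cluster and one "older, low-affinity" cluster; real campaign data is far noisier and higher-dimensional.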
The classifier uses a variety of machine learning techniques, including neural networks, support vector machines, and ensemble methods. It starts with features of relevant keywords from prior campaigns, paired with how well those features performed, and uses these data to learn to predict the click-through rate of a set of features for a given campaign. This is the key to higher click-through rates, the holy grail of most on-line advertising: predicting which groups are more likely to be interested in your message, and then presenting your message to those groups.

Implementation. My work at CitizenNet has involved both clustering and classification, drawing from a wide variety of machine learning techniques, including K-means clustering, self-organizing maps, multiple regression, neural networks, support vector machines, ensemble classifiers, and genetic algorithms for initial feature selection. This has required both a thorough understanding of the machine learning concepts involved and a knowledge of the everyday nuts and bolts of getting such techniques to work in the real world (my principal working languages are Python and Java).
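The learn-from-past-campaigns idea can be sketched with a much simpler model than the production system used: a logistic regression trained by gradient descent on binary keyword-presence features, where the target is whether a past impression was clicked. All feature names and data here are hypothetical:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by gradient descent on log loss:
    a sketch of learning click likelihood from past campaigns."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))       # predicted click probability
            g = p - yi                           # gradient of log loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict_ctr(w, b, x):
    """Predicted probability of a click for feature vector x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical binary features: [likes_band, likes_festival, east_coast]
X = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]  # 1 = user clicked the ad in a past campaign
w, b = train_logistic(X, y)
```

Given the trained model, `predict_ctr` scores a new cluster's feature vector; audiences scoring above some threshold would be worth targeting.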

2008 – 2010. Senior Scientist/Engineer, Fox Audience Network, Los Angeles, CA

Demographic prediction engine. Demographic statistics such as age and gender are key to refining ad targeting, but these values are often missing for individual users. I analyzed and modeled profiles of users with known demographic values in order to predict those values for other users: the goal was to extrapolate from users with known features to predict (e.g.) age and gender for users where those features were missing. This project required a full understanding of the relevant algorithms (e.g., support vector machines), as well as the ability to implement them in Java and to perform large-scale (5+ terabyte) data analyses in Hadoop. Code passed production-level quality testing.

Buzz tracking. Knowing which trends are important over time is key to a variety of marketing initiatives, including ad targeting. Advertisers want to see evidence of lift in terms related to the products or services they are advertising; they may also be interested in which related terms are currently trending high, in order to link those terms to their ads. This system was a response to that need: it used trend and rate analysis algorithms to track changes in the frequency and intensity of targeted buzz words. The project required a full understanding of the relevant algorithms (e.g., the bursty stream algorithm), as well as the ability to implement them in Java and to perform large-scale (1+ terabyte) data analyses in Hadoop. Code passed production-level quality testing.

Advised music recommender student internship. Users seek new sources of on-line entertainment, including music and movies. This project addressed that need by creating a prototype music recommender system that leverages the advantages of social media.
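The production buzz tracker used a bursty-stream algorithm; as a much simpler stand-in to illustrate the idea, here is a sliding-window detector that flags time steps where a term's mention count jumps well above its recent baseline (all counts are hypothetical):

```python
def bursting(counts, window=4, threshold=2.0):
    """Return the indices where a term's count exceeds its recent
    mean by more than `threshold` standard deviations (a simplified
    stand-in for real burst detection)."""
    bursts = []
    for t in range(window, len(counts)):
        hist = counts[t - window:t]
        mean = sum(hist) / window
        var = sum((c - mean) ** 2 for c in hist) / window
        std = var ** 0.5
        # max(std, 1.0) keeps a flat baseline from flagging tiny wiggles.
        if counts[t] > mean + threshold * max(std, 1.0):
            bursts.append(t)
    return bursts

# Hypothetical daily mention counts for one keyword: a spike at day 5.
counts = [10, 12, 11, 9, 10, 55, 12, 11, 10]
bursts = bursting(counts)
```

A real system would run this per term over streaming counts and rank terms by burst intensity; the Kleinberg-style algorithms used in production also model how long a burst persists, not just single-step jumps.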
I directed Harvey Mudd College interns in a nine-month project to design and deploy a music recommender system using Java, C++, and JavaScript, the details of which they designed and wrote themselves. The system was built as a MySpace widget plug-in, using social data to mine a user's social network and develop a cohort of experts for a particular musical genre within the user's social sphere. This was an innovative approach, in contrast with recommender systems that only compare items a user likes to similar items, or that group users by shared preferences without regard to actual social links.

Intent Miner. Helped to develop and test a system to extract intents from unstructured text (e.g., “intent to purchase car,” “intent to purchase cellphone,” “just married,” “just had a child”) in order to classify users into hyper-targeting groups (e.g., more likely to be interested in home decor or baby products). This project required an understanding of natural language algorithms, Java, and JUnit testing.
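The intent-extraction idea can be sketched very simply. The production system used real natural-language algorithms; this is only a regex-based toy with hypothetical patterns, mapping free-text posts to zero or more intent labels:

```python
import re

# Hypothetical intent patterns -- the real system used NLP, not regexes.
INTENT_PATTERNS = {
    "intent_to_purchase_car": r"\b(buy|buying|shopping for)\b.*\b(car|sedan|suv)\b",
    "just_had_a_child": r"\b(newborn|just had a (baby|child))\b",
}

def extract_intents(text):
    """Map a free-text post to the intent labels whose pattern it matches."""
    text = text.lower()
    return [label for label, pat in INTENT_PATTERNS.items()
            if re.search(pat, text)]

posts = [
    "We're shopping for a new SUV this weekend!",
    "So tired, but our newborn is worth it.",
    "Nice weather today.",
]
labels = [extract_intents(p) for p in posts]
```

Each extracted intent would then feed a hyper-targeting group, e.g. `intent_to_purchase_car` users receiving auto-related ads.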

2007. Principal Computer Scientist, MetaLINCS, San Jose, CA

Innovation team. Supported innovations and improvements to the MetaLINCS flagship e-discovery application. This required a) a complete understanding of both the then-current product and its embedded algorithms, as well as of other algorithms of potential benefit; b) expert skill in Java; and c) expertise in a wide variety of natural language processing algorithms.

2005 – 2007. Senior Scientist/Director of Research, H5 Technologies

Led research and development. Researched and developed improvements to business processes in order to increase accuracy and speed and to lower cost. This required a complete understanding of the research and analytical methodologies needed to evaluate the performance of those processes, especially as they relate to large-scale document analysis.

Built software tools. Using Java and JavaServer Faces (JSF), worked as part of the team to architect, develop, and test new tools, particularly for search. These tools supported the company's core mission: analyzing very large document sets (on the order of millions of documents) in order to identify those relevant to a particular legal case.

2003 – 2005. Senior Engineer & Project Lead, Entrieva, Reston, VA

Unstructured document management applications. Using C++ and Java, acted as lead to maintain and upgrade the categorization software central to Entrieva's solutions. Architected new solutions to augment the product portfolio, expanding the company's services and increasing its competitiveness. This required a complete understanding of the company's proprietary language-processing algorithms.

2001 – 2003. Principal, adaptiveLava, Oakland, CA

Peer-to-peer artificial intelligence. Chief architect of an application, in an ongoing project, to merge peer-to-peer functionality with artificial intelligence/natural language systems in an enterprise environment, based on open-source code. Using search and retrieval algorithms, the application made previously inaccessible files on individual PCs available to many users within an enterprise or knowledge community, rather than only the files specifically pushed to servers. This enabled the enterprise or community to leverage its existing intellectual property assets to a degree not previously possible.

2001. Principal Scientist, Comprecorp, Nevada City, CA

Classifier project. Lead on a project to design and implement the engine of the Comprecorp Classifier, intended to classify e-mail and other free-text documents of arbitrary length according to user-specified categories. The user did not have to tell the system what rules to use to put a document in a particular category; it learned by example from documents already filed in those categories. Potential uses of such a system go beyond e-mail classification to a wide variety of large-scale document management and data mining applications.
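The learn-by-example approach can be sketched with a small Naive Bayes text classifier. This is not the Comprecorp engine, just an illustration of classifying documents from labeled examples rather than hand-written rules, with hypothetical toy data:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Learns categories by example: no hand-written rules, only word
    statistics gathered from documents the user has already filed."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        words = doc.lower().split()
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical already-categorized e-mails serving as training examples.
docs = ["meeting agenda attached", "quarterly budget review meeting",
        "cheap pills online now", "win money now click here"]
labels = ["work", "work", "spam", "spam"]
clf = NaiveBayesClassifier().fit(docs, labels)
```

A new message is then filed into whichever user-defined category its words make most probable, which is the "learn by example" behavior described above.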

1999 – 2000. Senior Engineer, Ask Jeeves, Emeryville, CA

Jeeves automation project. Lead on a project to improve accuracy and lower cost for the Jeeves question-answering system. This required expert programming skills and a thorough understanding of the research and analytical methodologies needed to evaluate system performance. In the original system, creating and maintaining a knowledge base of questions and answers was too labor-intensive and too costly to encourage adoption by smaller businesses, particularly in the face of increasing competition from other question-answering products. As project lead, I identified bottlenecks in knowledge-base creation that were amenable to adaptive automation, created design specifications, and led a team in writing code to implement those changes. I presented both method and results to company members through meetings and on-line publications, and worked with in-house customers to refine the prototype's usability. The result was prototype knowledge bases of comparable quality whose creation and maintenance required significantly less human effort and expense.