Posts

Towards optimal experimentation in online systems

by CHRIS HAULK

It is sometimes useful to think of a large-scale online system (LSOS) as an abstract system with parameters $X$ affecting responses $Y$. Here, $X$ is a vector of tuning parameters that control the system's operating characteristics (e.g., the weight given to Likes in our video recommendation algorithm), while $Y$ is a vector of outcome measures such as different metrics of user experience (e.g., the fraction of video recommendations that resulted in positive user experiences). If we wish to tune the system parameters $X$ for optimal performance of $Y$, there are several challenges: the relationship between $X$ and $Y$ may be complex and poorly understood; it may be impossible to simultaneously maximize every element of $Y$, requiring trade-offs; and there may be hard constraints on $X$ and $Y$, either individually or in combination, that limit what we deem to be acceptable operating points for the system. One approach to this problem is to experiment with one or two system p

Measuring Validity and Reliability of Human Ratings

by MICHAEL QUINN, JEREMY MILES, KA WONG

As data scientists, we often encounter situations in which human judgment provides the ground truth. But humans often disagree, and groups of humans may disagree with each other systematically (say, experts versus laypeople). Even after we account for disagreement, human ratings may not measure exactly what we want to measure. How to think about the quality of human ratings, and how to quantify that understanding, is the subject of this post.

Overview

Human-labeled data is ubiquitous in business and science, and platforms for obtaining data from people have become increasingly common. Given this, it is important for data scientists to be able to assess the quality of the data generated by these systems: human judgements are noisy and are often applied to questions where answers might be subjective or rely on contextual knowledge. This post describes a generic framework for understanding the quality of human-labeled data, based arou