Adaptive Stratified Sampling for Precision-Recall Estimation

Abstract

We propose a new algorithm for computing a constant-factor approximation of precision-recall (PR) curves for massive noisy datasets produced by generative models. Assessing the validity of items in such datasets requires human annotation, which is costly and must be minimized. Our algorithm, ADASTRAT, is the first data-aware method for this task. It chooses the next point to query on the PR curve adaptively, based on previous observations. It then selects specific items to annotate using stratified sampling. Under a mild monotonicity assumption, ADASTRAT outputs a guaranteed approximation of the underlying precision function, while using a number of annotations that scales very slowly with $N$, the dataset size. For example, when the minimum precision is bounded by a constant, it issues only $\log\log N$ precision queries. In general, it has a regret of no more than $\log\log N$ w.r.t. an oracle that issues queries at data-dependent (unknown) optimal points. On a scaled-up NLP dataset of 3.5M items, ADASTRAT achieves a remarkably close approximation of the true precision function using only 18 precision queries, 13x fewer than the best previous approach.

Consider the setting of Section 2.1, where we use noisy estimates $\tilde v(u)$ of $v(u)$ at a set $J$ of randomly chosen points from $\{1, 2, \ldots, r\}$, in order to estimate $p(r)$.

First, fix a $u$. Because of the error probability $\eta < 1/2$ in the estimate, the expected value of $\tilde v(u)$ is $v(u)(1-\eta) + (1-v(u))\,\eta$, which equals $v(u) + \eta - 2\eta\, v(u)$. Second, note that the expected value of $\sum_{j \in J} v(t_j)$ across random choices of $J$ is precisely $z\,p(r)$, where $z = |J|$.

Putting these together, we obtain that the expected value of $\frac{1}{z}\sum_{j \in J} \tilde v(t_j)$ across random choices of $J$ is $p(r) + \eta - 2\eta\,p(r)$. Assuming $p(r) \ge 1/3$, this lies within $(1 \pm \eta)\,p(r)$, i.e., within a multiplicative factor of $1+\eta$ of $p(r)$.

Given $\epsilon > \eta$, we can now use the two-sided Hoeffding bound (Hoeffding, 1963) to assess how many samples, $z$, are needed to obtain a $(1+\epsilon)$-approximation of $p(r)$. For this, it suffices to use a $z$ that guarantees a $\frac{1+\epsilon}{1+\eta}$-approximation of the expected value of $\frac{1}{z}\sum_{j \in J}\tilde v(t_j)$, which, as shown above, is always within a factor of $1+\eta$ of $p(r)$. To this end, a direct application of Hoeffding's inequality, while observing that $\frac{1+\epsilon}{1+\eta} - 1 = \frac{\epsilon - \eta}{1+\eta}$, yields:

$$z \;\ge\; \frac{(1+\eta)^2}{2\,(\epsilon - \eta)^2\, p(r)^2}\, \ln\frac{2}{\delta}.$$
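To make the bound concrete, it can be evaluated directly. The following is a minimal sketch (the function name, and the use of a known lower bound p_min in place of $p(r)$, are our own assumptions):

```python
import math

def samples_needed(eps, eta, delta, p_min):
    """Annotations z needed for a (1+eps)-approximation of p(r), per the
    Hoeffding bound above, given per-item annotation error eta < eps and a
    lower bound p_min <= p(r)."""
    assert 0 <= eta < eps, "the bound requires eps > eta"
    return math.ceil((1 + eta) ** 2 / (2 * (eps - eta) ** 2 * p_min ** 2)
                     * math.log(2 / delta))

# e.g., a 1.1-approximation with 5% annotation noise and 95% confidence,
# when precision is known to be at least 0.5:
print(samples_needed(eps=0.10, eta=0.05, delta=0.05, p_min=0.5))  # 3254
```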

B.2 Proof Of Theorem 2

The upper bound:

$$p(v') \;\le\; \begin{cases} \dfrac{p(y)\,y}{v'} & \text{if } v' \le y,\\[6pt] \dfrac{p(y)\,y + (v' - y)}{v'} & \text{if } y < v' \le y + m\,p(y),\\[6pt] \dfrac{p(y)\,y + m\,p(y)}{v'} & \text{if } y + m\,p(y) < v' \le y + m,\\[6pt] p(y) & \text{if } v' > y + m. \end{cases}$$

When $v' \le y$, we have $p(v')\,v' \le p(y)\,y$ because of the monotonicity of the partial sum $\sum_{j=1}^{n} v(t_j)$. Therefore $p(v') \le p(y)\,y / v'$, which is the result on the first line. When $v' > y + m$, $p(v') \le p(y)$ because of weak $(r, m)$-monotonicity, which is the result on the fourth line. In the middle, for $y + m\,p(y) < v' \le y + m$, weak $(r, m)$-monotonicity gives $p(y+m) \le p(y)$, hence at most a $p(y)$ fraction of the entries $v(t_{y+1}), v(t_{y+2}), \ldots, v(t_{y+m})$ can be 1. This implies that $p(v')\,v' \le p(y)\,y + m\,p(y)$, which is the third line. For $y < v' \le y + m\,p(y)$, in the worst case all $v' - y$ terms among $v(t_{y+1}), \ldots, v(t_{v'})$ are 1. Therefore $p(v')\,v' \le p(y)\,y + v' - y$, which is the second line.

The lower bound:

$$p(v') \;\ge\; \begin{cases} p(y) & \text{if } v' < y - m,\\[2pt] \dfrac{p(y)\,y - m\,p(y)}{v'} & \text{if } y - m \le v' < y - m\,p(y),\\[6pt] \dfrac{p(y)\,y - (y - v')}{v'} & \text{if } y - m\,p(y) \le v' < y,\\[6pt] \dfrac{p(y)\,y}{v'} & \text{if } v' \ge y. \end{cases}$$

The first line holds because of weak monotonicity. The fourth line is again because $p(y)\,y \le p(v')\,v'$ when $v' \ge y$. In the middle, when $y - m\,p(y) \le v' < y$, at most all $y - v'$ items $v(t_{v'+1}), \ldots, v(t_y)$ are 1. Hence $p(v')\,v' + y - v' \ge p(y)\,y$, which gives us the third line. When $y - m \le v' < y - m\,p(y)$, again because $p(y-m) \ge p(y)$, at most a $p(y)$ fraction of the elements $v(t_{y-m+1}), \ldots, v(t_y)$ can be 1. This implies that $p(v')\,v' + m\,p(y) \ge p(y)\,y$; rearranging terms, we get the second line.

The envelope is tight because we can construct sequences of $v(t_j)$ that match the upper and lower bounds exactly.
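The envelope is easy to evaluate; below is a direct transcription of the two case analyses (variable names are ours, and the returned bounds are not clipped to $[0, 1]$):

```python
def upper_envelope(p_y, y, m, v):
    """Upper bound on p(v) implied by observing p(y), per the four cases
    of the upper-bound proof above."""
    if v <= y:
        return p_y * y / v                 # partial-sum monotonicity
    if v <= y + m * p_y:
        return (p_y * y + (v - y)) / v     # all new items may be 1
    if v <= y + m:
        return (p_y * y + m * p_y) / v     # at most a p(y) fraction are 1
    return p_y                             # weak (r, m)-monotonicity

def lower_envelope(p_y, y, m, v):
    """Lower bound on p(v) implied by observing p(y), mirroring the
    lower-bound cases above."""
    if v >= y:
        return p_y * y / v                 # partial-sum monotonicity
    if v >= y - m * p_y:
        return (p_y * y - (y - v)) / v     # all removed items may be 1
    if v >= y - m:
        return (p_y * y - m * p_y) / v     # at most a p(y) fraction are 1
    return p_y                             # weak (r, m)-monotonicity

# e.g., p(1000) = 0.5 with m = 100 confines p(1100) to [0.4545..., 0.5]:
print(lower_envelope(0.5, 1000, 100, 1100), upper_envelope(0.5, 1000, 100, 1100))
```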

B.3 Proof Of Theorem 4

The number of calls to QUERY is one plus the number of intervals $(p(l), p(r))$ that enter the stopping condition of function PR. Notice that the boundary $(l, r)$ of every interval $(p(l), p(r))$ that enters the stopping condition must satisfy $r/l \ge 1 + \epsilon$, since otherwise its parent function call would already satisfy the stopping condition $r/l \le (1+\epsilon)^2$. Since these intervals are non-overlapping, aggregated across all intervals this can happen at most $\log_{1+\epsilon}(N/\bar l)$ times, as claimed.
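For intuition, here is a minimal sketch of the recursion this count refers to, assuming a geometric midpoint split and a caller-supplied QUERY estimator (the exact stopping rule and signatures in ADASTRAT may differ):

```python
import math

def pr(l, r, p_l, p_r, query, eps, out):
    """Recursively refine the interval (p(l), p(r)), issuing one QUERY per
    split, until the stopping condition is met."""
    if r - l <= 1 or r / l <= (1 + eps) ** 2 or p_l <= (1 + eps) ** 2 * p_r:
        out.append(((l, p_l), (r, p_r)))               # interval is resolved
        return
    c = max(l + 1, min(r - 1, int(math.sqrt(l * r))))  # geometric midpoint
    p_c = query(c)                                     # one precision query
    pr(l, c, p_l, p_c, query, eps, out)
    pr(c, r, p_c, p_r, query, eps, out)

# usage with a synthetic, slowly decaying precision function:
q = lambda c: 0.9 * (100 / c) ** 0.2
segments = []
pr(100, 100_000, q(100), q(100_000), q, eps=0.1, out=segments)
print(1 + len(segments))   # QUERY calls: one plus the resolved intervals
```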

B.4 Proof Of Lemma 3

Let $\bar l = \max\left\{ m \cdot \frac{1+\epsilon}{\epsilon},\, r \right\}$. Using the full characterization from Theorem 2, we can prove that for $r' > l' > \bar l$, we have $p(r') \le (1+\epsilon)\,p(l')$.

Because of the definition of $q_1, \ldots, q_K$, we know that $q_1 \le q_2 \le \ldots \le q_K$. Moreover, we know that for $q_i < j < q_{i+1}$, $p(j) \ge p(q_i)/(1+\epsilon)$.

Because of the claim in the first paragraph, we also know that $p(j) \le (1+\epsilon)\,p(q_i)$. This completes the proof.

B.5 Proof Of Lemma 4

First, the function $p(r)\,r$ is monotonically increasing, because for $l < r$ we have

$$r\,p(r) \;=\; \sum_{i=1}^{r} v(t_i) \;\ge\; \sum_{i=1}^{l} v(t_i) \;=\; l\,p(l).$$

Therefore, $p(j)\,j$ is sandwiched between $p(s_i)\,s_i$ and $p(s_{i+1})\,s_{i+1}$ whenever $j \in \{s_i, \ldots, s_{i+1}\}$. Because of the definition of $s_{i+1}$, for any $j \in \{s_i, \ldots, s_{i+1}\}$ we have $p(j)\,j \ge p(s_i)\,s_i$ and $p(j)\,j \le (1+\epsilon)\,p(s_i)\,s_i$, which finishes the proof.
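Both facts are immediate from the partial-sum form of $r\,p(r)$; a quick numerical check on synthetic 0/1 labels (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=10_000)   # synthetic validity labels v(t_i)
r_times_p = np.cumsum(v)              # r * p(r) = sum of v(t_i) for i <= r

# r * p(r) never decreases, so p(j) * j is sandwiched between the grid
# values p(s_i) * s_i and p(s_{i+1}) * s_{i+1} for s_i <= j <= s_{i+1}:
assert np.all(np.diff(r_times_p) >= 0)
```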

B.6 Proof Of Theorem 6

Suppose the "optimal" algorithm calls QUERY at points r 1 , r 2 , . . . , r OPT . We begin with some relevant notation. These points split the entire range betweenl and N into OPT + 1 segments. We call one segment of this type an opt-segment. We also define the following tuple:

p(l), p(r), d, SMALL/BIG , where (p(l), p(r))

is an interval on which the function PR called during the execution of ADASTRAT, d is the depth of this recursive call, and the fourth entry is either SMALL or BIG. It is BIG if and only if the interval [l, r] covers at least one complete opt-segment. The log distance of a tuple is measured as log 1+ r /l. Two tuples are treated as cousins if they are called by a single PR function as two children calls (i.e., one is (p(l), p(c)), while the other one is (p(c), p(r))). We can merge two cousin tuples. The result of merging is the tuple representing the parent function that called the functions represented by these two cousin tuples. Notice that the log distance of the merged tuple is 2 times that of one cousin tuple.

We start with the set $\Phi$, made of tuples $\langle p(l), p(r), d, \mathrm{SMALL/BIG}\rangle$ such that $(p(l), p(r))$ enters the stopping condition of function PR during the execution of the algorithm. We repeat the following merge operation on tuples in $\Phi$ until no tuple in $\Phi$ is tagged SMALL: (i) choose a tuple from $\Phi$ that has the largest depth $d$ among those tagged SMALL; (ii) merge this tuple with its cousin tuple; (iii) add the merged tuple back into $\Phi$. We refer to the set $\Phi$ obtained after merging all SMALL tuples as $\Phi_E$.

First, all tuples in $\Phi_E$ are tagged BIG; therefore, each of them contains an opt-segment. Second, since the tuples in $\Phi_E$ still come from the PR recursion, they must be non-overlapping. Based on these two observations, the number of tuples in $\Phi_E$ is bounded by the number of opt-segments, which is $\mathrm{OPT}+1$. How many merge operations could each tuple in $\Phi_E$ have gone through? This is bounded by $O(\log_2 \log_{1+\epsilon} N)$, since the largest possible log distance is $\log_{1+\epsilon} N$ (the entire range), and each merge operation doubles the log distance.

Now we count the number of QUERY calls that algorithm ADASTRAT makes. It is easy to see that

$$\#\mathrm{QUERY} \;=\; |\Phi_E| + \#\mathrm{MERGE} + 1,$$

where $\#\mathrm{MERGE}$ is the total number of merge operations performed. This number is bounded by

$$|\Phi_E| \;+\; |\Phi_E| \cdot \max\{\#\mathrm{MERGE} \text{ for one tuple in } \Phi_E\} \;+\; 1.$$

Combined with the fact that $|\Phi_E| \le \mathrm{OPT} + 1$ and that $\#\mathrm{MERGE}$ for one tuple in $\Phi_E$ is at most $\log_2 \log_{1+\epsilon} N$, we obtain the claimed bound.
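To see the $\log_2 \log_{1+\epsilon} N$ factor concretely, one can count the doublings directly; a small illustrative sketch (the starting log distance of 1 is an assumption):

```python
import math

def max_merges_per_tuple(N, eps, d0=1):
    """Each merge doubles a tuple's log distance log_{1+eps}(r/l); starting
    from distance d0, count doublings until the full range log_{1+eps}(N)
    is exceeded, i.e., roughly log2(log_{1+eps} N) merges."""
    full_range = math.log(N) / math.log(1 + eps)
    merges, d = 0, d0
    while d < full_range:
        d, merges = 2 * d, merges + 1
    return merges

print(max_merges_per_tuple(N=3_500_000, eps=0.2))  # 7 for the 3.5M dataset
```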


B.7 Detailed Construction Of F To Prove Theorem 7

Our proof works by showing that any algorithm that can identify an unknown function from $F$, which includes $A$, must make at least $J$ queries. To demonstrate this, our function family $F$ is constructed such that at each separating point, there are at least two functions from $F$ that are more than a factor of $(1+\epsilon)^2$ apart. Therefore, $A$ has to query all $J$ separating points to identify one function from $F$. Specifically, we identify $J+1$ points $a_1, \ldots, a_{J+1}$, where the space between $a_j$ and $a_{j+1}$ is called the $j$th interval ($j = 1, \ldots, J$), shown as the space between two green triangles in Figure 5. Within each interval, a function $f$ from $F$ follows either the red or the blue curve. Because we have $J$ intervals in total, this gives us $2^J$ functions in $F$. The separating points are shown as yellow circles in Figure 5. We construct these yellow circles to guarantee that the gap (height) between the red and blue alternatives is more than a factor of $(1+\epsilon)^2$.

Figure 5: An illustration of the function family supporting the lower bound in Theorem 7.

We argue that Algorithm A must make at least one query in each of the J intervals. This is because even if A knows the exact function segment (red or blue) everywhere except for the j-th interval, it still cannot decide if f follows the red or the blue curve within the j-th interval.

The exact locations of the separating points are as follows. Let $a_j = (1+\epsilon')^{4j-4}\, y_0$ and $b_j = (1+\epsilon')^{4j-2}\, y_0$ for $j = 1, \ldots, J+1$, with $N = a_{J+1} = (1+\epsilon')^{4J}\, y_0$. Here, $y_0$ is a sufficiently large number and $\epsilon' > \epsilon$. Let $1 < \gamma < (1+\epsilon')/(1+\epsilon)$. Select $2^J$ $(r, m)$-weakly monotonic functions $f_0, f_1, \ldots, f_{2^J-1}$ such that their values at $a_j$ (shown as the green triangles of Figure 5) are close. More specifically, for all functions $f_u \in \{f_0, f_1, \ldots, f_{2^J-1}\}$, their values at $a_j$ are bounded:

$$\frac{1}{\gamma\,(1+\epsilon')^{2j-2}} \;<\; f_u(a_j) \;<\; \frac{\gamma}{(1+\epsilon')^{2j-2}}.$$

For a given $b_j$ (shown as yellow circles in Figure 5), half of the functions in $\{f_0, f_1, \ldots, f_{2^J-1}\}$ match approximately around one particular value (i.e., follow the red curve), while the other half match around a different value (i.e., follow the blue curve). More specifically, for all $u \in \{0, 1, \ldots, 2^J - 1\}$ whose $j$-th binary digit is 0, $f_u$'s value at $b_j$ is bounded by

$$\frac{1}{\gamma\,(1+\epsilon')^{2j-2}} \;<\; f_u(b_j) \;<\; \frac{\gamma}{(1+\epsilon')^{2j-2}}.$$

For all $u \in \{0, 1, \ldots, 2^J - 1\}$ whose $j$-th binary digit is 1, $f_u$'s value at $b_j$ is bounded by

$$\frac{1}{\gamma\,(1+\epsilon')^{2j}} \;<\; f_u(b_j) \;<\; \frac{\gamma}{(1+\epsilon')^{2j}}.$$

Suppose algorithm $A$ outputs a $(1+\epsilon)$-approximate curve for any $(r, m)$-weakly monotonic precision function. Starting with an unknown function $f$ drawn from the set $\{f_0, f_1, \ldots, f_{2^J-1}\}$, we can use algorithm $A$ to identify which function $f$ actually is. To see this, we examine the approximate values $\hat f(b_1), \ldots, \hat f(b_J)$ returned by algorithm $A$. If

$$\hat f(b_j) \;>\; \frac{1}{(1+\epsilon')^{2j-1}},$$

we know that

$$\frac{1}{\gamma\,(1+\epsilon')^{2j-2}} \;<\; f(b_j) \;<\; \frac{\gamma}{(1+\epsilon')^{2j-2}},$$

because otherwise $\hat f(b_j)$ and $f(b_j)$ would be more than a factor of $1+\epsilon$ apart, contradicting $A$'s approximation guarantee. The other side of the argument holds analogously.
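The identification step can be checked numerically. Below is a small sketch under the construction above (parameter names are ours: eps_g plays the role of $\epsilon'$, eps_a the role of $\epsilon$, and $\gamma < (1+\epsilon')/(1+\epsilon)$):

```python
import numpy as np

J, eps_g, eps_a, gamma = 8, 0.4, 0.1, 1.05   # gamma < 1.4 / 1.1
rng = np.random.default_rng(1)

u = int(rng.integers(0, 2 ** J))             # index of the unknown f = f_u
bits = [(u >> (j - 1)) & 1 for j in range(1, J + 1)]

recovered = 0
for j in range(1, J + 1):
    # f_u(b_j) sits near (1+eps_g)^-(2j-2) (bit 0, red) or (1+eps_g)^-(2j)
    # (bit 1, blue), within a factor gamma either way:
    center = (1 + eps_g) ** -(2 * j - 2 + 2 * bits[j - 1])
    f_bj = center * gamma ** rng.uniform(-1, 1)
    # A returns a (1+eps_a)-approximation of f_u(b_j):
    f_hat = f_bj * (1 + eps_a) ** rng.uniform(-1, 1)
    # threshold test at the geometric midpoint 1/(1+eps_g)^(2j-1):
    if f_hat <= (1 + eps_g) ** -(2 * j - 1):
        recovered |= 1 << (j - 1)            # j-th bit must be 1 (blue)

assert recovered == u                        # the J queries identify f exactly
```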

B.8 Proof Of Theorem 8

First, assuming that we have access to $r_1, \ldots, r_K$, we access the values of the first $\bar l$ terms, $v(t_1), \ldots, v(t_{\bar l})$. We then randomly access $\frac{r_1 - \bar l}{r_1}\, s$ items from $\tilde v(t_{\bar l + 1}), \ldots, \tilde v(t_{r_1})$; randomly access $\frac{r_2 - r_1}{r_2}\, s$ items from $\tilde v(t_{r_1+1}), \ldots, \tilde v(t_{r_2})$; and so on, until we randomly access $\frac{r_K - r_{K-1}}{r_K}\, s$ items from $\tilde v(t_{r_{K-1}+1}), \ldots, \tilde v(t_{r_K})$.

When we compute QUERY$(r_i, T, \tilde v)$, we randomly subsample the accessed items in the range $1, \ldots, \bar l$ with probability $\frac{r_i - r_{i-1}}{r_i}$; randomly subsample the accessed items in the range $\bar l + 1, \ldots, r_1$ with probability $\frac{r_i - r_{i-1}}{r_i} \cdot \frac{r_1}{r_1 - \bar l}$; randomly subsample the accessed items in the range $r_1 + 1, \ldots, r_2$ with probability $\frac{r_i - r_{i-1}}{r_i} \cdot \frac{r_2}{r_2 - r_1}$; and so on; randomly subsample the accessed items in the range $r_{i-2} + 1, \ldots, r_{i-1}$ with probability $\frac{r_i - r_{i-1}}{r_i} \cdot \frac{r_{i-1}}{r_{i-1} - r_{i-2}}$; and take all accessed items in the range $r_{i-1} + 1, \ldots, r_i$. We use the empirical mean of these subsampled items to compute the estimate $\hat p(r_i)$. Using Hoeffding's inequality, Eq. (2), we know that this estimate is a $\beta$-approximation with confidence at least $1 - \delta/K$.

Next, applying the union bound over all $i \in \{1, \ldots, K\}$, we have overall confidence at least $1 - \delta$ that the estimates are $\beta$-approximations simultaneously for all $K$ points $r_1, \ldots, r_K$.
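A condensed sketch of this stratified scheme (function and variable names are ours; for brevity it re-draws the pre-sampled items rather than caching them across queries, and v_noisy stands for the noisy annotations $\tilde v$):

```python
import numpy as np

def stratified_estimate(v_noisy, r, s, i, rng):
    """Estimate p(r_i) as in the proof: r = [l_bar, r_1, ..., r_K] gives the
    stratum boundaries, s is the per-stratum sampling budget, and items from
    earlier strata are thinned so inclusion rates are uniform over 1..r_i."""
    q = (r[i] - r[i - 1]) / r[i]                # base subsampling probability
    kept = []
    prefix = v_noisy[: r[0]]                    # 1..l_bar is fully annotated;
    kept.append(prefix[rng.random(r[0]) < q])   # keep each with probability q
    for k in range(1, i):                       # earlier strata (r_{k-1}, r_k]
        lo, hi = r[k - 1], r[k]
        n_k = int(s * (hi - lo) / hi)           # items pre-sampled there
        idx = rng.choice(np.arange(lo, hi), size=n_k, replace=False)
        keep = rng.random(n_k) < q * hi / (hi - lo)
        kept.append(v_noisy[idx[keep]])
    lo, hi = r[i - 1], r[i]                     # newest stratum: keep all of
    n_i = int(s * (hi - lo) / hi)               # its pre-sampled items
    kept.append(v_noisy[rng.choice(np.arange(lo, hi), size=n_i, replace=False)])
    return float(np.concatenate(kept).mean())   # empirical mean ~ p(r_i)

rng = np.random.default_rng(0)
v_noisy = (rng.random(100_000) < 0.3).astype(float)  # synthetic annotations
print(stratified_estimate(v_noisy, [500, 5_000, 50_000, 100_000], s=400, i=2, rng=rng))
```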

However, we do not know the locations of $r_1, \ldots, r_K$ upfront. We therefore adopt an iterative-deepening-style adaptive sampling scheme, as detailed in the last two paragraphs of Section 4. In particular, we gradually increase the number of samples in each interval as we increase $K$, and we sample more points for intervals that fall below the sample-density thresholds when we introduce new query points.

Table 1: Number of queries of the precision function issued by various algorithms for different size variants of PPDB-36K.
Table 2: Number of annotations used by various algorithms for different size variants of PPDB-36K.