Get matches for a query in a CQP corpus (subcorpus, partition etc.), optionally using the CQP syntax of the Corpus Workbench (CWB).

cpos(.Object, ...)

# S4 method for corpus
cpos(
  .Object,
  query,
  p_attribute = getOption("polmineR.p_attribute"),
  cqp = is.cqp,
  regex = FALSE,
  check = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for character
cpos(
  .Object,
  query,
  p_attribute = getOption("polmineR.p_attribute"),
  cqp = is.cqp,
  check = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for slice
cpos(
  .Object,
  query,
  cqp = is.cqp,
  check = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  verbose = TRUE,
  ...
)

# S4 method for partition
cpos(
  .Object,
  query,
  cqp = is.cqp,
  check = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  verbose = TRUE,
  ...
)

# S4 method for subcorpus
cpos(
  .Object,
  query,
  cqp = is.cqp,
  check = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  verbose = TRUE,
  ...
)

# S4 method for matrix
cpos(.Object)

# S4 method for hits
cpos(.Object)

# S4 method for `NULL`
cpos(.Object)

Arguments

.Object

A length-one character vector indicating a CWB corpus, a corpus, subcorpus or partition object, a hits object, or a matrix with corpus positions.

...

Used for reasons of backwards compatibility to process arguments that have been renamed (e.g. pAttribute).

query

A character vector providing one or multiple queries (token or CQP query). Token ids (i.e. integer values) are also accepted.

p_attribute

The p-attribute to search. Needs to be stated only if query is not a CQP query. Defaults to getOption("polmineR.p_attribute").

cqp

Either logical (TRUE if query is a CQP query), or a function to check whether query is a CQP query or not (defaults to is.cqp auxiliary function).

regex

Interpret query as a regular expression.

check

A logical value, whether to check validity of CQP query using check_cqp_query.

verbose

A logical value, whether to show messages.
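The cqp argument described above accepts either a logical value or a function. The following is a minimal sketch of how such an argument could be resolved, not polmineR's implementation: looks_like_cqp() and resolve_cqp() are hypothetical helpers, and the real is.cqp function may apply different rules.

```r
# Hypothetical stand-in for a CQP detector: treats a query as CQP syntax
# if it contains a double quote or an opening square bracket.
looks_like_cqp <- function(query) {
  grepl('"', query, fixed = TRUE) | grepl("[", query, fixed = TRUE)
}

# Hypothetical resolver: call `cqp` if it is a function, otherwise
# interpret it as a logical flag.
resolve_cqp <- function(query, cqp = looks_like_cqp) {
  if (is.function(cqp)) cqp(query) else isTRUE(cqp)
}

resolve_cqp('"Saudi" "Arabia"')   # detector decides
resolve_cqp("oil")                # detector decides
resolve_cqp("oil", cqp = TRUE)    # caller overrides the detector
```

Passing a function as the default keeps the common case automatic while still letting the caller force one interpretation with TRUE or FALSE.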

Value

Unless .Object is a matrix, a hits object or NULL (see Details), the return value is a matrix with two columns. The first column reports the left (starting) corpus position (cpos) of each hit, the second column the right (ending) corpus position of the same hit. The number of rows equals the number of hits. If there are no hits, NULL is returned.
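As an illustration of how a returned region matrix can be read, the sketch below rebuilds by hand the matrix shown for the '"Saudi" "Arabia"' query in the Examples section, so it runs without polmineR. The variable names are chosen for illustration only.

```r
# Region matrix copied from the '"Saudi" "Arabia"' example on this page.
hits <- matrix(
  c(2193L, 2194L,
    2246L, 2247L,
    2614L, 2615L,
    2935L, 2936L,
    3012L, 3013L,
    3036L, 3037L),
  ncol = 2, byrow = TRUE
)

n_hits <- nrow(hits)                     # one row per hit
match_len <- hits[, 2] - hits[, 1] + 1L  # number of tokens covered by each hit

# A query without matches yields NULL, so guard before calling nrow().
no_hits <- NULL
n_none <- if (is.null(no_hits)) 0L else nrow(no_hits)
```

Each hit here spans two tokens, which is why the right position is one larger than the left position in every row.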

Details

If the cpos()-method is applied to a character or partition object, the result is a two-column matrix with the regions (start and end corpus positions of the matches) for a query. CQP syntax can be used. The encoding of the query is adjusted to conform to the encoding of the CWB corpus. If there are no matches, NULL is returned.

If the cpos()-method is called on a matrix object, the cpos matrix is unfolded: the return value is an integer vector with the individual corpus positions.

If .Object is a hits object, an integer vector is returned with the individual corpus positions.

If .Object is a matrix, it is assumed to be a region matrix, i.e. a two-column matrix with left and right corpus positions in the first and second column, respectively. For many operations, such as decoding the token stream, it is necessary to inflate the denoted regions into a vector of all corpus positions covered by the regions defined in the matrix. The cpos-method for matrix objects performs this task robustly.
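The inflation of a region matrix can be sketched in base R as follows. This is an illustration of the idea under simple assumptions, not the code polmineR runs: unfold_regions() is a hypothetical helper.

```r
# Hypothetical helper: expand every [left, right] row of a two-column
# region matrix into the full sequence of corpus positions it covers.
unfold_regions <- function(regions) {
  stopifnot(is.matrix(regions), ncol(regions) == 2L)
  expanded <- lapply(
    seq_len(nrow(regions)),
    function(i) seq.int(regions[i, 1], regions[i, 2])
  )
  # as.integer() turns the NULL from an empty list into integer(0).
  as.integer(unlist(expanded, use.names = FALSE))
}

regions <- matrix(c(2193L, 2194L, 2246L, 2247L), ncol = 2, byrow = TRUE)
unfold_regions(regions)
```

A single-token hit has identical left and right positions and therefore contributes exactly one corpus position to the result.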

If .Object is NULL, the method will return an empty integer vector. Used internally to handle NULL objects that may be returned from the cpos-method if no matches are obtained for a query.
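Why a method for NULL is useful can be seen in a small S4 sketch. The generic positions() below is hypothetical and only mimics the dispatch pattern described here. It is not polmineR's cpos().

```r
library(methods)

# Hypothetical generic that mirrors the matrix/NULL dispatch pattern.
setGeneric("positions", function(x) standardGeneric("positions"))

# Region matrix: expand each row into its corpus positions.
setMethod("positions", "matrix", function(x) {
  as.integer(unlist(Map(seq.int, x[, 1], x[, 2]), use.names = FALSE))
})

# NULL (no matches): return an empty integer vector instead of failing.
setMethod("positions", "NULL", function(x) integer())

length(positions(NULL))
```

With the NULL method in place, downstream code can call the generic on the result of a query without first checking whether there were any hits.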

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
# looking up single tokens
cpos("REUTERS", query = "oil")
#>       [,1] [,2]
#>  [1,]   15   15
#>  [2,]   50   50
#>  [3,]   57   57
#>  [4,]   72   72
#>  [5,]   89   89
#>  [6,]  119  119
#>  [7,]  121  121
#>  [8,]  129  129
#>  [9,]  163  163
#> [10,]  173  173
#> [11,]  177  177
#> [12,]  200  200
#> [13,]  243  243
#> [14,]  300  300
#> [15,]  370  370
#> [16,]  473  473
#> [17,]  494  494
#> [18,]  549  549
#> [19,]  585  585
#> [20,]  608  608
#> [21,]  773  773
#> [22,]  780  780
#> [23,]  840  840
#> [24,] 1053 1053
#> [25,] 1091 1091
#> [26,] 1183 1183
#> [27,] 1244 1244
#> [28,] 1264 1264
#> [29,] 1294 1294
#> [30,] 1570 1570
#> [31,] 1689 1689
#> [32,] 1704 1704
#> [33,] 1818 1818
#> [34,] 1830 1830
#> [35,] 1953 1953
#> [36,] 2087 2087
#> [37,] 2112 2112
#> [38,] 2165 2165
#> [39,] 2189 2189
#> [40,] 2207 2207
#> [41,] 2295 2295
#> [42,] 2346 2346
#> [43,] 2451 2451
#> [44,] 2498 2498
#> [45,] 2520 2520
#> [46,] 2641 2641
#> [47,] 2785 2785
#> [48,] 2843 2843
#> [49,] 2875 2875
#> [50,] 2892 2892
#> [51,] 2920 2920
#> [52,] 2929 2929
#> [53,] 2984 2984
#> [54,] 3008 3008
#> [55,] 3026 3026
#> [56,] 3053 3053
#> [57,] 3072 3072
#> [58,] 3095 3095
#> [59,] 3144 3144
#> [60,] 3152 3152
#> [61,] 3183 3183
#> [62,] 3211 3211
#> [63,] 3252 3252
#> [64,] 3314 3314
#> [65,] 3319 3319
#> [66,] 3368 3368
#> [67,] 3412 3412
#> [68,] 3463 3463
#> [69,] 3468 3468
#> [70,] 3517 3517
#> [71,] 3585 3585
#> [72,] 3611 3611
#> [73,] 3645 3645
#> [74,] 3710 3710
#> [75,] 3749 3749
#> [76,] 3785 3785
#> [77,] 3835 3835
#> [78,] 3999 3999
corpus("REUTERS") %>% cpos(query = "oil")
#>       [,1] [,2]
#>  [1,]   15   15
#>  [2,]   50   50
#>  [3,]   57   57
#>  [4,]   72   72
#>  [5,]   89   89
#>  [6,]  119  119
#>  [7,]  121  121
#>  [8,]  129  129
#>  [9,]  163  163
#> [10,]  173  173
#> [11,]  177  177
#> [12,]  200  200
#> [13,]  243  243
#> [14,]  300  300
#> [15,]  370  370
#> [16,]  473  473
#> [17,]  494  494
#> [18,]  549  549
#> [19,]  585  585
#> [20,]  608  608
#> [21,]  773  773
#> [22,]  780  780
#> [23,]  840  840
#> [24,] 1053 1053
#> [25,] 1091 1091
#> [26,] 1183 1183
#> [27,] 1244 1244
#> [28,] 1264 1264
#> [29,] 1294 1294
#> [30,] 1570 1570
#> [31,] 1689 1689
#> [32,] 1704 1704
#> [33,] 1818 1818
#> [34,] 1830 1830
#> [35,] 1953 1953
#> [36,] 2087 2087
#> [37,] 2112 2112
#> [38,] 2165 2165
#> [39,] 2189 2189
#> [40,] 2207 2207
#> [41,] 2295 2295
#> [42,] 2346 2346
#> [43,] 2451 2451
#> [44,] 2498 2498
#> [45,] 2520 2520
#> [46,] 2641 2641
#> [47,] 2785 2785
#> [48,] 2843 2843
#> [49,] 2875 2875
#> [50,] 2892 2892
#> [51,] 2920 2920
#> [52,] 2929 2929
#> [53,] 2984 2984
#> [54,] 3008 3008
#> [55,] 3026 3026
#> [56,] 3053 3053
#> [57,] 3072 3072
#> [58,] 3095 3095
#> [59,] 3144 3144
#> [60,] 3152 3152
#> [61,] 3183 3183
#> [62,] 3211 3211
#> [63,] 3252 3252
#> [64,] 3314 3314
#> [65,] 3319 3319
#> [66,] 3368 3368
#> [67,] 3412 3412
#> [68,] 3463 3463
#> [69,] 3468 3468
#> [70,] 3517 3517
#> [71,] 3585 3585
#> [72,] 3611 3611
#> [73,] 3645 3645
#> [74,] 3710 3710
#> [75,] 3749 3749
#> [76,] 3785 3785
#> [77,] 3835 3835
#> [78,] 3999 3999
corpus("REUTERS") %>% subset(grepl("saudi-arabia", places)) %>% cpos(query = "oil")
#>       [,1] [,2]
#>  [1,] 1689 1689
#>  [2,] 1704 1704
#>  [3,] 2165 2165
#>  [4,] 2189 2189
#>  [5,] 2207 2207
#>  [6,] 2295 2295
#>  [7,] 2346 2346
#>  [8,] 2451 2451
#>  [9,] 2498 2498
#> [10,] 2520 2520
#> [11,] 2641 2641
#> [12,] 2785 2785
#> [13,] 2843 2843
#> [14,] 2875 2875
#> [15,] 2892 2892
#> [16,] 2920 2920
#> [17,] 2929 2929
#> [18,] 2984 2984
#> [19,] 3008 3008
#> [20,] 3026 3026
#> [21,] 3053 3053
partition("REUTERS", places = "saudi-arabia", regex = TRUE) %>% cpos(query = "oil")
#> ... get encoding: latin1
#> ... get cpos and strucs
#>       [,1] [,2]
#>  [1,] 1689 1689
#>  [2,] 1704 1704
#>  [3,] 2165 2165
#>  [4,] 2189 2189
#>  [5,] 2207 2207
#>  [6,] 2295 2295
#>  [7,] 2346 2346
#>  [8,] 2451 2451
#>  [9,] 2498 2498
#> [10,] 2520 2520
#> [11,] 2641 2641
#> [12,] 2785 2785
#> [13,] 2843 2843
#> [14,] 2875 2875
#> [15,] 2892 2892
#> [16,] 2920 2920
#> [17,] 2929 2929
#> [18,] 2984 2984
#> [19,] 3008 3008
#> [20,] 3026 3026
#> [21,] 3053 3053
# using CQP query syntax
cpos("REUTERS", query = '"Saudi" "Arabia"')
#>      [,1] [,2]
#> [1,] 2193 2194
#> [2,] 2246 2247
#> [3,] 2614 2615
#> [4,] 2935 2936
#> [5,] 3012 3013
#> [6,] 3036 3037
corpus("REUTERS") %>% cpos(query = '"Saudi" "Arabia"')
#>      [,1] [,2]
#> [1,] 2193 2194
#> [2,] 2246 2247
#> [3,] 2614 2615
#> [4,] 2935 2936
#> [5,] 3012 3013
#> [6,] 3036 3037
corpus("REUTERS") %>% subset(grepl("saudi-arabia", places)) %>% cpos(query = '"Saudi" "Arabia"', cqp = TRUE)
#>      [,1] [,2]
#> [1,] 2193 2194
#> [2,] 2246 2247
#> [3,] 2614 2615
#> [4,] 2935 2936
#> [5,] 3012 3013
#> [6,] 3036 3037
partition("REUTERS", places = "saudi-arabia", regex = TRUE) %>% cpos(query = '"Saudi" "Arabia"', cqp = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
#>      [,1] [,2]
#> [1,] 2193 2194
#> [2,] 2246 2247
#> [3,] 2614 2615
#> [4,] 2935 2936
#> [5,] 3012 3013
#> [6,] 3036 3037