Mtas can produce statistics on the terms used in each of the listed documents. To get this information in a Solr request, the following parameter should be provided in addition to the parameter that enables the Mtas query component.
Parameter | Value | Obligatory |
---|---|---|
mtas.document | true | yes |
Multiple document results can be produced within the same request. To distinguish them, a unique identifier has to be provided for each of the required document results.
Parameter | Value | Info | Obligatory |
---|---|---|---|
mtas.document.<identifier>.key | <string> | key used in response | no |
mtas.document.<identifier>.field | <string> | Mtas field | yes |
mtas.document.<identifier>.prefix | <string> | prefix | yes |
mtas.document.<identifier>.number | <double> | create list with specified number of most frequent items | no |
mtas.document.<identifier>.type | <string> | required type of statistics | no |
mtas.document.<identifier>.regexp | <string> | regular expression condition on term | no |
mtas.document.<identifier>.ignoreRegexp | <string> | regular expression condition for terms that have to be ignored | no |
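As a minimal sketch of how these parameters combine into a request, the Python below assembles the query string for one document result. The helper name `mtas_document_params` is hypothetical and not part of Mtas; only the parameter names are taken from the tables above.

```python
from urllib.parse import urlencode


def mtas_document_params(identifier, field, prefix, **options):
    """Build Solr parameters for one Mtas document result.

    Hypothetical helper for illustration. `options` carries the optional
    settings (key, number, type, regexp, ignoreRegexp); None is skipped.
    """
    base = f"mtas.document.{identifier}"
    params = {
        "mtas": "true",            # enable the Mtas query component
        "mtas.document": "true",   # enable document statistics
        f"{base}.field": field,    # obligatory
        f"{base}.prefix": prefix,  # obligatory
    }
    for name, value in options.items():
        if value is not None:
            params[f"{base}.{name}"] = str(value)
    return params


params = mtas_document_params(0, "text", "t", key="words", type="all")
query_string = urlencode(params)  # percent-encodes values where needed
print(query_string)
```

The regular Solr parameters (`q`, `fq`, `rows`, and so on) still have to be added to the resulting query string.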
A list can be provided, specifying the set of terms to consider when computing the result.
Parameter | Value | Info | Obligatory |
---|---|---|---|
mtas.document.<identifier>.list | <string> | comma separated list of values | yes |
mtas.document.<identifier>.listRegexp | <boolean> | values in the list are to be interpreted as regular expressions | no |
mtas.document.<identifier>.listExpand | <boolean> | expand the matches on values from list | no |
mtas.document.<identifier>.listExpandNumber | <double> | number of expansions of matches on values from list | no |
An ignore list can also be provided, specifying the set of terms not to consider when computing the result.
Parameter | Value | Info | Obligatory |
---|---|---|---|
mtas.document.<identifier>.ignoreList | <string> | comma separated list of values | yes |
mtas.document.<identifier>.ignoreListRegexp | <boolean> | values in the ignore list are to be interpreted as regular expressions | no |
Example
Statistics for the set of unique tokens with prefix t (words) for each listed document.
Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t&mtas.document.0.key=words&mtas.document.0.type=all&fl=*&start=0&rows=2&wt=json&indent=true
"mtas":{ "document":[{ "key":"words", "list":[{ "documentKey":"4115a95c-011c-11e4-b0ff-51bcbd7c379f", "sumsq":113964.0, "populationvariance":126.5639231447591, "max":166.0, "sum":3336.0, "kurtosis":92.19837080635624, "standarddeviation":11.257199352433314, "n":789, "quadraticmean":12.01836364230935, "min":1.0, "median":1.0, "variance":126.72453726042504, "mean":4.228136882129286, "geometricmean":1.9285975498109995, "sumoflogs":518.209740627951, "skewness":8.377350653392202}, { "documentKey":"4115aac4-011c-11e4-b0ff-51bcbd7c379f", "sumsq":25489.0, "populationvariance":35.695641666666134, "max":77.0, "sum":1563.0, "kurtosis":72.57030420433823, "standarddeviation":5.979568021426876, "n":600, "quadraticmean":6.517796151051877, "min":1.0, "median":1.0, "variance":35.75523372287092, "mean":2.6050000000000004, "geometricmean":1.5249529474773036, "sumoflogs":253.1781332820801, "skewness":7.70682353088895}]}]}
Example
The most frequent tokens with prefix t (words) that consist only of the letters a-z and have a minimum length of 5, for each listed document.
Regexp
[a-z]{5,}
Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=NLContent_mtas&mtas.document.0.prefix=t&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B5%2C%7D&mtas.document.0.number=5&fl=%2A&start=0&rows=2&wt=json&indent=true
"mtas":{ "document":[{ "key":"list of words", "list":[{ "documentKey":"c0c4200c-1eee-11e5-b891-f48ce0be173a", "list":[{ "sum":471, "key":"zijne"}, { "sum":317, "key":"eenen"}, { "sum":304, "key":"zegde"}, { "sum":249, "key":"hebben"}, { "sum":229, "key":"welke"}], "mean":4.552402402402403, "sum":30319, "n":6660}, { "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a", "list":[{ "sum":348, "key":"heeft"}, { "sum":243, "key":"hebben"}, { "sum":199, "key":"prins"}, { "sum":173, "key":"vader"}, { "sum":161, "key":"komen"}], "mean":4.641632967456191, "sum":24104, "n":5193}]}]}
Example
Statistics for a provided list of words for each listed document.
List
koe,paard,schaap,geit,kip
Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22koe%5C%22%7Ct_lc%3D%5C%22paard%5C%22%7Ct_lc%3D%5C%22schaap%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.list=koe%2Cpaard%2Cschaap%2Cgeit%2Ckip&mtas.document.0.listRegexp=false&mtas.document.0.listExpand=false&mtas.document.0.number=100&fl=%2A&start=0&rows=2&wt=json&indent=true
"mtas":{ "document":[{ "key":"list of words", "list":[{ "documentKey":"c0c46b7a-1eee-11e5-b891-f48ce0be173a", "list":[{ "sum":3, "key":"paard"}, { "sum":2, "key":"schaap"}], "mean":2.5, "sum":5, "n":2}, { "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a", "list":[{ "sum":31, "key":"paard"}, { "sum":1, "key":"kip"}], "mean":16.0, "sum":32, "n":2}]}]}
Example
Statistics for a provided list of regular expressions, ignoring another list of regular expressions, for each listed document.
Regexp
[a-z]{7,}
Ignore
[a-z]{10,}
List
een.*,.*heid
Ignore list
een.*heid,ee.*nheid
Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22eenheid%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=advanced+list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B7%2C%7D&mtas.document.0.list=een.%2A%2C.%2Aheid&mtas.document.0.listRegexp=true&mtas.document.0.listExpand=true&mtas.document.0.listExpandNumber=3&mtas.document.0.ignoreRegexp=%5Ba-z%5D%7B10%2C%7D&mtas.document.0.ignoreList=een.%2Aheid%2Cee.%2Anheid&mtas.document.0.ignoreListRegexp=true&mtas.document.0.number=10&fl=text_numberOfPositions%2CNLCore_NLIdentification_nederlabID%2CNLProfile_name%2CNLTitle_title&start=0&rows=2&wt=json&indent=true
"mtas":{ "document":[{ "key":"advanced list of words", "list":[{ "documentKey":"c0c41486-1eee-11e5-b891-f48ce0be173a", "list":[{ "sum":166, "list":{ "droefheid":{ "sum":36}, "godheid":{ "sum":22}, "waarheid":{ "sum":22}}, "key":".*heid"}, { "sum":93, "list":{ "eenigen":{ "sum":46}, "eensklaps":{ "sum":32}, "eenigste":{ "sum":3}}, "key":"een.*"}], "mean":5.886363636363637, "sum":259, "n":44}, { "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a", "list":[{ "sum":36, "list":{ "afscheid":{ "sum":12}, "hoogheid":{ "sum":4}, "bezigheid":{ "sum":3}}, "key":".*heid"}, { "sum":24, "list":{ "eenvoudig":{ "sum":15}, "eenzame":{ "sum":3}, "eenmaal":{ "sum":2}}, "key":"een.*"}], "mean":3.1578947368421053, "sum":60, "n":19}]}]}
Lucene
To get statistics on the terms used in the listed documents directly in Lucene, ComponentDocument can be used together with the provided collect method.