Document

Mtas can produce statistics on the terms used in each of the listed documents. To get this information in a Solr request, provide the following parameter in addition to the parameter that enables the Mtas query component.

Parameter	Value	Obligatory
mtas.document	true	yes

Multiple document results can be produced within the same request. To distinguish them, a unique identifier has to be provided for each of the required document results.

Parameter	Value	Info	Obligatory
mtas.document.<identifier>.key	<string>	key used in response	no
mtas.document.<identifier>.field	<string>	Mtas field	yes
mtas.document.<identifier>.prefix	<string>	prefix	yes
mtas.document.<identifier>.number	<double>	create list with specified number of most frequent items	no
mtas.document.<identifier>.type	<string>	required type of statistics	no
mtas.document.<identifier>.regexp	<string>	regular expression condition on term	no
mtas.document.<identifier>.ignoreRegexp	<string>	regular expression condition for terms that have to be ignored	no
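As a rough illustration of how the parameters above combine, the following Python sketch builds the query string for a request with two document results. The helper name `document_params` and the second result (`frequent`) are hypothetical and only serve the illustration; the parameter names and the values for `q`, `mtas` and `mtas.document` are the ones documented on this page.

```python
from urllib.parse import urlencode


def document_params(identifier, field, prefix, key=None, number=None,
                    stat_type=None, regexp=None, ignore_regexp=None):
    """Build the mtas.document.* parameters for one document result.

    Hypothetical helper: only `field` and `prefix` are obligatory,
    all other parameters are added when a value is given.
    """
    base = f"mtas.document.{identifier}"
    params = {f"{base}.field": field, f"{base}.prefix": prefix}
    optional = {"key": key, "number": number, "type": stat_type,
                "regexp": regexp, "ignoreRegexp": ignore_regexp}
    for name, value in optional.items():
        if value is not None:
            params[f"{base}.{name}"] = value
    return params


# Enable the Mtas query component and the document statistics.
params = {"q": "*:*", "mtas": "true", "mtas.document": "true"}
# Two results in one request, distinguished by the identifiers 0 and 1.
params.update(document_params(0, "text", "t", key="words", stat_type="all"))
params.update(document_params(1, "text", "t", key="frequent", number=5,
                              stat_type="n,sum,mean", regexp="[a-z]{5,}"))

query_string = urlencode(params)
print(query_string)
```

The identifiers only need to be unique within the request; the optional `key` is what shows up in the response to tell the results apart.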

List

A list can be provided, specifying the set of terms to consider when computing the result.

Parameter	Value	Info	Obligatory
mtas.document.<identifier>.list	<string>	comma separated list of values	yes
mtas.document.<identifier>.listRegexp	<boolean>	interpret the values in the list as regular expressions	no
mtas.document.<identifier>.listExpand	<boolean>	expand the matches on values from list	no
mtas.document.<identifier>.listExpandNumber	<double>	number of expansions of matches on values from list	no

Ignore list

An ignore list can also be provided, specifying the set of terms to exclude when computing the result.

Parameter	Value	Info	Obligatory
mtas.document.<identifier>.ignoreList	<string>	comma separated list of values	yes
mtas.document.<identifier>.ignoreListRegexp	<boolean>	interpret the values in the ignore list as regular expressions	no
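The list and ignore list parameters can be assembled in the same way as the other parameters. The sketch below is a minimal illustration, not part of Mtas: `list_params` is a hypothetical helper, and the example values are the ones used in the Ignore example on this page. It assumes that no value contains a comma itself, since the comma is the separator.

```python
from urllib.parse import urlencode


def list_params(identifier, terms, ignore_terms=None, list_regexp=False,
                ignore_list_regexp=False, expand=False, expand_number=None):
    """Build the list and ignore list parameters for one document result."""
    base = f"mtas.document.{identifier}"

    def as_text(flag):
        # Booleans are written as true/false, as in the requests on this page.
        return "true" if flag else "false"

    params = {
        f"{base}.list": ",".join(terms),
        f"{base}.listRegexp": as_text(list_regexp),
        f"{base}.listExpand": as_text(expand),
    }
    if expand_number is not None:
        params[f"{base}.listExpandNumber"] = expand_number
    if ignore_terms:
        params[f"{base}.ignoreList"] = ",".join(ignore_terms)
        params[f"{base}.ignoreListRegexp"] = as_text(ignore_list_regexp)
    return params


params = list_params(0, ["een.*", ".*heid"],
                     ignore_terms=["een.*heid", "ee.*nheid"],
                     list_regexp=True, ignore_list_regexp=True,
                     expand=True, expand_number=3)
encoded = urlencode(params)
print(encoded)
```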

Examples

Basic : Statistics on unique words for each document
Regexp : Most frequent words containing only the letters a-z, with a minimum length of 5
List : Statistics for a provided list of words
Ignore : Statistics for a provided list of regular expressions, ignoring another list of regular expressions

Basic

Example
Statistics for the set of unique tokens with prefix t (words) for each listed document.

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t&mtas.document.0.key=words&mtas.document.0.type=all&fl=*&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"words",
        "list":[{
            "documentKey":"4115a95c-011c-11e4-b0ff-51bcbd7c379f",
            "sumsq":113964.0,
            "populationvariance":126.5639231447591,
            "max":166.0,
            "sum":3336.0,
            "kurtosis":92.19837080635624,
            "standarddeviation":11.257199352433314,
            "n":789,
            "quadraticmean":12.01836364230935,
            "min":1.0,
            "median":1.0,
            "variance":126.72453726042504,
            "mean":4.228136882129286,
            "geometricmean":1.9285975498109995,
            "sumoflogs":518.209740627951,
            "skewness":8.377350653392202},
          {
            "documentKey":"4115aac4-011c-11e4-b0ff-51bcbd7c379f",
            "sumsq":25489.0,
            "populationvariance":35.695641666666134,
            "max":77.0,
            "sum":1563.0,
            "kurtosis":72.57030420433823,
            "standarddeviation":5.979568021426876,
            "n":600,
            "quadraticmean":6.517796151051877,
            "min":1.0,
            "median":1.0,
            "variance":35.75523372287092,
            "mean":2.6050000000000004,
            "geometricmean":1.5249529474773036,
            "sumoflogs":253.1781332820801,
            "skewness":7.70682353088895}]}]}

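Several of the values returned for type=all are related to each other through the textbook definitions of these statistics. The sketch below recomputes six of them from n, sum, sumsq and sumoflogs for the first document in the response above and compares the result with the reported values. It uses the standard formulas only; it is an assumption that Mtas arrives at its numbers the same way, although the reported values are consistent with it.

```python
import math

# Values copied from the first document in the Basic response.
reported = {
    "n": 789,
    "sum": 3336.0,
    "sumsq": 113964.0,
    "sumoflogs": 518.209740627951,
    "mean": 4.228136882129286,
    "variance": 126.72453726042504,
    "populationvariance": 126.5639231447591,
    "standarddeviation": 11.257199352433314,
    "quadraticmean": 12.01836364230935,
    "geometricmean": 1.9285975498109995,
}


def derive(n, total, sumsq, sumoflogs):
    """Recompute derived statistics from the four basic sums."""
    # Sum of squared deviations from the mean.
    centered = sumsq - total * total / n
    variance = centered / (n - 1)          # sample variance
    return {
        "mean": total / n,
        "variance": variance,
        "populationvariance": centered / n,
        "standarddeviation": math.sqrt(variance),
        "quadraticmean": math.sqrt(sumsq / n),
        "geometricmean": math.exp(sumoflogs / n),
    }


derived = derive(reported["n"], reported["sum"],
                 reported["sumsq"], reported["sumoflogs"])
for name, value in derived.items():
    matches = math.isclose(value, reported[name], rel_tol=1e-6)
    print(f"{name}: recomputed={value} reported={reported[name]} match={matches}")
```

Read this way, n would be the number of unique terms with prefix t in the document and sum the total number of their occurrences, so mean is the average number of occurrences per unique term.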
Regexp

Example
The five most frequent tokens with prefix t (words) that contain only the letters a-z and have a minimum length of 5, for each listed document.

Regexp

[a-z]{5,}

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=NLContent_mtas&mtas.document.0.prefix=t&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B5%2C%7D&mtas.document.0.number=5&fl=%2A&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"list of words",
        "list":[{
            "documentKey":"c0c4200c-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":471,
                "key":"zijne"},
              {
                "sum":317,
                "key":"eenen"},
              {
                "sum":304,
                "key":"zegde"},
              {
                "sum":249,
                "key":"hebben"},
              {
                "sum":229,
                "key":"welke"}],
            "mean":4.552402402402403,
            "sum":30319,
            "n":6660},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":348,
                "key":"heeft"},
              {
                "sum":243,
                "key":"hebben"},
              {
                "sum":199,
                "key":"prins"},
              {
                "sum":173,
                "key":"vader"},
              {
                "sum":161,
                "key":"komen"}],
            "mean":4.641632967456191,
            "sum":24104,
            "n":5193}]}]}

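When number is set, each document entry carries a nested list of terms next to the requested statistics. The following sketch shows one way to flatten that structure in Python. The function name `top_terms` is hypothetical; the data is the first document of the response above, wrapped in braces so that the fragment parses as JSON.

```python
import json

# First document of the Regexp response, copied from above.
fragment = """
"mtas":{"document":[{"key":"list of words","list":[{
  "documentKey":"c0c4200c-1eee-11e5-b891-f48ce0be173a",
  "list":[{"sum":471,"key":"zijne"},{"sum":317,"key":"eenen"},
          {"sum":304,"key":"zegde"},{"sum":249,"key":"hebben"},
          {"sum":229,"key":"welke"}],
  "mean":4.552402402402403,"sum":30319,"n":6660}]}]}
"""
response = json.loads("{" + fragment + "}")


def top_terms(response, key):
    """Return {documentKey: [(term, sum), ...]} for the result named `key`."""
    result = {}
    for document_result in response["mtas"]["document"]:
        if document_result["key"] != key:
            continue
        for doc in document_result["list"]:
            # The nested list is only present when terms were requested.
            terms = [(t["key"], t["sum"]) for t in doc.get("list", [])]
            result[doc["documentKey"]] = terms
    return result


for document_key, terms in top_terms(response, "list of words").items():
    print(document_key)
    for term, count in terms:
        print(f"  {term}: {count}")
```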
List

Example
Statistics for a provided list of words for each listed document.

List
koe,paard,schaap,geit,kip

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22koe%5C%22%7Ct_lc%3D%5C%22paard%5C%22%7Ct_lc%3D%5C%22schaap%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.list=koe%2Cpaard%2Cschaap%2Cgeit%2Ckip&mtas.document.0.listRegexp=false&mtas.document.0.listExpand=false&mtas.document.0.number=100&fl=%2A&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"list of words",
        "list":[{
            "documentKey":"c0c46b7a-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":3,
                "key":"paard"},
              {
                "sum":2,
                "key":"schaap"}],
            "mean":2.5,
            "sum":5,
            "n":2},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":31,
                "key":"paard"},
              {
                "sum":1,
                "key":"kip"}],
            "mean":16.0,
            "sum":32,
            "n":2}]}]}

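The requests on this page are URL-encoded, which makes them hard to read. A short sketch using Python's standard library decodes the List request above back into its individual parameters; nothing here is specific to Mtas.

```python
from urllib.parse import parse_qs

# The List request from above, split over several lines for readability.
query_string = (
    "fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22koe"
    "%5C%22%7Ct_lc%3D%5C%22paard%5C%22%7Ct_lc%3D%5C%22schaap%5C%22%5D%22"
    "+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true"
    "&mtas.document.0.field=text&mtas.document.0.prefix=t_lc"
    "&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean"
    "&mtas.document.0.list=koe%2Cpaard%2Cschaap%2Cgeit%2Ckip"
    "&mtas.document.0.listRegexp=false&mtas.document.0.listExpand=false"
    "&mtas.document.0.number=100&fl=%2A&start=0&rows=2&wt=json&indent=true"
)

# parse_qs returns a list of values per name; each name occurs once here.
decoded = {name: values[0] for name, values in parse_qs(query_string).items()}

for name, value in decoded.items():
    print(f"{name} = {value}")

# The list parameter is a comma separated list of values.
words = decoded["mtas.document.0.list"].split(",")
print(words)
```

Decoded, the fq parameter restricts the listed documents to those containing koe, paard or schaap, while the statistics are computed for all five words in the list.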
Ignore

Example
Statistics for a provided list of regular expressions, ignoring another list of regular expressions for each listed document.

Regexp
[a-z]{7,}

Ignore
[a-z]{10,}

List
een.*,.*heid

Ignore list
een.*heid,ee.*nheid

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22eenheid%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=advanced+list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B7%2C%7D&mtas.document.0.list=een.%2A%2C.%2Aheid&mtas.document.0.listRegexp=true&mtas.document.0.listExpand=true&mtas.document.0.listExpandNumber=3&mtas.document.0.ignoreRegexp=%5Ba-z%5D%7B10%2C%7D&mtas.document.0.ignoreList=een.%2Aheid%2Cee.%2Anheid&mtas.document.0.ignoreListRegexp=true&mtas.document.0.number=10&fl=text_numberOfPositions%2CNLCore_NLIdentification_nederlabID%2CNLProfile_name%2CNLTitle_title&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"advanced list of words",
        "list":[{
            "documentKey":"c0c41486-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":166,
                "list":{
                  "droefheid":{
                    "sum":36},
                  "godheid":{
                    "sum":22},
                  "waarheid":{
                    "sum":22}},
                "key":".*heid"},
              {
                "sum":93,
                "list":{
                  "eenigen":{
                    "sum":46},
                  "eensklaps":{
                    "sum":32},
                  "eenigste":{
                    "sum":3}},
                "key":"een.*"}],
            "mean":5.886363636363637,
            "sum":259,
            "n":44},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":36,
                "list":{
                  "afscheid":{
                    "sum":12},
                  "hoogheid":{
                    "sum":4},
                  "bezigheid":{
                    "sum":3}},
                "key":".*heid"},
              {
                "sum":24,
                "list":{
                  "eenvoudig":{
                    "sum":15},
                  "eenzame":{
                    "sum":3},
                  "eenmaal":{
                    "sum":2}},
                "key":"een.*"}],
            "mean":3.1578947368421053,
            "sum":60,
            "n":19}]}]}

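To make the interplay of the four conditions in this example easier to follow, the sketch below emulates the selection in plain Python. It rests on two assumptions that are not stated on this page: that every expression has to match the whole term, and that all conditions have to hold together. The function `is_selected` is hypothetical and is not how Mtas evaluates the request; the term eenvoudigheid is an illustrative input that does not occur in the response.

```python
import re

REGEXP = "[a-z]{7,}"
IGNORE_REGEXP = "[a-z]{10,}"
LIST = ["een.*", ".*heid"]
IGNORE_LIST = ["een.*heid", "ee.*nheid"]


def is_selected(term):
    """Emulate the selection, assuming whole-term matches and AND logic."""

    def matches(pattern):
        return re.fullmatch(pattern, term) is not None

    if not matches(REGEXP):                        # must satisfy regexp
        return False
    if matches(IGNORE_REGEXP):                     # must not satisfy ignoreRegexp
        return False
    if not any(matches(p) for p in LIST):          # must match a list entry
        return False
    if any(matches(p) for p in IGNORE_LIST):       # must not match the ignore list
        return False
    return True


for term in ["droefheid", "eenigen", "eenheid", "hebben", "eenvoudigheid"]:
    print(term, is_selected(term))
```

Under these assumptions droefheid and eenigen are kept, eenheid is dropped by the ignore list, hebben is too short for the regexp, and eenvoudigheid is dropped by the ignore regexp because it has ten or more letters. This is consistent with the terms shown in the response above.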
Lucene

To get statistics on the terms used in the listed documents directly in Lucene, use ComponentDocument together with the provided collect method.
