==================================================
Wissen: Full-Text Search and Classification Engine
==================================================

.. contents:: Table of Contents


Introduction
============

Wissen Search & Classification Engine is a simple search engine, mostly written in Python and C, that I started in 2008.

At that time, I wanted to study an early version of Lucene_ through Lupy_ and CLucene_, and I had also built my own search engine as an exercise.

Its file format, numeric compression algorithm, and indexing process are quite similar to Lucene's, but the querying and result-fetching parts were built from my imagination. As a result, it's entirely unorthodox and possibly inefficient. Wissen's searching mechanism resembles the DNA-RNA-Protein working model, translated into 'Index File-Temporary Small Replication Buffer-Query Result':

* Every searcher (Cell) has a single group of index file handlers (the DNA group in the nucleus)
* Each thread has multiple small buffers (RNA) for replicating the needed parts of the index
* The query class (Ribosome) creates the query result (Protein) by synthesizing the buffers' information (RNAs)
* Repeat from the 2nd step if more results are expected

.. _Lucene: https://lucene.apache.org/core/
.. _Lupy: https://pypi.python.org/pypi/Lupy
.. _CLucene: http://clucene.sourceforge.net/


Installation
=============

Wissen contains C extensions, so a C compiler is required.

.. code:: bash

  pip install wissen

On POSIX systems, some additional packages may be required:

.. code:: bash
    
  apt-get install gcc zlib1g-dev
    

Quick Start
============

All field text should be str or utf-8 encoded bytes in Python 3.x, and unicode or a utf-8 encoded string in Python 2.7. Otherwise, the encoding should be specified explicitly.

Indexing and Searching
-------------------------

Here's an example that indexes a single document.

.. code:: python

  import wissen
  
  # indexing
  analyzer = wissen.standard_analyzer (max_term = 3000)
  col = wissen.collection ("./col", wissen.CREATE, analyzer)
  indexer = col.get_indexer ()
  
  song = "violin sonata in c k.301"
  composer = u"wolfgang amadeus mozart"
  birth = 1756
  home = "50.665629/8.048906" # Latitude / Longitude of Salzburg
  genre = "01011111" # (rock serenade jazz piano symphony opera quartet sonata)
  
  document = wissen.document ()
  
  # object to return, any object serializable by pickle
  document.content ([song, composer])
  
  # text content for generating auto snippets from the given query terms
  document.snippet (song)
  
  # add searchable fields
  document.field ("default", song, wissen.TEXT)
  document.field ("composer", composer, wissen.TEXT)
  document.field ("birth", birth, wissen.INT16)
  document.field ("genre", genre, wissen.BIT8)
  document.field ("home", home, wissen.COORD)
  
  indexer.add_document (document)
  indexer.close ()
  
  # searching
  analyzer = wissen.standard_analyzer (max_term = 8)
  col = wissen.collection ("./col", wissen.READ, analyzer)
  searcher = col.get_searcher ()
  print (searcher.query (u'violin', offset = 0, fetch = 2, sort = "tfidf", summary = 30))
  searcher.close ()
  

The result will look like this:

.. code:: python
  
  {
   'code': 200, 
   'time': 0, 
   'total': 1,
   'result': [
    [
     ['violin sonata in c k.301', 'wolfgang amadeus mozart'], # content
     '<b>violin</b> sonata in c k.301', # auto snippet
     14, 0, 0, 0 # additional info
    ]
   ],   
   'sorted': [None, 0], 
   'regex': 'violin|violins',   
  }
  

Learning and Classification
---------------------------

Here's an example that guesses between 'Play Golf' and 'Go To Bed' based on weather conditions.

.. code:: python

   import wissen
   
   analyzer = wissen.standard_analyzer (max_term = 3000)
   
   # learning
   
   mdl = wissen.model ("./mdl", wissen.CREATE, analyzer)
   learner = mdl.get_learner ()
   
   document = wissen.labeled_document ("Play Golf", "cloudy windy warm")
   learner.add_document (document)  
   document = wissen.labeled_document ("Play Golf", "windy sunny warm")
   learner.add_document (document)  
   document = wissen.labeled_document ("Go To Bed", "cold rainy")
   learner.add_document (document)  
   document = wissen.labeled_document ("Go To Bed", "windy rainy warm")
   learner.add_document (document)   
   learner.close ()
   
   mdl = wissen.model ("./mdl", wissen.MODIFY, analyzer)
   learner = mdl.get_learner ()
   learner.listbydf () # show all terms with DF (Document Frequency)
   learner.close ()
   
   mdl = wissen.model ("./mdl", wissen.MODIFY, analyzer)
   learner = mdl.get_learner ()
   learner.build (dfmin = 2) # build corpus DF >= 2
   learner.close ()
   
   mdl = wissen.model ("./mdl", wissen.MODIFY, analyzer)
   learner = mdl.get_learner ()
   learner.train (
     cl_for = wissen.ALL, # for which classifier
     selector = wissen.CHI2, # feature selecting method
     select = 0.99, # how many features?
     orderby = wissen.MAX, # feature ranking by what?
     dfmin = 2 # exclude DF < 2
   )
   learner.close ()
   
   
   # guessing
   
   mdl = wissen.model ("./mdl", wissen.READ, analyzer)
   classifier = mdl.get_classifier ()
   print (classifier.guess ("rainy cold", cl = wissen.NAIVEBAYES))
   print (classifier.guess ("rainy cold", cl = wissen.FEATUREVOTE))
   print (classifier.guess ("rainy cold", cl = wissen.TFIDF))
   print (classifier.guess ("rainy cold", cl = wissen.SIMILARITY))
   print (classifier.guess ("rainy cold", cl = wissen.ROCCHIO))
   print (classifier.guess ("rainy cold", cl = wissen.MULTIPATH))
   print (classifier.guess ("rainy cold", cl = wissen.META))
   classifier.close ()
   

The result will look like this:

.. code:: python

  {
    'code': 200, 
    'total': 1, 
    'time': 5,
    'result': [('Go To Bed', 1.0)],
    'classifier': 'meta'  
  }
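For intuition, the NAIVEBAYES guess on this toy weather data can be reproduced in a few lines of standard multinomial naive Bayes with Laplace smoothing. This is an independent sketch of the general technique, not Wissen's implementation:

```python
import math
from collections import Counter

# the same toy training data as above
train = [
    ("Play Golf", "cloudy windy warm"),
    ("Play Golf", "windy sunny warm"),
    ("Go To Bed", "cold rainy"),
    ("Go To Bed", "windy rainy warm"),
]

def naive_bayes_guess (text, train):
    labels = {label for label, _ in train}
    vocab = {t for _, doc in train for t in doc.split ()}
    scores = {}
    for label in labels:
        docs = [doc.split () for l, doc in train if l == label]
        tf = Counter (t for doc in docs for t in doc)
        total = sum (tf.values ())
        # log prior + Laplace-smoothed log likelihood of each query term
        score = math.log (len (docs) / len (train))
        for t in text.split ():
            score += math.log ((tf [t] + 1) / (total + len (vocab)))
        scores [label] = score
    return max (scores, key = scores.get)
```

On this data, naive_bayes_guess ("rainy cold", train) picks 'Go To Bed', consistent with the result above.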


Limitations
==============

Before you test Wissen, you should know about some limitations.

- Wissen search cannot sort by string-type fields, but it can sort by int/bit/coord types and by TFIDF ranking. 

- Wissen classification aims at real-time (within 1 second) guessing performance rather than accuracy, so it uses relatively simple and fast classification algorithms. If you need high accuracy, it may not fit your needs.
Hans Roh's avatar
0.11.2  
Hans Roh committed
197 198


Configure Wissen
==================

Configuration is not necessary for indexing/learning, but it is required for searching/guessing, because Wissen allocates memory per thread for searching and classifying at initialization.

.. code:: python

  wissen.configure (
    numthread, 
    logger, 
    io_buf_size = 4096, 
    mem_limit = 256
  )

- numthread: the number of threads that may access Wissen collections and models. If set to 8, you can open multiple collections (or models) and access them with up to 8 threads. If a 9th thread tries to access Wissen, an error is raised

- logger: *see next chapter*

- io_buf_size = 4096: size in bytes of the buffer used for replicating index files

- mem_limit = 256: memory limit per thread. This is not an absolute limit: it can be exceeded during a calculation if needed, but memory is returned as soon as the calculation finishes.


Finally, when your app terminates, call shutdown:

.. code:: python

  wissen.shutdown ()
  

Logger
========

.. code:: python

  from wissen.lib import logger
  
  logger.screen_logger ()
  
  # it will create the file '/var/log/wissen.log', rotated on a daily basis
  logger.rotate_logger ("/var/log", "wissen", "daily")
  

Standard Analyzer
====================

An analyzer is needed for the TEXT and TERM field types.

Basic usage:

.. code:: python

  analyzer = wissen.standard_analyzer (
    max_term = 8, 
    numthread = 1,
    ngram = True or False,
    stem_level = 0, 1 or 2 (2 is only applied to English Language),
    make_lower_case = True or False,
    stopwords_case_sensitive = True or False,
    ngram_no_space = True or False,
    strip_html = True or False,  
    contains_alpha_only = True or False,  
    stopwords = [word,...]
  )

- stem_level: 0 or 1; the 'en' language additionally supports level 2 for hard stemming

- make_lower_case: lower-case all text

- stopwords_case_sensitive: only effective when make_lower_case is False

- ngram_no_space: if False, '泣斬 馬謖' will be tokenized to _泣, 泣斬, 斬\_, _馬, 馬謖, 謖\_. If True, an additional bi-gram 斬馬 is created between 斬\_ and _馬.

- strip_html

- contains_alpha_only: remove terms that contain no alphabetic characters; useful for full-text training in some cases

- stopwords: Wissen ships only an English stopword list; you can supply custom stopwords. Stopwords should be unicode or utf-8 encoded bytes
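The bi-gram options above can be illustrated with a short sketch in plain Python. This is a simplified model of the tokenization described for ngram_no_space, not Wissen's actual analyzer; here the bridging bi-grams are simply appended at the end:

```python
def bigrams (text, ngram_no_space = False):
    # pad each word with '_' to mark word boundaries, then emit character bi-grams
    words = text.split ()
    grams = []
    for w in words:
        padded = "_%s_" % w
        grams.extend (padded [i:i + 2] for i in range (len (padded) - 1))
    if ngram_no_space:
        # extra bi-grams bridging the space between adjacent words
        for a, b in zip (words, words [1:]):
            grams.append (a [-1] + b [0])
    return grams
```

For example, bigrams ("泣斬 馬謖") yields _泣, 泣斬, 斬\_, _馬, 馬謖, 謖\_, and passing ngram_no_space = True adds the bridging bi-gram 斬馬.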

Wissen includes several stemmers and n-gram methods for international languages, which can be used like this:

.. code:: python

  analyzer = standard_analyzer (ngram = True, stem_level = 1)
  col = wissen.collection ("./col", wissen.CREATE, analyzer)
  indexer = col.get_indexer ()
  document.field ("default", song, wissen.TEXT, lang = "en")


Implemented Stemmers
---------------------

Except for the English stemmer, all stemmers were obtained from `IR Multilingual Resources at UniNE`__.

  - ar: Arabic
  - de: German
  - en: English
  - es: Spanish
  - fi: Finnish
  - fr: French
  - hu: Hungarian
  - it: Italian
  - pt: Portuguese
  - sv: Swedish
 
.. __: http://members.unine.ch/jacques.savoy/clef/index.html


Bi-Gram Index
----------------

If ngram is set to True, these languages will be indexed with bi-gram.

  - cn: Chinese
  - ja: Japanese
  - ko: Korean

Also note that a word containing only alphabetic characters is handled by the English stemmer.


Tri-Gram Index
---------------

For other languages, the English stemmer is used when a word is entirely alphabetic, and if ngram is set to True, words containing multibyte characters are indexed with tri-grams.

**Methods Spec**

  - analyzer.index (document, lang)
  - analyzer.freq (document, lang)
  - analyzer.stem (document, lang)
  - analyzer.count_stopwords (document, lang)


Collection
==================

Collection manages index files, segments and properties.

.. code:: python

  col = wissen.collection (
    indexdir = [dirs], 
    mode = [ CREATE | READ | APPEND ], 
    analyzer = None,
    logger = None 
  )

- indexdir: a path, or a list of paths for using multiple disks efficiently
- mode
- analyzer
- logger: not necessary if a logger was already configured via wissen.configure

Collection has 2 major classes: indexer and searcher.



Indexer
---------

For searching documents, the text must first be indexed to build an inverted index for fast term queries.

.. code:: python

  indexer = col.get_indexer (
    max_segments = int,
    force_merge = True or False,
    max_memory = 10000000 (10Mb),
    optimize = True or False
  )

- max_segments: the maximum number of index segments; when exceeded, segments are merged. Note that during indexing up to 3 times max_segments segments may be created; when indexer.close () is called, Wissen automatically merges until the segment count is proper

- force_merge: when indexer.close () is called, forcibly try to merge into a single segment. This fails for very large indexes: > 2 GB on a 32-bit OS, > 10 GB on 64-bit

- max_memory: when exceeded, a new segment is created during indexing

- optimize: when indexer.close () is called, segments are merged down to as near the optimal number as possible


To add a document to the indexer, create a document object:

.. code:: python

  document = wissen.document ()     

Wissen handles the following 3 parts as completely separate objects with no relationship between them:

- returning content
- snippet generating field
- searchable fields


**Returning Content**

Wissen serializes returning content with pickle, so you can set any pickle-serializable object.

.. code:: python

  document.content ({"userid": "hansroh", "preference": {"notification": "email", ...}})
  
  or 
  
  document.content ([32768, "This is sample ..."])


**Snippet Generating Field**  

This field should be unicode/utf8 encoded bytes.

.. code:: python

  document.snippet ("This is sample...")
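A rough model of how an auto snippet can be generated from query terms, wrapping matches in <b> tags and trimming to a summary length (a hypothetical sketch for illustration; Wissen's real snippet generator differs):

```python
import re

def make_snippet (text, terms, summary = 30):
    # wrap each matching query term in <b>..</b>, case-insensitively
    pattern = re.compile ("|".join (re.escape (t) for t in terms), re.I)
    highlighted = pattern.sub (lambda m: "<b>%s</b>" % m.group (0), text)
    # trim to at most `summary` whitespace-separated tokens
    return " ".join (highlighted.split () [:summary])
```

For example, make_snippet ("violin sonata in c k.301", ["violin"]) gives '<b>violin</b> sonata in c k.301', like the auto snippet in the quick-start result.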


**Searchable Fields**

The document also receives searchable fields:

.. code:: python

  document.field (name, value, ftype = wissen.TEXT, lang = "un", encoding = None)
  
  document.field ("default", "violin sonata in c k.301", wissen.TEXT, "en")
  document.field ("composer", "wolfgang amadeus mozart", wissen.TEXT, "en")
  document.field ("lastname", "mozart", wissen.STRING)
  document.field ("birth", 1756, wissen.INT16)
  document.field ("genre", "01011111", wissen.BIT8)
  document.field ("home", "50.665629/8.048906", wissen.COORD6)
  
  
- name: if 'default', this field can be searched with a bare query string; otherwise use 'name:query_text'
- value: unicode/utf8 encoded text, or provide the encoding arg
- ftype: *see below*
- encoding: give an encoding like 'iso8859-1' if value is not unicode/utf8
- lang: language code for standard_analyzer; "un" (unknown) is the default
  
Available field types are:

  - TEXT: analyzable full-text, result-not-sortable
  
  - TERM: analyzable full-text, but position data is not indexed, so phrases can't be searched; result-not-sortable
  
  - STRING: exact string match, e.g. nation codes; result-not-sortable
  
  - LIST: comma-separated STRING, result-not-sortable
  
  - COORDn, n=4,6,8 decimal precision: comma-separated string 'latitude,longitude'; latitude and longitude should be floats in the ranges -90 ~ 90 and -180 ~ 180. n is the precision of the coordinates: n=4 gives 10m radius precision, 6 gives 1m and 8 gives 10cm; result-sortable
  
  - BITn, n=8,16,24,32,40,48,56,64: bitwise operations; a bit-marked string of n bits is required; result-sortable
  
  - INTn, n=8,16,24,32,40,48,56,64: range queries; int required; result-sortable
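COORD fields support radius searches in the query syntax (e.g. home:50.6656,8.0489~10000). The underlying great-circle distance between two coordinates can be computed with the haversine formula; a minimal sketch of that calculation (illustrative only, not Wissen's internal code):

```python
import math

def haversine_m (lat1, lon1, lat2, lon2):
    # great-circle distance in meters between two latitude/longitude points
    r = 6371000.0  # mean earth radius in meters
    p1, p2 = math.radians (lat1), math.radians (lat2)
    dp = math.radians (lat2 - lat1)
    dl = math.radians (lon2 - lon1)
    a = math.sin (dp / 2) ** 2 + math.cos (p1) * math.cos (p2) * math.sin (dl / 2) ** 2
    return 2 * r * math.asin (math.sqrt (a))
```

Conceptually, a document matches home:50.6656,8.0489~10000 when haversine_m (50.6656, 8.0489, doc_lat, doc_lon) <= 10000.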


Repeat add_document as needed, then close the indexer.

.. code:: python

  for ...:  
    document = wissen.document ()
    ...
    indexer.add_document (document) 
  indexer.close ()  

If searchers using this collection run in another process or thread, they are automatically reloaded within a few seconds to apply the changed index.


Searcher
---------

To run a searcher, call wissen.configure () first, then create the searcher.

.. code:: python
  
  searcher = col.get_searcher (
    max_result = 2000,
    num_query_cache = 200
  ) 
  
- max_result: maximum number of search results returned; default 2000, and 0 means unlimited

- num_query_cache: default 200; when exceeded, the least recently accessed entries are removed


Query is simple:

.. code:: python

  searcher.query (
    qs, 
    offset = 0, 
    fetch = 10, 
    sort = "tfidf", 
    summary = 30, 
    lang = "un"
  )
  
- qs: string (unicode) or utf8 encoded bytes; for detailed query syntax, see below
- offset: return start position of result records
- fetch: number of records from offset
- sort: "(+-)tfidf" or "(+-)field name"; the field should be an int/bit type. '-' means descending (highest score/value first) and is the default if not specified. If sort is "", records are returned in reverse indexing order
- summary: number of terms for snippet
- lang: default is "un" (unknown)


For deleting indexed document:

.. code:: python

  searcher.delete (qs)

All matching documents are deleted immediately. If searchers using this collection run in another process or thread, they are automatically reloaded within a few seconds.

Finally, close searcher.

.. code:: python

  searcher.close ()


**Query Syntax**

  - violin composer:mozart birth:1700~1800 
  
    search 'violin' in the default field, 'mozart' in the composer field, and the range 1700 ~ 1800 in the birth field
    
  - violin allcomposer:wolfgang mozart
  
Hans Roh's avatar
0.11.6  
Hans Roh committed
527 528 529
    search 'violin' in the default field; any terms after allcomposer are searched in the composer field
    
  - violin -sonata birth:~1800
  
    must not contain 'sonata' in the default field
  
  - violin -composer:mozart
  
    must not contain 'mozart' in the composer field
  
  - violin or piano genre:00001101/all
  
    matches when the 5th, 6th and 8th bits are all 1; /any and /none are also available  
    
  - violin or ((piano composer:mozart) genre:00001101/any)
  
    unlimited nesting with '()' and 'or' operators is supported
  
  - (violin or ((allcomposer:mozart wolfgang) -amadeus)) sonata (genre:00001101/none home:50.6656,8.0489~10000)
  
    search for a home location within 10 km of coordinate (50.6656, 8.0489)
  
  - "violin sonata" genre:00001101/none home:50.6656/8.0489~10
  
    search the exact phrase "violin sonata"
  
  - "violin^3 piano" -composer:"ludwig van beethoven"

    search the loose phrase "violin piano" within 3 terms
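The /all, /any and /none bit queries above can be modeled with plain integer bit operations. An illustrative sketch, treating the bit-marked strings as MSB-first binary numbers (not Wissen's internal representation):

```python
def bit_match (field, query, mode):
    # field and query are bit-marked strings such as "01011111"
    f, q = int (field, 2), int (query, 2)
    if mode == "all":   # every 1-bit of the query is also set in the field
        return f & q == q
    if mode == "any":   # at least one query bit is set in the field
        return f & q != 0
    if mode == "none":  # no query bit is set in the field
        return f & q == 0
    raise ValueError (mode)
```

For the quick-start document (genre '01011111'), bit_match ('01011111', '00001101', 'all') is True.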

Model
=============

Model manages index, train files, segments and properties.

.. code:: python

  mdl = wissen.model (
    indexdir = [dirs],
    mode = [ CREATE | READ | MODIFY | APPEND ], 
    analyzer = None, 
    logger = None
  )


Learner
---------

Building a model in Wissen takes 3 steps:

- Step I. Index documents to learn
- Step II. Build Corpus
- Step III. Selecting features and save trained model

**Step I. Index documents** 

The learner uses wissen.labeled_document, not wissen.document, and you can add additional searchable fields if you need them. The label is the name of the category.

.. code:: python
  
  learner = mdl.get_learner ()
  for label, document in trainset:
  
    labeled_document = wissen.labeled_document (label, document)
    # additional searchable fields if you need them
    labeled_document.field (name, value, ftype = TEXT, lang = "un", encoding = None)    
    learner.add_document (labeled_document)
	  	  
  learner.close ()


**Step II. Building Corpus** 

Document Frequency (DF) is one of the major factors for a classifier. Low-DF terms are important for searching but not for a classifier. An important part of learning is selecting valuable terms, and very low-DF terms are not helpful for classifying new documents, because they are unlikely to appear in them.

So for learning/classification efficiency, it's useful to eliminate terms with too low or too high DF. For example, assume you index 30,000 web pages for learning and end up with about 100,000 terms. If you build the corpus with all terms, learning takes a long time; but if you remove terms with DF < 10 or DF > 7000, 75% - 80% of all terms are removed.

.. code:: python  
  
  # reopen model with MODIFY
  mdl = wissen.model (indexdir, MODIFY)
  learner = mdl.get_learner ()
  
  # show terms ordered by DF for examination
  learner.listbydf (dfmin = 10, dfmax = 7000)
  
  # build corpus and save
  learner.build (dfmin = 10, dfmax = 7000)
  
As a result, the corpus is built with about 25,000 terms. The time this takes scales with the number of terms.
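The DF window described in this step amounts to counting, per term, how many documents contain it, then keeping only the terms inside the window. A sketch on the toy weather documents (hypothetical data, not Wissen's API):

```python
from collections import Counter

def prune_by_df (docs, dfmin, dfmax):
    # df [t] = number of documents containing term t
    df = Counter ()
    for doc in docs:
        df.update (set (doc.split ()))
    return {t for t, n in df.items () if dfmin <= n <= dfmax}

docs = ["cloudy windy warm", "windy sunny warm", "cold rainy", "windy rainy warm"]
```

Here prune_by_df (docs, 2, 3) keeps only 'windy', 'warm' and 'rainy'; singleton terms like 'cloudy' fall below the window.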


**Step III. Feature Selecting and Saving Model** 

Features are the most valuable terms for classifying new documents. It is important to understand that neither too many nor too few features gives the best result; selecting good features may be the most important part of classification.

For example, my work classifying URLs into 2 classes showed the results below. The classifier is NAIVEBAYES, the selector is GSS and the minimum DF is 2. The train set is 20,000 documents and the test set is 2,000.

  - features 3,000 => 82.9% matched, 73 documents unclassified
  - features 2,000 => 82.9% matched, 73 documents unclassified
  - features 1,500 => 83.4% matched, 75 documents unclassified
  - features 1,000 => 83.6% matched, 79 documents unclassified
  - features   500 => 83.1% matched, 86 documents unclassified
  - features   200 => 81.1% matched, 108 documents unclassified
  - features    50 => 76.0% matched, 155 documents unclassified
  - features    10 => 58.7% matched, 326 documents unclassified

The results show that going above 2,000 or below 1,000 features leaves classification quality unchanged or degrades it. For most classifiers, too few features increase the unclassified ratio; for NAIVEBAYES in particular, too many features also increase the unclassified ratio because of the way it calculates.

.. code:: python  
  
  mdl = wissen.model (indexdir, MODIFY)
  learner = mdl.get_learner ()
  
  learner.train (
    cl_for = [
      ALL (default) | NAIVEBAYES | FEATUREVOTE | 
      TFIDF | SIMILARITY | ROCCHIO | MULTIPATH
    ],
    select = number of features if value is > 1 or ratio,
    selector = [
      CHI2 | GSS | DF | NGL | MI | TFIDF | IG | OR | 
      OR4P | RS | LOR | COS | PPHI | YULE | RMI
    ],
    orderby = [SUM | MAX | AVG],
    dfmin = 0, 
    dfmax = 0
  )
  learner.close ()
  
- cl_for: train for which classifier, if not specified this features used default for every classifiers haven't own feature set. So train () can be called repeatly for each classifiers

- select: number of features if the value is > 1, otherwise a ratio of all terms. Generally, no more than 7,000 features should be needed for classifying web pages or news articles into 20 classes

- selector: mathematical term-scoring algorithm for selecting features, considering the relations between terms, or between a term and a label, as well as document frequency (DF), term frequency (TF), etc.

- orderby: final scoring method, one of sum, max, or average

- dfmin, dfmax: although low/high-DF terms have already been removed by build (), additional terms can be removed here for an optimal result with a specific classifier
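
The interplay of select, dfmin, and dfmax can be illustrated with a small self-contained sketch (this is not wissen's internal code; the function and the sample data are hypothetical):

```python
# Illustrative sketch: interpret `select` as an absolute count (> 1)
# or as a ratio of all terms (<= 1), after applying a DF window.

def select_features(scored_terms, select, dfmin=0, dfmax=0):
    """scored_terms: list of (term, score, df) tuples, in any order."""
    # drop terms outside the document-frequency window first
    kept = [t for t in scored_terms
            if t[2] >= dfmin and (dfmax == 0 or t[2] <= dfmax)]
    # interpret `select` as an absolute count or a ratio of the remainder
    n = int(select) if select > 1 else max(1, int(len(kept) * select))
    kept.sort(key=lambda t: t[1], reverse=True)  # highest score first
    return [term for term, score, df in kept[:n]]

terms = [("spam", 9.1, 120), ("viagra", 8.7, 45),
         ("the", 0.2, 9000), ("offer", 5.5, 300)]
print(select_features(terms, 2, dfmin=30, dfmax=7000))    # top 2 by score
print(select_features(terms, 0.5, dfmin=30, dfmax=7000))  # top 50% of remaining terms
```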


To remove the training data for a specific classifier:

.. code:: python  
  
  mdl = wissen.model (indexdir, MODIFY)
  learner = mdl.get_learner ()
  
  learner.untrain (cl_for)
  learner.close ()


**Finding Best Training Options**

Generally, since data sets differ in their attributes, it is hard to say which options are best. Repeating the cycle of train () and guess () a number of times is strongly recommended for the best result, and it is not an easy process:

- index ()
- build ()
- train (initial options)
- measure results with guess ()
- append additional documents and call build () again if needed
- train (another options)
- measure results again with guess ()
- ...
- find the optimal training options for your data set

To measure accuracy, your prepared data should be split into a training set for train () and a test set for guess (), measured with metrics like `precision and recall`_.
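
Computing precision and recall from guess () results is straightforward; here is a minimal self-contained example (the labels and predictions below are made up, with None standing for an unclassified document):

```python
# Precision/recall for one class over a held-out test set, where some
# documents may come back unclassified (predicted as None).

def precision_recall(predictions, truths, label):
    tp = sum(1 for p, t in zip(predictions, truths) if p == label and t == label)
    fp = sum(1 for p, t in zip(predictions, truths) if p == label and t != label)
    fn = sum(1 for p, t in zip(predictions, truths) if p != label and t == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

preds  = ["spam", "spam", "ham", None, "ham"]   # None = unclassified
truths = ["spam", "ham",  "ham", "spam", "ham"]
print(precision_recall(preds, truths, "spam"))  # (0.5, 0.5)
```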

For example, I had 27,000 web pages as a training set and 2,700 as a test set for classifying pages as spam or not. The total number of indexed terms was 199,183; I eliminated 94% of them with DF < 30 or DF > 7000, leaving only 10,221 terms.

- F: features selected by OR (Odds Ratio) with MAX
- NB: NAIVEBAYES, RO: ROCCHIO
- Numbers mean: matched % ratio excluding unclassified (number of unclassified documents)

  - F 7,000: NB 97.2 (1,100), RO 95.4 (50)
  - F 5,000: NB 97.4 (493), RO 94.8 (69) 
  - F 4,000: NB 96.6 (282), RO 91.6 (96)
  - F 3,000: NB 93.2 (214), RO 86.2 (151)
  - F 2,000: NB 89.4 (293), RO 80.1 (281)

Which would you choose? In my case, I chose F 5,000 with ROCCHIO because of its low unclassified ratio. But if speed were more important, I might choose F 3,000 with NAIVEBAYES.
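
To see what is being traded off, the F 5,000 NAIVEBAYES row above can be unpacked with a bit of arithmetic: the 97.4% accuracy applies only to the documents that were classified at all.

```python
# Unpacking "NB 97.4 (493)" at F 5,000 over the 2,700-document test set:
# how much of the test set was classified, and the accuracy over all docs.

test_docs    = 2700
unclassified = 493          # NB at F 5,000
matched_pct  = 97.4         # matched % excluding unclassified

classified = test_docs - unclassified
coverage   = classified / test_docs * 100          # % of docs classified
overall    = matched_pct * classified / test_docs  # accuracy over all docs

print(round(coverage, 1))   # 81.7
print(round(overall, 1))    # 79.6
```

So NB's headline accuracy is higher, but ROCCHIO's 94.8 with only 69 unclassified covers far more of the test set, which is why it won here.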

Once everything is done and you have found the optimal parameters, you can optimize the classifier model.

.. code:: python

  mdl = wissen.model (indexdir, wissen.MODIFY, an)
  learner = mdl.get_learner ()
  learner.optimize ()
  learner.close ()

Note that once optimize () has been called,

- you cannot add additional training documents
- you cannot rebuild corpus by calling build () again
- but you can still call train () any time

The reason is that when low/high-DF terms are eliminated by optimize (), the related index files are also shrunk unrecoverably for performance. If those operations are needed later, you should start over from the indexing step.

Skipping optimize () makes the SIMILARITY and ROCCHIO classifiers inefficient (but it does NOT influence the NAIVEBAYES, TFIDF, and FEATUREVOTE classifiers). If regular retraining is more important to you than speed, you should not optimize.

.. _`precision and recall`: https://en.wikipedia.org/wiki/Precision_and_recall


**Feature Selecting Methods**

  - CHI2 = Chi Square Statistic
  - GSS = GSS Coefficient 
  - DF = Document Frequency
  - CF = Category Frequency
  - NGL = NGL Coefficient
  - MI = Mutual Information
  - TFIDF = Term Frequency - Inverse Document Frequency
  - IG = Information Gain
  - OR = Odds Ratio
  - OR4P = a variant of Odds Ratio
  - RS = Relevancy Score
  - LOR = Log Odds Ratio
  - COS = Cosine Similarity 
  - PPHI = Pearson's PHI
  - YULE = Yule
  - RMI = Residual Mutual Information
  
I personally prefer the OR, IG, and GSS selectors with the MAX method.
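
As a rough illustration of what such selectors compute, here is a toy version of CHI2 and a smoothed log Odds Ratio, scored from a 2x2 term/class contingency table (not wissen's implementation; the counts are invented):

```python
import math

# a = docs in the class containing the term, b = docs outside the class
# containing the term, c/d = the same counts for docs without the term.

def chi2(a, b, c, d):
    """Chi-square statistic for one (term, class) pair."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def odds_ratio(a, b, c, d):
    """Log odds ratio, smoothed to avoid division by zero."""
    return math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))

# term appears in 80 of 100 spam docs but only 10 of 100 ham docs:
# both scores are high, so the term is a strong feature
print(round(chi2(80, 10, 20, 90), 1))        # 99.0
print(round(odds_ratio(80, 10, 20, 90), 2))  # 3.52
```

With orderby, a term scored this way against each class would then be combined across classes by SUM, MAX, or AVG before the top terms are kept.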


Classifier
------------
  
Finally,

.. code:: python  
  
  classifier = mdl.get_classifier ()
  classifier.guess (
    qs, 
    lang = "un", 
    cl = [ 
      NAIVEBAYES (Default) | FEATUREVOTE | ROCCHIO | 
      TFIDF | SIMILARITY | META | MULTIPATH
    ],
    top = 0,
    cond = ""
  )
  
  classifier.cluster (
    qs, 
    lang = "un"    
  )
  
  classifier.close ()
  
- qs: full text stream to classify

- lang: language of qs; default is "un"

- cl: which classifier to use; NAIVEBAYES is the default

- top: how many of the highest-scored classification results to return; the default 0 means only the top-scored result(s)

- cond: conditional document-selecting query. Some classifiers, such as ROCCHIO and SIMILARITY, calculate over lots of documents, so cond is useful for shrinking the number of documents. This only works when you have added extra searchable fields using labeled_document.field (...).

**Implemented Classifiers**

  - NAIVEBAYES: Naive Bayes probability, the default for guessing
  - FEATUREVOTE: Feature Voting Classifier
  - ROCCHIO: Rocchio Classifier
  - TFIDF: Max TFIDF Score
  - SIMILARITY: Max Cosine Similarity
  - MULTIPATH: Experimental Multi-Path Classifier; the terms of the document being classified are clustered into multiple sets by co-word frequency before guessing
  - META: merges and decides from the multiple results guessed by the NAIVEBAYES, FEATUREVOTE, and ROCCHIO classifiers

If you need speed above all, NAIVEBAYES is a good choice. It is an old technique, but it still performs very well in both speed and accuracy given a proper training set.

For more detail on each classifier algorithm, please search the web.
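
To give a flavor of what NAIVEBAYES does, here is a bare-bones multinomial Naive Bayes guesser in plain Python (a sketch only; wissen's real implementation differs):

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial Naive Bayes: log prior + Laplace-smoothed
# log likelihood of each term, highest-scoring class wins.

class TinyNB:
    def __init__(self):
        self.docs = defaultdict(int)       # class -> document count
        self.words = defaultdict(Counter)  # class -> term counts
        self.vocab = set()

    def train(self, label, terms):
        self.docs[label] += 1
        self.words[label].update(terms)
        self.vocab.update(terms)

    def guess(self, terms):
        total = sum(self.docs.values())
        scores = {}
        for label in self.docs:
            score = math.log(self.docs[label] / total)  # log prior
            denom = sum(self.words[label].values()) + len(self.vocab)
            for t in terms:  # Laplace-smoothed term likelihoods
                score += math.log((self.words[label][t] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

nb = TinyNB()
nb.train("spam", ["cheap", "offer", "now"])
nb.train("ham", ["meeting", "notes", "offer"])
print(nb.guess(["cheap", "offer"]))  # spam
```

This also hints at why too many features hurt NAIVEBAYES: every extra term multiplies in another (often tiny) likelihood, dragging all class scores down together.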


**Optimizing Each Classifier**

To give detailed options to a classifier, use setopt (classifier, option_name = option_value, ...).


.. code:: python  

  classifier = mdl.get_classifier ()
  classifier.setopt (wissen.ROCCHIO, topdoc = 200)
  
The SIMILARITY and ROCCHIO classifiers basically have to compare against every indexed document, but Wissen can compare against a subset of documents selected via the 'topdoc' option. Those documents are selected by high TFIDF score, for classification speed. The default topdoc value is 100. If you set it to 0, Wissen compares against every document that contains at least one of the features. In my experience, however, there is no critical difference except speed.
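
The idea behind ROCCHIO with cosine similarity can be sketched in a few lines (illustrative only; wissen's topdoc-limited version is more involved, and the vectors below are invented):

```python
import math

# Rocchio sketch: each class is a centroid (average of its document
# vectors); a new document goes to the nearest centroid by cosine
# similarity. Vectors are sparse dicts of term -> weight.

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    c = {}
    for d in docs:
        for term, w in d.items():
            c[term] = c.get(term, 0.0) + w / len(docs)
    return c

classes = {
    "music":  centroid([{"mozart": 1.0, "sonata": 1.0},
                        {"mozart": 1.0, "opera": 1.0}]),
    "sports": centroid([{"goal": 1.0, "match": 1.0}]),
}
query = {"mozart": 1.0, "sonata": 0.5}
print(max(classes, key=lambda c: cosine(query, classes[c])))  # music
```

The topdoc option effectively limits how many documents feed each class comparison instead of averaging over everything, trading a little accuracy for speed.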

Currently available options are:

* ALL

  - verbose = False

* ROCCHIO

  - topdoc = 100

* MULTIPATH

  + subcl = [ FEATUREVOTE (default) | NAIVEBAYES | ROCCHIO ]
  + scoreby = [ IG (default) | MI | OR | R ]
  + choiceby = [ AVG (default) | MIN ]: when scoring a term against each term in a cluster, which value to use
  + threshold = 1.0: float threshold for creating a new cluster; it is measured with Information Gain, and the useful range varies somewhat with the number of training documents


Document Cluster
-----------------

TODO

.. code:: python  

  cluster = mdl.get_dcluster ()
  

Term Cluster
-------------

TODO

.. code:: python  

  cluster = mdl.get_tcluster ()
  

Handling Multiple Searchers & Classifiers
===========================================

In case of creating multiple searchers and classifiers, wissen.task might be useful.
Here's a script named 'config.py'

.. code:: python

  import wissen
  from wissen.lib import logger
  
  def start_wissen (numthreads, logger):    
    wissen.configure (numthreads, logger)
        
    analyzer = wissen.standard_analyzer ()
    col = wissen.collection ("./data1", wissen.READ, analyzer)
    wissen.assign ("data1", col.get_searcher (max_result = 2000))
    
    analyzer = wissen.standard_analyzer (max_term = 1000, stem = 2)
    mdl = wissen.model ("./data2", wissen.READ, analyzer)
    wissen.assign ("data2", mdl.get_classifier ())
  
The first argument of assign () is an alias for the searcher or classifier.

Once config.start_wissen () has been called in any script, you can simply import wissen and use it in other Python scripts.

.. code:: python

  import wissen
  
  wissen.query ("data1", "mozart sonatas")
  wissen.guess ("data2", "mozart sonatas")
  
  # close and resign  
  wissen.close ("data1")
  wissen.resign ("data1")


At the end of your app, call wissen.shutdown ().
  
.. code:: python

  import wissen
  
  wissen.shutdown ()


API Export Using Skitai
=========================

**New in version 0.12.14**

You can use RESTful API with `Skitai-Saddle`_.

Copy and save the code below as app.py.

.. code:: python
  
  import os
  import wissen
  import skitai	
  
  pref = skitai.pref ()
  pref.use_reloader = 1
  pref.debug = 1
  
  config = pref.config
  config.sched = "0/15 * * * *"
  config.enable_mirror = False
  config.remote = "http://192.168.1.100:5000"
  config.local = "http://127.0.0.1:5000"
  
  config.enable_index = False
  config.resource_dir = skitai.joinpath ('resources')
  
  config.logpath = os.name == "posix" and '/var/log/assai' or None
  
  skitai.mount ("/v1", wissen, "app", pref)
  skitai.run (	
  	port = 5000,
  	logpath = config.logpath
  )

And run:

.. code:: bash

  python app.py -v

Here's an example of a client-side indexer script using the API.

.. code:: python

  colopt = {
    'data_dir': [
    	'models/0/books',
    	'models/1/books',
    	'models/2/books'
    ],
    'analyzer': {
    	"ngram": 0,
    	"stem_level": 1,						
    	"strip_html": 0,
    	"make_lower_case": 1		
    },
    'indexer': {
    	'force_merge': 0,
    	'optimize': 1, 
    	'max_memory': 10000000,
    	'max_segments': 10,
    },	
    'searcher': {
      'max_result': 2000,
      'num_query_cache': 200
    }
  }	
  
  import requests    
  session = requests.Session ()
  
  # check current collections
  r = session.get ('http://127.0.0.1:5000/v1/').json ()
  if 'books' not in r ["collections"]:  
    # collection does not exist, so create it
    session.post ('http://127.0.0.1:5000/v1/books', json = colopt)
  
  dbc = db.connect (...)
  cursor = dbc.cursor ()
  cursor.execute (...)  
  
  numdoc = 0
  while 1:
    row = cursor.fetchone ()
    if not row: break
    doc = wissen.document (row._id)
    doc.content ({"author": row.author, "title": row.title , "abstract": row.abstract})
    doc.snippet (row.abstract)
    doc.field ('default', "%s %s" % (row.title, row.abstract), wissen.TEXT, 'en')
    doc.field ('title', row.title, wissen.TEXT, 'en')
    doc.field ('author', row.author, wissen.STRING)
    doc.field ('isbn', row.isbn, wissen.STRING)
    doc.field ('year', row.year, wissen.INT16) 
       
    session.post ('http://127.0.0.1:5000/v1/books/documents', doc.as_json ())
    numdoc += 1
    if numdoc % 1000 == 0:
    	session.get ('http://127.0.0.1:5000/v1/books/commit')
  
  cursor.close ()
  dbc.close ()

APIs are:

.. code:: python
  
  # add new collection with options
  session.post ('http://127.0.0.1:5000/v1/books', colopt)
  # get collection status and options
  session.get ('http://127.0.0.1:5000/v1/books')
  # modify collection options
  session.patch ('http://127.0.0.1:5000/v1/books', colopt)
  # remove collection with all index files
  session.delete ('http://127.0.0.1:5000/v1/books')
  
  # get collection locks
  session.get ('http://127.0.0.1:5000/v1/books/locks')
  # create 'custom' lock
  session.post ('http://127.0.0.1:5000/v1/books/locks/custom')
  # delete 'custom' lock
  session.delete ('http://127.0.0.1:5000/v1/books/locks/custom')
  
  
  # add new document
  session.post (
    'http://127.0.0.1:5000/v1/books/documents', 
    doc.as_json ()
  )
  # modify document
  # modify document
  session.patch (
    'http://127.0.0.1:5000/v1/books/documents/' + row._id, 
    doc.as_json ()
  )
  # delete document by document_id
  session.delete ('http://127.0.0.1:5000/v1/books/documents/' + row._id)
  
  # search
  session.get ('http://127.0.0.1:5000/v1/books/search?q=title:book')
  # guess
  session.get ('http://127.0.0.1:5000/v1/books/guess?q=title:book')
  # delete documents by search
  session.delete ('http://127.0.0.1:5000/v1/books/search?q=title:book')
  
  # commit
  session.get ('http://127.0.0.1:5000/v1/books/commit')
  # rollback  
  session.get ('http://127.0.0.1:5000/v1/books/rollback')  
  
For more details about the API, see `app.py`_.
     
.. _`Skitai-Saddle`: https://pypi.python.org/pypi/skitai
.. _`app.py`: https://gitlab.com/hansroh/wissen/blob/master/wissen/export/skitai/app.py


Links
======

- `GitLab Repository`_
- Bug Report: `GitLab issues`_

.. _`GitLab Repository`: https://gitlab.com/hansroh/wissen
.. _`GitLab issues`: https://gitlab.com/hansroh/wissen/issues



Change Log
============
  
  0.13
  
  - fix file creation mode on posix
  - fix using lock with multiple workers
  - change wissen.document method names
  - fix index queue file locking
  
  0.12 
1081
  
  - add biword arg to standard_analyzer
  - change export package name from appack to package
  - add Skito-Saddle app
  - fix analyzer.count_stopwords return value
  - change development status to Alpha
  - add wissen.assign(alias, searcher/classifier) and query(alias), guess(alias)
  - fix threads count and memory allocation
  - add example for Skitai-Saddle app to manual
  
  0.11 
  
  - fix HTML strip and segment merging etc.
  - add MULTIPATH classifier
  - add learner.optimize ()
  - make learner.build & learner.train efficient
  
  0.10 - change version format, remove all str*_s ()
  
  0.9 - support Python 3.x

  0.8 - change license from BSD to GPL V3