DOCUMENTATION: GITST Web Crawler, Spider, Web Contact Extractor
GITST Web Crawler is not limited to email extraction: its functionality can be configured to the user's requirements, and with Java regular expressions you can extract any kind of information from the web.
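For example, email extraction of this kind can be done with a plain Java regex. The sketch below is independent of the program itself and uses a deliberately simplified pattern (real-world addresses can be more complex):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // A common, simplified email pattern; real addresses can be more complex.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every email-like match found in the given page text.
    public static List<String> extract(String pageText) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Contact us at info@example.com or sales@example.org</p>";
        System.out.println(extract(html));
    }
}
```

The same approach works for phone numbers, addresses, or any other pattern you can express as a Java regex.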
The MySQL version of GITST Web Extractor is available at https://sourceforge.net/projects/web-spider-web-crawler-extract/?source=directory
Download and install the program. Java must be installed on your computer before using the program. After opening the program you will see several buttons, each with a specific function.
Add crawler – adds a new crawler to parse the web. You can run multiple crawlers with different tasks, limited only by your computer's hardware.
Start Selected Crawler – starts all selected crawlers in the list.
Edit selected crawler – changes the parameters of the selected crawler.
Remove selected crawler – removes the crawler and its data from your computer; the data cannot be recovered.
An important component is the search condition, which defines what information to search for and where to search for it.
Functions of the buttons
- Start – saves the configuration and starts the crawler immediately
- Add Condition – adds a new condition to the condition list below; the new condition then needs to be configured
- Remove Selected – removes the selected conditions from the condition list
- Save – only saves the crawler to the crawler list in the main window (first image); the crawler must then be started manually.
- Save to file – saves the conditions to an XML file, which can later be reloaded by the program or sent to other users.
- Load from file – loads conditions from a file previously saved with the Save to file button.
- Cancel – closes the condition editor without saving.
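For illustration, a saved conditions file might look roughly like the sketch below. The element names and attributes here are hypothetical assumptions, not the program's actual schema; save a file once with Save to file and inspect it rather than writing one by hand.

```xml
<!-- Hypothetical structure; inspect a file saved by the program
     for the actual element names. -->
<crawler name="contacts">
  <conditions>
    <condition type="Link" action="include">http://example.com/</condition>
    <condition type="TEXT" action="include">[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}</condition>
  </conditions>
</crawler>
```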
Values of fields
- Crawler name – names the new crawler task. This unique name is displayed in the crawler list (shown under the “Start” button).
- Root depth – one of the most important configuration values. Roots are the sites (domains, URLs) added to the crawler's configuration list, from which the parsing process starts. Root depth is how many times the crawler follows links found within a root domain, i.e. how deep it goes inside that domain.
- Link depth – links are URLs and domains that differ from the root domains and are discovered during parsing. When the parser follows a link onto a different domain, the change of domain is called a hop. Link depth is how deep the crawler parses within these non-root domains.
- Thread count – the number of threads (processes) that perform the crawler's tasks simultaneously.
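The Thread count setting corresponds to a standard worker-pool pattern. A rough, program-independent sketch using Java's ExecutorService (the fetch work here is simulated, not a real HTTP request):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CrawlerPool {
    // Runs one simulated fetch task per URL on a fixed-size thread pool,
    // mirroring the "Thread count" setting.
    static List<String> fetchAll(List<String> urls, int threadCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        List<Callable<String>> tasks = new ArrayList<>();
        for (String u : urls) tasks.add(() -> "fetched " + u);
        List<String> out = new ArrayList<>();
        // invokeAll runs tasks concurrently but returns futures in input order.
        for (Future<String> f : pool.invokeAll(tasks)) out.add(f.get());
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchAll(List.of("http://a.example", "http://b.example"), 4));
    }
}
```

More threads means more pages processed in parallel, at the cost of CPU, memory, and bandwidth, which is why the practical limit depends on your hardware.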
By default, duplicate links are ignored. If you choose not to ignore duplicates, the values found in the search may be repeated.
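A minimal sketch of how root depth, link depth, and duplicate-link skipping can interact, using a hypothetical in-memory link graph instead of real web pages (the actual crawler's internal logic may differ):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedCrawl {
    // Hypothetical in-memory link graph standing in for real web pages.
    static final Map<String, List<String>> LINKS = Map.of(
        "http://root.example/",  List.of("http://root.example/a", "http://other.example/"),
        "http://root.example/a", List.of("http://root.example/b"),
        "http://other.example/", List.of("http://other.example/x"));

    // Breadth-first crawl: pages on the root domain are expanded up to
    // rootDepth hops, pages on other domains up to linkDepth hops, and
    // already-visited links (duplicates) are skipped.
    static Set<String> crawl(String root, int rootDepth, int linkDepth) {
        String rootHost = root.split("/")[2];
        Set<String> visited = new LinkedHashSet<>();
        Deque<Object[]> queue = new ArrayDeque<>();
        queue.add(new Object[]{root, 0});
        while (!queue.isEmpty()) {
            Object[] entry = queue.poll();
            String url = (String) entry[0];
            int depth = (Integer) entry[1];
            if (!visited.add(url)) continue;              // ignore duplicate links
            int limit = url.split("/")[2].equals(rootHost) ? rootDepth : linkDepth;
            if (depth >= limit) continue;                 // depth budget exhausted
            for (String next : LINKS.getOrDefault(url, List.of()))
                queue.add(new Object[]{next, depth + 1});
        }
        return visited;
    }

    public static void main(String[] args) {
        // With root depth 2 and link depth 0, the foreign domain is recorded
        // but its own links are never followed.
        System.out.println(crawl("http://root.example/", 2, 0));
    }
}
```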
You can add new conditions to the condition list; each condition requires some extra values. On the left side, select the type of the value. The possible types are:
- Link – a root link from which parsing should start; the URL must be entered in the values field
- SEW – search engine words: the words you are searching for in a search engine
- TEXT – a Java regex describing the content you are searching for on the web.
On the right side, select the action to take: Exclude removes matching values from the search, while Include adds them to the search.
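How include and exclude conditions might combine for a TEXT regex can be sketched as follows; the patterns and the filtering rule here are illustrative assumptions, not the program's exact behavior:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IncludeExclude {
    // Keeps matches of the include pattern that do not match the exclude pattern.
    static List<String> filter(String text, Pattern include, Pattern exclude) {
        List<String> kept = new ArrayList<>();
        Matcher m = include.matcher(text);
        while (m.find()) {
            if (!exclude.matcher(m.group()).matches()) kept.add(m.group());
        }
        return kept;
    }

    public static void main(String[] args) {
        Pattern include = Pattern.compile("\\S+@\\S+");       // any email-like token
        Pattern exclude = Pattern.compile(".*@spam\\.test");  // drop this domain
        System.out.println(filter("info@example.com admin@spam.test", include, exclude));
    }
}
```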
An example configuration file is available for download and loading into your crawler, along with a video showing how the software works.
After creating a crawler, clicking on it reveals extra buttons.
Export – exports the extracted values from the database to an Excel file. Note that a very large number of values cannot be exported this way, since Excel files have a row limit (1,048,576 rows per sheet in .xlsx); in that case, extract the values directly from the local embedded database.
Start – starts the crawler. While the crawler is running, the button is shown as Stop crawler and can be used to stop the process.
Double-clicking on a crawler opens a detailed panel where you can click on links and see where each value was parsed from.