Abstract— The cost of acquiring training data instances for inducing machine learning models is one of the main concerns in real-world problems. The web is a comprehensive source of many types of data that can be used for machine learning tasks, but its distributed and dynamic nature requires solutions able to handle these characteristics. In this paper, we introduce an automatic method for topical data acquisition from the web. We propose a novel type of topical crawler that uses a hybrid link context extraction method to acquire on-topic web pages with minimum bandwidth usage and at the lowest cost. The new link context extraction method, called Block Text Window (BTW), combines the text window method with a block-based method and overcomes the challenges of each by exploiting the advantages of the other. Experimental results based on standard metrics show the superiority of BTW over other automatic topical web data acquisition methods.
Keywords— cost-sensitive learning, automatic web data acquisition, topical crawlers, link context.
Real-world machine learning problems pose different challenges at each stage of their process, and various types of cost are associated with each step of a solution from start to end. Utility- or cost-based machine learning tries to account for these distinct costs and to compare learning methods using fairer metrics. This approach considers three main steps, especially for the classification task, each with an associated cost: data acquisition, model induction, and application of the induced model to classify new data. The cost of data acquisition has been more neglected than the others in much cost-sensitive machine learning and classification research. We consider the cost of data acquisition from the web in terms of efficient use of the bandwidth available to topical crawlers. The web is one of the most comprehensive sources of information for many machine learning tasks such as classification and clustering. It contains many types of data, including text, images, and other multimedia. However, to acquire these data from the big, distributed, heterogeneous, and dynamic web, we need methods that automatically surf web pages with efficient use of available bandwidth and collect the desired data for predefined target topics. Topical web crawlers are effective tools to cope with this challenge. They begin from a set of start pages, called seed pages, extract the links of these pages, and assign scores to these links based on the usefulness of following them to reach on-topic pages.
The main issue in the design of topical web crawlers is enabling them to predict the relevancy of the pages the current links will lead to. One of the best sources of information for conducting topical crawlers is the link context of hyperlinks. The context of a hyperlink, or link context, is defined as the terms that appear in the text around a hyperlink within a web page. The challenging question in link context extraction is how the area around a hyperlink can be determined. A human can easily recognize the area around a hyperlink that forms its link context, but this is not an easy task for a topical crawler. In this paper we propose Block Text Window (BTW), a hybrid link context extraction method for topical web crawling. It utilizes the Vision-Based Page Segmentation (VIPS) algorithm  for page segmentation, and since this algorithm has some shortcomings in extracting page blocks accurately, BTW applies the text window method  to the text of page blocks to extract link contexts more precisely. We have conducted empirical studies on the performance of the proposed method and compared it with the most effective existing approaches based on different metrics. The rest of this paper is organized as follows: the next section reviews related work, section three describes the proposed method in detail, section four discusses the experimental results, and the last section concludes.
Based on the scope of this paper we investigate three interrelated fields: cost-sensitive data acquisition, topical crawling, and link context extraction methods.
Cost-Sensitive Data Acquisition
Much research has been done in fields such as active learning and cost-sensitive feature selection and extraction that, from some viewpoints, falls under cost-sensitive data acquisition. The active learning method in  considers the cost of labeling instances for the proposed recommender system. The authors of  used a combination of deep and active learning for image classification and tried to minimize the cost of assigning labels to instances. Recently, the researchers in  proposed a combination of classifier chains and penalized logistic regression that takes feature costs into account. Liu et al. proposed a cost-sensitive feature selection method for imbalanced class problems .
Fig. 1. Illustration of link context extraction methods by typical samples: using whole page text, link text, a DOM-based method, a text window method, and an appropriate block-based method.
However, very little research considers the cost of collecting cases. Weiss et al.  proposed a cost- and utility-based evaluation framework that considers all steps of a machine learning process. They refer to the cost of cases as the cost associated with acquiring complete training examples. Based on the definitions of , the induced model A has more utility than the induced model B if and only if:
Cost_total(A) < Cost_total(B)    (1)
Cost_total is the sum of all costs during the different stages of the classification problem and can be computed as:
Cost_total(M) = Cost_data_acquisition(M) + Cost_model_induction(M) + Cost_misclassification&model_application(M)    (2)
where the cost of data acquisition includes the cost of collecting instances, features (tests), and labels. The cost of model induction includes computational costs. The last cost in (2) covers the misclassification errors and the computational cost incurred while using the models. In the current research, we focus on the cost of collecting web page instances from the web, which can be measured as effective bandwidth usage by topical crawlers.
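As a toy illustration of the utility criterion in (1) and (2), the comparison can be sketched in Python; the cost figures below are invented for the example and are not taken from the paper:

```python
def total_cost(data_acquisition, model_induction, misclassification_and_application):
    """Sum the three cost components of Eq. (2) for one induced model."""
    return data_acquisition + model_induction + misclassification_and_application

# Hypothetical models: A spends more bandwidth acquiring pages but
# misclassifies less; B acquires cheaply but pays at application time.
cost_a = total_cost(data_acquisition=120.0, model_induction=10.0,
                    misclassification_and_application=30.0)
cost_b = total_cost(data_acquisition=60.0, model_induction=10.0,
                    misclassification_and_application=110.0)

# By Eq. (1), A has more utility than B iff Cost_total(A) < Cost_total(B).
assert cost_a < cost_b
```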
Topical Crawling Methods
Diligenti et al. introduced an interesting data model called the context graph . This model maintains the contents of some training web pages and their distances from relevant target pages in a layered structure. Each layer represents pages with the same distance to relevant pages. By training a classifier for each layer, the distance between newly visited pages and target pages can be determined. Han et al. utilized reinforcement learning for topical web crawling . They formulated the problem as a Markov decision process and proposed a new representation of states and actions considering both content information and link structure. The researchers of  used a Hidden Markov Model (HMM) to compute the probability that current links lead to relevant pages; this model needs heavy user interaction to build the HMM. In a recent paper, Farag et al.  proposed a topical crawler for automatic event tracking and archiving. In the next part, we categorize topical crawling methods based on their link context extraction strategy.
Link Context Extraction Methods
Link context extraction methods can be categorized into four groups: using whole page text and link text, the text window method, DOM-based methods, and block-based methods. Fig. 1 illustrates these methods with typical samples. We describe them in more detail next.
Using Whole Page Text and Link Text: The simplest method for link context extraction is to consider the whole text of a web page as the link context of all of the page's links. Fish search  used this method to score the links of a web page, so all links inside a page receive the same priority for crawling. Another version of this method uses the link text as link context but scores each link using a combination of the whole-page-text relevancy and the link-context relevancy to the desired topic. This combination can be done with the following formula:
link_score = β × Relevancy(page_text) + (1 − β) × Relevancy(link_context)    (3)
where link_score is the score of a link inside a web page, page_text is the whole page text, and link_context is the extracted link context of the link, which in this version of the best first method is equivalent to the link text. The Relevancy function computes the relevancy of the given input to the desired topic.
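A minimal sketch of formula (3); the Relevancy function here is a stand-in (simple topic-term overlap), not the MLP classifier the paper uses later, and all names are ours:

```python
def relevancy(text, topic_terms):
    """Toy relevancy: fraction of topic terms appearing in the text."""
    words = set(text.lower().split())
    return sum(1 for t in topic_terms if t in words) / len(topic_terms)

def link_score(page_text, link_context, topic_terms, beta=0.25):
    """Formula (3): combine page-text and link-context relevancy.

    beta weights the page text; the paper's experiments use beta = 0.25.
    """
    return (beta * relevancy(page_text, topic_terms)
            + (1 - beta) * relevancy(link_context, topic_terms))
```

For example, a link whose context fully matches the topic inside a half-relevant page scores 0.25 × 0.5 + 0.75 × 1.0 = 0.875.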
Text Window Method: In this simple method, for each hyperlink a window of T words around the appearance of the hyperlink within a page is considered as its link context . The window is made symmetric with respect to the link text whenever possible, meaning that it contains T/2 words appearing before and T/2 words appearing after the link text. The text of the hyperlink is always included in the window. This method has an unresolved challenge: we do not know the optimal, or even near-optimal, number of link context terms around a hyperlink.
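The window extraction described above can be sketched as follows (function and variable names are ours, not from the cited work):

```python
def text_window(link_text, page_words, T):
    """Return T/2 words before and T/2 words after the anchor text,
    plus the anchor text itself (always included)."""
    anchor = link_text.split()
    n = len(page_words)
    # Locate the anchor's word sequence in the page word sequence.
    for i in range(n - len(anchor) + 1):
        if page_words[i:i + len(anchor)] == anchor:
            before = page_words[max(0, i - T // 2):i]
            after = page_words[i + len(anchor):i + len(anchor) + T // 2]
            return before + anchor + after
    return anchor  # anchor not found: fall back to the link text alone
```

Note how the window is truncated near page boundaries, so it is symmetric only "whenever possible", as stated above.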
DOM-Based Methods: The Document Object Model (DOM) of a web page models the page as a tree whose nodes are the HTML tags and whose edges reflect their nesting. This model is used for link context extraction in some topical crawling methods. Based on the idea that the texts in different parts of a page and their distances from a hyperlink can help predict the relevancy of the hyperlink's target page, Chakrabarti et al. used the DOM tree of a web page to compute the distances of text tokens, positioned in different leaves of the DOM tree, from a hyperlink within that page . In another work, Pant et al. used the DOM tree of a web page to extract the link context of hyperlinks whose anchor text is shorter than a predefined threshold . Note, however, that the HTML format was originally proposed to describe a web page's layout for web browsers and is not a structured data format such as XML.
Block-Based Methods: Related parts of a page are called page blocks, and the procedure of extracting page blocks from a web page is called page segmentation. Link context extraction methods based on page segmentation use the text of a page block as the link context of the hyperlinks positioned in that block, and this approach is expected to be more accurate and effective for topical web crawling than the other methods introduced. Reported results in  support this expectation.
The algorithm proposed in , called VIPS, also utilizes the HTML structure of a web page for page segmentation but does not rely on it entirely. VIPS uses visual cues of a web page such as background colors, font sizes, font styles, and many other features that affect the layout of a web page and can help in finding related parts of a page, which are the blocks. In this algorithm, the vision-based content structure of a page is obtained by combining the DOM structure and the visual cues. The segmentation process has three steps: block extraction, separator detection, and content structure construction. These three steps as a whole are regarded as a round. The algorithm is top-down: first, the web page is segmented into several big blocks and the hierarchical structure of this level is recorded; then, for each big block, the same segmentation process is carried out recursively until sufficiently small blocks are produced whose degree of coherence (DoC) values exceed a predefined degree of coherence (PDoC). After the segmentation process, all the leaf nodes are extracted as blocks. VIPS is one of the strongest algorithms for web page segmentation, and we utilize it in our hybrid method for link context extraction.
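The top-down loop described above can be sketched as a small recursion; the `segment` and `degree_of_coherence` callbacks below are placeholders standing in for the visual-cue machinery of real VIPS:

```python
def vips_segment(block, pdoc, segment, degree_of_coherence):
    """Recursively split a block until every leaf's DoC exceeds PDoC.

    Returns the leaf blocks, which a block-based link context method
    then treats as the page's blocks.
    """
    if degree_of_coherence(block) > pdoc:
        return [block]                       # coherent enough: keep as a leaf
    leaves = []
    for child in segment(block):             # one round: extract sub-blocks
        leaves.extend(vips_segment(child, pdoc, segment, degree_of_coherence))
    return leaves

# Toy content structure: a loose root block with two coherent children.
tree = {"doc": 0.3, "children": [{"doc": 0.8, "children": []},
                                 {"doc": 0.9, "children": []}]}
leaves = vips_segment(tree, 0.6,
                      segment=lambda b: b["children"],
                      degree_of_coherence=lambda b: b["doc"])
```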
Cost-Sensitive Web Data Acquisition
The focus of this research is to propose a method capable of collecting page instances from the web with minimum cost. To achieve this goal we use a novel topical crawling method and evaluate its performance based on standard metrics. The proposed method is inspired by . Their experimental studies in the field of web information retrieval show that the best results are achieved when a combination of the VIPS algorithm and the text window method is used for page segmentation. In the rest of this section, we explain our motivation for using this hybrid method for link context extraction, based on the empirical studies of  and reported results in the field of topical web crawling. In addition, the proposed hybrid link context extraction method is discussed in detail.
Challenges of Solely Using Text Window and VIPS for Link Context Extraction
In this subsection, we investigate the challenges of using the text window method or the VIPS algorithm alone for link context extraction, and make clear our intent in proposing the hybrid method that combines both.
Challenges of the Text Window Method: The text window method has two major challenges it cannot resolve on its own:
The position of the link context cannot be determined efficiently.
The optimal (or even near-optimal) size of the text window is unknown.
Typical text window methods first extract the whole text of a web page using a depth-first traversal of the page's DOM tree and then, based on the position of a hyperlink, determine the words located in its neighborhood. This approach cannot locate link context positions reliably, for two reasons. First, the HTML structure of a web page does not necessarily reflect the visual layout a web browser presents to users, so a typical depth-first traversal (or any other blind traversal strategy) unaware of the visual layout cannot place related texts next to each other. Second, considering a symmetric text window around a hyperlink may, in many cases, put noisy words from one side of the window into the link context, because for many hyperlinks the appropriate link context appears either before or after the hyperlink's text, not on both sides. Moreover, the suitable size of the text window for link context extraction is not clear at all , because web pages belonging to different topical domains usually have different page layouts and structures; for example, political news web pages have a different layout from academic ones.
Challenges of the VIPS Algorithm: When the VIPS algorithm is used in a link context extraction method, the method faces two main challenges:
Wrongly extracted blocks may contain noisy words.
Blocks with long texts may involve noisy words.
These two challenges can result in wrong predictions of a link's target page relevancy. Although VIPS has many advantages, in some situations it may work incorrectly and produce three kinds of wrongly extracted blocks:
Blocks segmented more than they should be, containing some parts that truly belong to other blocks.
Blocks segmented more than they should be, with all of their parts truly belonging to them.
Blocks segmented less than they should be, containing some parts that truly belong to other blocks.
If a block is segmented more than it should be, it rarely contains parts that truly belong to other blocks, because its DoC with the real parts it no longer contains is likely to be much higher than its current DoC. In other words, a block that is partitioned more than it should be is more likely to contain its own parts than another block's. In cases where a block is segmented more than it should be but does not contain parts of other blocks, it contains no noisy words, and the link context extracted from its content is fragmentary but still clean and not misleading. However, our experimental observations show that the third kind of wrongly extracted block, segmented less than it should be and containing parts that truly belong to other blocks, is produced many times by the VIPS algorithm. This kind of block contains a considerable number of noisy words which, if included in the link contexts of the block's hyperlinks, can cause wrong predictions of their target pages' relevancies.
In addition to the problem of wrongly extracted blocks, when we use the VIPS algorithm we encounter a set of blocks with highly variable lengths. The statistics reported in  show that 19% of blocks are larger than 200 words. If the whole text of a long block is considered as the context of the hyperlinks it contains, the extracted link contexts will probably involve noisy words. Intuitively, we know that most link contexts are not that long. Thus the variable-length problem persists even if we extract the link contexts of hyperlinks from the text of the blocks containing them.
Block Text Window Method
We propose a hybrid link context extraction method that combines the VIPS page segmentation algorithm with the text window method. This hybrid method, called block text window (BTW), employs the advantages of each of the combined methods to overcome the challenges of the other. Like CombPS , our BTW method has two steps:
Step 1: block extraction from web pages using the VIPS algorithm. In this step, the given page is passed to the VIPS algorithm as input. VIPS constructs the vision-based content structure, and all the leaf nodes of this structure are considered the page's blocks and are sent to the next step as inputs.
Step 2: link context extraction from blocks using the text window method. The text window method is applied to the hyperlinks contained in each input block, and link contexts are extracted from the text of the blocks in which the hyperlinks appear. Unlike CombPS, we do not segment a block based on a set of overlapping windows. For each hyperlink, the text window consists of the T/2 words before and the T/2 words after the position of the hyperlink's text in its surrounding block.
method: BTW-Link-Context-Extractor(link, cStructure, T)
input:  link: a hyperlink of the page
        cStructure: hierarchical vision-based content structure of the page
        T: size of the text window
output: linkContext: link context of the input hyperlink

for each leaf ∈ cStructure.leafs do    // each leaf is a block
    if link ∈ leaf.links then
        (textWinBefore, textWinAfter) ← Text-Win(link.text, leaf.text, T)
        linkContext ← textWinBefore link.text textWinAfter
        return linkContext

Fig. 2. Pseudocode of the proposed block text window (BTW) method.
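The pseudocode above can be sketched in Python as follows; representing leaf blocks as plain text strings is a simplification of the VIPS content structure, and the names are ours:

```python
def btw_link_context(link_text, leaf_blocks, T):
    """BTW sketch: find the leaf block containing the hyperlink, then
    apply a T-word text window to that block's text."""
    anchor = link_text.split()
    for block_text in leaf_blocks:          # each leaf block of the page
        words = block_text.split()
        # Locate the anchor text inside this block.
        for i in range(len(words) - len(anchor) + 1):
            if words[i:i + len(anchor)] == anchor:
                before = words[max(0, i - T // 2):i]
                after = words[i + len(anchor):i + len(anchor) + T // 2]
                return " ".join(before + anchor + after)
    return link_text  # link not found in any block: fall back to link text
```

Blocks shorter than the window size pass through unchanged, mirroring the remark below that the second step does not alter link contexts of hyperlinks in short blocks.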
Fig. 2 shows the pseudocode of the proposed BTW method. The first step is the basis of our hybrid link context extraction method. The blocks extracted by the VIPS algorithm approximately determine the appropriate position of link contexts, which for each hyperlink is the text of its containing block. Also, the length of the majority of blocks is less than 50 words , which means the VIPS algorithm can determine link context sizes, in addition to their positions, for many hyperlinks.
What happens in the second step is fine-tuning. In this step we intend to remove noisy words that truly belong to blocks other than the ones in which they are contained. Of course, we are aware of the disadvantage of applying text windows to block text for link context extraction: it drops some truly related parts of blocks, making some link contexts imperfect. We believe that in the tradeoff between keeping noisy words and dropping some related parts of blocks, the second option is more beneficial and can improve the quality of the extracted link contexts and the topical crawling performance. The experimental results of this paper support this assertion. It should be noted that the second step does not change the link contexts of hyperlinks contained in blocks shorter than the applied window size.
In conclusion, the first step of the proposed hybrid method, by utilizing the VIPS algorithm for page segmentation, yields the following advantages:
The appropriate position of link contexts can be determined.
For the majority of blocks, which are smaller than the suitable size of a text window (about 40 words ), the appropriate link context size is determined.
And in the second step, applying the text window to the extracted blocks has the following benefits:
Noisy words of many wrongly extracted blocks are prevented from inclusion in link contexts.
Link context sizes are normalized, and noisy words that likely exist in blocks with long texts are dropped from link contexts.
In the next section, we empirically evaluate our hybrid link context extraction approach and compare its performance with some other methods in the field of topical web crawling.
In this section, we first describe the metrics used for performance evaluation. Then we describe the experimental settings. Finally, the experimental results are presented and analyzed based on the evaluation metrics.
Since the main cost in our topical data acquisition problem comes from bandwidth usage, we use standard metrics in topical crawling that evaluate different methods by their success at collecting on-topic pages with minimal waste of this resource. We use two standard evaluation metrics, harvest rate and target recall, to describe and compare the performance of different link context extraction methods for topical crawling. These metrics have been used for topical crawling evaluations in many studies , , .
Harvest Rate: The harvest rate for t pages fetched from the start of the crawling process until now, H(t), is computed by the following formula:

H(t) = (1/t) × Σ_{i=1..t} r_i    (4)

where r_i is the relevancy of page i fetched by the crawler. This relevancy is computed from the binary output of an evaluator classifier, which we discuss further in the next subsection.
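The harvest rate computation can be sketched directly from its definition:

```python
def harvest_rate(relevancies):
    """H(t) = (1/t) * sum of r_i over the t pages fetched so far,
    where each r_i is the evaluator classifier's binary output (0 or 1)."""
    t = len(relevancies)
    return sum(relevancies) / t if t else 0.0
```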
Target Recall: To compute this metric, a set of target pages, T, is first specified. If R(t) is the target recall of a topical crawler for t pages fetched up to now, then:

R(t) = |T ∩ C(t)| / |T|    (5)

where C(t) is the set of pages crawled from the beginning until page t and |T| is the number of pages in T.
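Likewise, target recall follows directly from its set definition:

```python
def target_recall(crawled_urls, target_urls):
    """R(t) = |T ∩ C(t)| / |T|: the fraction of the target set T
    already found among the crawled pages C(t)."""
    T = set(target_urls)
    return len(T & set(crawled_urls)) / len(T) if T else 0.0
```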
Compared Methods: Four distinct methods are compared in this paper: the best first method, the text window method, a block-based method, and our proposed hybrid method. The reported results of  and  show better performance for the versions of the best first and text window methods that utilize both link context and page text for computing link scores. Hence we implemented these versions and used formula (3) for combining relevancy scores. In this formula, link_context is the link text for the best first method and the text of the window for the text window method.
In the block-based method, the VIPS algorithm is used for page segmentation, and the performance of two versions of this method is evaluated. One version uses block texts as link contexts and computes link scores based on context relevancies; the other combines the relevancy of the link context with the relevancy of the page text. The proposed hybrid method extracts link contexts as described in section three. As with the block-based method, we evaluated two versions of BTW. Formula (3) is used for combining page text and link context relevancy in the relevant versions of the block-based and BTW methods. The β factor in formula (3) is set to 0.25, found to be effective in , for all methods that combine the page text score with the link context score. Window sizes for the text window and BTW methods are set to 10, 20, and 40, as in . The PDoC parameter of the VIPS algorithm is set to 0.6 according to .
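For reference, the parameter settings listed above can be collected in one place; the values are those stated in the text, while the structure and names are ours:

```python
# Experimental parameter settings (values as reported in the text).
EXPERIMENT_CONFIG = {
    "beta": 0.25,                  # page-text weight in formula (3)
    "window_sizes": [10, 20, 40],  # T for the text window and BTW methods
    "pdoc": 0.6,                   # VIPS degree-of-coherence threshold
}
```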
Target Topics: Five distinct topics from different domains were selected from the Open Directory Project (ODP)  to evaluate the topical crawling methods. ODP is a comprehensive human-edited directory of the web that contains URLs related to a hierarchy of topics in many domains. The selected topics are Algorithm, Java, England football, Olympics, and Graph theory. We extracted the URLs corresponding to each topic and built a URL set per topic. A portion of each URL set is used as the target set T for computing the target recall metric. The reported results for each method are the averages of the results obtained over the target topics.
Evaluator and Conductor Classifiers: The pages corresponding to the URL sets of each topic were fetched, and two types of classifiers were trained using the fetched pages as learning instances. The first type, which we call evaluator classifiers, is responsible for computing the relevancy of fetched pages to a topic during the topical crawling process and for computing r_i in the harvest rate formula. We train an evaluator classifier for each topic using all pages of the topic's URL set as positive examples; as negative examples, we randomly select twice as many pages belonging to other ODP topics. The second type of classifier is used by the crawlers to compute the relevancy of different parts of pages to a topic; we call them conductor classifiers. They are trained like the evaluator classifiers for each topic, but they do not use the whole set of positive training samples: they use only the positive samples not included in the topic's target set T. A Multi-Layer Perceptron (MLP) neural network is employed as the classification algorithm, which is a good choice for topical crawling .
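A possible sketch of training such a topic classifier; the paper specifies an MLP but not an implementation, so the use of scikit-learn with TF-IDF features here is entirely our assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def train_topic_classifier(positive_texts, negative_texts):
    """Train a binary on-topic classifier from page texts (1 = on-topic).

    Evaluator and conductor classifiers differ only in which positive
    samples they receive, so both can be built with this helper.
    """
    texts = positive_texts + negative_texts
    labels = [1] * len(positive_texts) + [0] * len(negative_texts)
    model = make_pipeline(TfidfVectorizer(), MLPClassifier(max_iter=300))
    model.fit(texts, labels)
    return model
```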
We established a relatively long crawling process by fetching 30000 pages for each pair of topic and method.
Fig. 3. Performance comparison between BTW and the text window method based on the average harvest rate metric.
Fig. 4. Performance comparison between BTW and the text window method based on the average target recall metric.
Fig. 5. Performance comparison between BTW, best first, and the block-based method based on the average harvest rate metric.
Fig. 6. Performance comparison between BTW, best first, and the block-based method based on the average target recall metric.
Figs. 3 and 4 compare the text window and BTW methods for different window sizes based on the harvest rate and target recall metrics, respectively. As can be seen in Fig. 3, the harvest rate of every BTW method exceeds that of the text window method with the same window size after crawling 30000 pages. The superiority of BTW over the text window method increases for larger window sizes. The same holds for the target recall metric, except for window size 40, where the target recall of BTW is lower than, but very close to, that of the text window method. These results show that the BTW method can deliver better topical crawling performance than the text window method. We believe that harvest rate reflects topical crawling performance better than target recall, because harvest rate computes the relevancy of each crawled web page using the evaluator classifiers, whereas target recall is concerned only with pages in the target set T and ignores many relevant pages not included in T.
In Figs. 5 and 6 we compare two versions of BTW with one version of the best first method and two versions of the block-based method, as described in previous sections. We used window size 40 for the BTW method, found to be effective in the previous experiment. Fig. 5 shows the harvest rate and Fig. 6 the target recall of the methods. As can be seen in Fig. 5, the version of BTW that combines page text relevancy with extracted link context relevancy has a better harvest rate than the other methods. An interesting point in Fig. 5 is the higher harvest rate of the block-based method that does not use page text compared with the version that does; this does not occur for the evaluated versions of the BTW method.
The dramatic changes in the methods' performance even after crawling 10000 pages, especially on the harvest rate metric, show the necessity of evaluating topical crawling methods over longer crawling runs, as done in this paper.
Table 1. Comparison of Methods After Crawling 30000 Pages

Link Context Extraction Method        Harvest Rate      Target Recall
                                      AVG     STDV      AVG     STDV
Best First – Page Text & Link Text    0.57    0.33      0.05    0.06
Page Text & Text Window 10            0.59    0.30      0.07    0.06
Page Text & Text Window 20            0.64    0.23      0.07    0.05
Page Text & Text Window 40            0.56    0.28      0.09    0.08
Page Text & Block Text                0.60    0.27      0.10    0.10
Block Text                            0.67    0.21      0.08    0.08
Page Text & BTW 10                    0.61    0.24      0.11    0.09
Page Text & BTW 20                    0.69    0.13      0.11    0.12
Page Text & BTW 40                    0.72    0.17      0.09    0.08
BTW 40                                0.55    0.23      0.07    0.05
For an overall comparison, the average harvest rate and target recall and their standard deviations after crawling 30000 pages are reported in Table 1 for the different methods. Heuristically, it is clear that a short window size of 10 makes the BTW method equivalent to the text window method in some respects, as their close performance in Table 1 empirically shows. The two BTW methods that use the combinational version with window sizes of 20 and 40 words perform better than all the other methods in terms of average harvest rate and its standard deviation. These two methods also achieve appropriate average target recall compared with the other evaluated methods. The standard deviations reported in Table 1 are not surprising when compared to the standard errors reported in  and .
In this paper, we investigated the cost-sensitive data acquisition problem with a focus on the cost of collecting pages as learning instances from the web. We used novel topical crawlers as an efficient tool for effective use of available bandwidth, reducing the cost of collecting instances. These crawlers use the newly proposed BTW method, a hybrid link context extraction method for topical web crawling. The method combines the text window method with a block-based method that uses the VIPS algorithm for page segmentation, and overcomes the challenges of each method by exploiting the advantages of the other. BTW can find the position and size of link contexts with acceptable performance. Experimental results show the superiority of BTW over other link context extraction methods for topical crawling.
G. M. Weiss and Y. Tian, “Maximizing classifier utility when there are data acquisition and modeling costs,” Data Min. Knowl. Discov., vol. 17, no. 2, pp. 253–282, 2008.
G. Pant and P. Srinivasan, “Link contexts in classifier-guided topical crawlers,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 107–122, Jan. 2006.
D. Cai, S. Yu, J. R. Wen, and W. Y. Ma, “VIPS: a vision-based page segmentation algorithm,” Microsoft Technical Report, MSR-TR-2003-79, 2003.
N. Rubens, M. Elahi, M. Sugiyama, and D. Kaplan, “Active Learning in Recommender Systems,” in Recommender Systems Handbook, Boston, MA: Springer US, 2015, pp. 809–846.
Y. Gal, R. Islam, and Z. Ghahramani, “Deep Bayesian Active Learning with Image Data,” in Proceedings of the 34th International Conference on Machine Learning, 2017, vol. 70, pp. 1183–1192.
P. Teisseyre, D. Zufferey, and M. Słomka, “Cost-sensitive classifier chains: Selecting low-cost features in multi-label classification,” Pattern Recognit., vol. 86, pp. 290–319, Feb. 2019.
M. Liu, C. Xu, Y. Luo, C. Xu, Y. Wen, and D. Tao, “Cost-Sensitive Feature Selection by Optimizing F-Measures,” IEEE Trans. Image Process., vol. 27, no. 3, pp. 1323–1335, Mar. 2018.
G. M. Weiss and Y. Tian, “Maximizing classifier utility when training data is costly,” ACM SIGKDD Explor. Newsl., vol. 8, no. 2, pp. 31–38, 2006.
M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori, “Focused Crawling Using Context Graphs,” in Proceedings of 26th VLDB Conference, 2000, pp. 527–534.
M. Han, P.-H. Wuillemin, and P. Senellart, “Focused Crawling Through Reinforcement Learning,” in ICWE 2018: Web Engineering, 2018, pp. 261–278.
S. Batsakis, E. G. M. Petrakis, and E. Milios, “Improving the performance of focused web crawlers,” Data Knowl. Eng., vol. 68, no. 10, pp. 1001–1013, Oct. 2009.
M. M. G. Farag, S. Lee, and E. A. Fox, “Focused crawler for events,” Int. J. Digit. Libr., vol. 19, no. 1, pp. 3–19, Mar. 2018.
P. M. E. De Bra and R. D. J. Post, “Information retrieval in the World Wide Web: Making client-based searching feasible,” Comput. Networks ISDN Syst., vol. 27, no. 2, pp. 183–192, 1994.
S. Chakrabarti, K. Punera, and M. Subramanyam, “Accelerated focused crawling through online relevance feedback,” in Proceedings of the eleventh international conference on World Wide Web – WWW ’02, 2002, pp. 148–159.
T. Peng, C. Zhang, and W. Zuo, “Tunneling enhanced by web page content block partition for focused crawling,” Concurr. Comput. Pract. Exp., vol. 20, no. 1, pp. 61–74, 2008.
D. Cai, S. Yu, J. R. Wen, and W. Y. Ma, “Block-based web search,” in Proceedings of the 27th ACM SIGIR conference, 2004, pp. 456–463.
C. Wang, Z. Guan, C. Chen, J. Bu, J. Wang, and H. Lin, “On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis,” J. Zhejiang Univ. Sci. A, vol. 10, no. 8, pp. 1114–1124, Aug. 2009.
Open Directory Project (ODP), “https://dmoztools.net/,” accessed January 2019.