When Beautiful Soup parses a document, it loads a large, densely
connected data structure into memory. If you just need a string from
that data structure, you might think that you can grab the string and
leave the rest of it to be garbage collected. Not so. That string is a
NavigableString object. It has a parent member that points to a Tag
object, which points to other Tag objects, and so on. So long as you
hold on to any part of the tree, you're keeping the whole thing in
memory.

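The chain of references described above can be made visible with a short sketch. It assumes the bs4 package (Beautiful Soup 4) and Python's built-in html.parser; the sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Illustrative markup (an assumption, not taken from any real page).
html = "<html><body><div><p>Hello, <b>world</b></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

text = soup.b.string  # the string inside <b>
print(isinstance(text, NavigableString))

# Follow the parent pointers from the string up to the root of the tree.
chain = []
node = text.parent
while node is not None:
    chain.append(node.name)
    node = node.parent

print(chain)
```

Every name in the chain is an object that stays reachable for as long as the string itself is referenced.
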
The extract method breaks those connections. If you call extract on
the string you need, it gets disconnected from the rest of the parse
tree. The rest of the tree can then go out of scope and be garbage
collected, while you use the string for something else. If you just
need a small part of the tree, you can call extract on its top-level
Tag and let the rest of the tree get garbage collected.

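A minimal sketch of both cases, assuming the bs4 package and html.parser. The helper names grab_title and grab_table and the sample markup are hypothetical. Each helper lets its soup go out of scope on return, so only the extracted piece remains referenced.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption for this sketch).
HTML = (
    "<html><head><title>Page title</title></head>"
    "<body><p>filler</p><table><tr><td>cell</td></tr></table></body></html>"
)

def grab_title(html):
    """Return the title string, detached from the parse tree."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string
    title.extract()  # cut the string's links to the rest of the tree
    return title

def grab_table(html):
    """Return the <table> subtree, detached from the parse tree."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.table.extract()  # extract returns the detached element

title = grab_title(HTML)
table = grab_table(HTML)

print(title, title.parent)
print(table.td.string, table.parent)
```

After extract, the detached string has no parent, and the detached table keeps its own children but no longer points back into the document.
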
This works the other way, too. If there's a big chunk of the document
you don't need, you can call extract to rip it out of the tree, then
abandon it to be garbage collected while retaining control of the
(smaller) tree.

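The reverse direction might look like the following sketch, again assuming bs4 and html.parser. The sidebar markup is invented to stand in for a large unwanted chunk.

```python
from bs4 import BeautifulSoup

# Build a document with one bulky section we do not care about.
sidebar = "<div id='sidebar'>" + "<li>link</li>" * 1000 + "</div>"
html = "<html><body>" + sidebar + "<p>Keep me</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

unwanted = soup.find("div", id="sidebar")
unwanted.extract()  # detach the whole subtree from the soup
del unwanted        # drop our only remaining reference to it

remaining = len(soup.find_all("li"))
print(remaining, soup.p.string)
```

The soup that remains holds only the smaller tree; the detached sidebar is no longer referenced from it.
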
If you find yourself destroying big chunks of the tree, you might
have been able to save time by not parsing that part of the tree in
the first place (see Improving Performance by Parsing Only Part of the
Document).
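As a rough illustration of parsing only part of a document, the sketch below assumes Beautiful Soup 4, where the SoupStrainer class and the parse_only argument serve this purpose. Older releases spell this differently, so treat the exact names as version-dependent. The sample markup is invented.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Illustrative markup (an assumption for this sketch).
html = "<div><h1>Title</h1><p>one</p><span>x</span><p>two</p></div>"

# Only <p> tags (and their contents) are turned into tree objects.
only_paragraphs = SoupStrainer("p")
soup = BeautifulSoup(html, "html.parser", parse_only=only_paragraphs)

paragraphs = [p.string for p in soup.find_all("p")]
print(paragraphs)
print(soup.find("span"))
```

Nothing outside the matched tags ends up in the tree, so there is nothing to extract afterwards.
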