
Improving Memory Usage with extract

When Beautiful Soup parses a document, it loads into memory a large, densely connected data structure. If you just need a string from that data structure, you might think that you can grab the string and leave the rest of it to be garbage collected. Not so. That string is a NavigableString object. It's got a parent member that points to a Tag object, which points to other Tag objects, and so on. So long as you hold on to any part of the tree, you're keeping the whole thing in memory.

The extract method breaks those connections. If you call extract on the string you need, it gets disconnected from the rest of the parse tree. The rest of the tree can then go out of scope and be garbage collected, while you use the string for something else. If you just need a small part of the tree, you can call extract on its top-level Tag and let the rest of the tree get garbage collected.
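A minimal sketch of pulling one string out of a document so that the rest of the tree can be collected. It assumes the bs4 package (Beautiful Soup 4) and Python's built-in html.parser; the function name first_heading_text and the sample markup are hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup


def first_heading_text(html):
    """Return the text of the first <h1>, detached from the parse tree."""
    soup = BeautifulSoup(html, "html.parser")
    # extract() removes the string from the tree and returns it,
    # so it no longer holds a reference to its parent Tag.
    return soup.h1.string.extract()
    # soup goes out of scope here; nothing reachable from the
    # returned string keeps the rest of the tree alive.


heading = first_heading_text(
    "<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>"
)
print(heading)         # Title
print(heading.parent)  # None
```

The same pattern applies to a Tag: calling extract on the top-level Tag of the part you need detaches that whole subtree from the document.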

This works the other way, too. If there's a big chunk of the document you don't need, you can call extract to rip it out of the tree, then abandon it to be garbage collected while retaining control of the (smaller) tree.
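The reverse case might look like the following sketch, again assuming bs4 with html.parser. The sidebar markup and the id values are invented for the example.

```python
from bs4 import BeautifulSoup

html = (
    "<html><body>"
    "<div id='sidebar'><p>ad 1</p><p>ad 2</p></div>"
    "<p id='main'>Keep me</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Detach the unwanted subtree. The return value is discarded, so
# nothing refers to it afterwards and it can be garbage collected.
soup.find("div", id="sidebar").extract()

# The smaller tree is still fully usable.
remaining_text = soup.get_text()
print(remaining_text)  # Keep me
```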

If you find yourself destroying big chunks of the tree, you might have been able to save time by not parsing that part of the tree in the first place.
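One way to avoid parsing the unwanted parts at all, sketched here under the assumption that you are using bs4: pass a SoupStrainer as the parse_only argument so that only matching elements are built into the tree. The sample markup is hypothetical.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    "<html><body>"
    "<div id='nav'><span>menu</span></div>"
    "<a href='/a'>A</a><a href='/b'>B</a>"
    "</body></html>"
)

# Only <a> tags (and their contents) are kept while parsing;
# everything else is never added to the tree.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['/a', '/b']
```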