intbitset

About

Provides an intbitset data object holding unordered sets of unsigned integers with ultra fast set operations, implemented via bit vectors and Python C extension to optimize speed and memory usage.

Emulates the Python built-in set class interface with some additional specific methods such as its own fast dump and load marshalling functions.

intbitset additionally support the pickle protocol, the iterator protocol and can behave like a sequence type.

Usage

Example:

>>> from intbitset import intbitset
>>> x = intbitset([1,2,3])
>>> y = intbitset([3,4,5])
>>> x & y
intbitset([3])
>>> x | y
intbitset([1, 2, 3, 4, 5])

Notes

  • Uses real bits to optimize memory usage, so may have issues with endianness

if you transport serialized bitsets between various machine architectures.

  • Please note that no bigger than __maxelem__ elements can be added to an intbitset.
  • On modern CPUs, vectorial instruction sets (such as MMX/SSE) are exploited

to further optimize speed.

Performance

Here is an example of performance gain with respect to traditional set of positive integers (example of ipython session):

>>> ## preparation
>>> from intbitset import intbitset
>>> from random import sample
>>> sparse_population1 = sample(range(1000000), 10000)
>>> sparse_population2 = sample(range(1000000), 10000)
>>> dense_population1 = sample(range(1000000), 900000)
>>> dense_population2 = sample(range(1000000), 900000)
>>> sparse_set1 = set(sparse_population1)
>>> sparse_set2 = set(sparse_population2)
>>> sparse_intbitset1 = intbitset(sparse_population1)
>>> sparse_intbitset2 = intbitset(sparse_population2)
>>> dense_set1 = set(dense_population1)
>>> dense_set2 = set(dense_population2)
>>> dense_intbitset1 = intbitset(dense_population1)
>>> dense_intbitset2 = intbitset(dense_population2)
>>> sorted(sparse_population1)[5000:5002]
[500095, 500124]
>>> in_sparse = 500095
>>> not_in_sparse = 500096
>>> sorted(dense_population1)[500000:500002]
[555705, 555707]
>>> in_dense = 555705
>>> not_in_dense = 555706

For sparse sets, intbitset operations are typically 50 times faster than set operations.

>>> ## Sparse sets operations
>>> %timeit sparse_set1 & sparse_set2
1000 loops, best of 3: 263 µs per loop
>>> %timeit sparse_intbitset1 & sparse_intbitset2 ## more than 20 times faster
100000 loops, best of 3: 11.6 µs per loop
>>> %timeit sparse_set1 | sparse_set2
1000 loops, best of 3: 891 µs per loop
>>> %timeit sparse_intbitset1 | sparse_intbitset2 ## almost 70 times faster
100000 loops, best of 3: 12.8 µs per loop
>>> %timeit sparse_set1 ^ sparse_set2
1000 loops, best of 3: 1.09 ms per loop
>>> %timeit sparse_intbitset1 ^ sparse_intbitset2 ## more than 80 times faster
100000 loops, best of 3: 12.9 µs per loop
>>> %timeit sparse_set1 - sparse_set2
1000 loops, best of 3: 739 µs per loop
>>> %timeit sparse_intbitset1 - sparse_intbitset2 ## almost 60 times faster
100000 loops, best of 3: 12.5 µs per loop

For dense sets, intbitset operations are typically 5000 times faster than set operations:

>>> ## Dense sets operations
>>> %timeit dense_set1 & dense_set2
10 loops, best of 3: 62.1 ms per loop
>>> %timeit dense_intbitset1 & dense_intbitset2 ## more than 5000 times faster
100000 loops, best of 3: 12.3 µs per loop
>>> %timeit dense_set1 | dense_set2
10 loops, best of 3: 84.1 ms per loop
>>> %timeit dense_intbitset1 | dense_intbitset2 ## more than 6000 times faster
100000 loops, best of 3: 12.5 µs per loop
>>> %timeit dense_set1 ^ dense_set2
10 loops, best of 3: 64.2 ms per loop
>>> %timeit dense_intbitset1 ^ dense_intbitset2 ## more than 5000 times faster
100000 loops, best of 3: 12.6 µs per loop
>>> %timeit dense_set1 - dense_set2
10 loops, best of 3: 38.6 ms per loop
>>> timeit dense_intbitset1 - dense_intbitset2 ## more than 3000 times faster
100000 loops, best of 3: 12.8 µs per loop

Membership operations in intbitset behave in a comparable way than set objects, albeit with slightly better performance:

>>> ## Membership tests
>>> %timeit in_sparse in sparse_set1
10000000 loops, best of 3: 66.8 ns per loop
>>> %timeit in_sparse in sparse_intbitset1 ## 1.5 times faster
10000000 loops, best of 3: 42.8 ns per loop
>>> %timeit not_in_sparse in sparse_set1
10000000 loops, best of 3: 71.3 ns per loop
>>> %timeit not_in_sparse in sparse_intbitset1 ## 1.6 times faster
10000000 loops, best of 3: 44.7 ns per loop
>>> %timeit in_dense in dense_set1
10000000 loops, best of 3: 61.8 ns per loop
>>> %timeit in_dense in dense_intbitset1 ## 1.3 times faster
10000000 loops, best of 3: 45.3 ns per loop
>>> %timeit not_in_dense in dense_set1
10000000 loops, best of 3: 45.5 ns per loop
>>> %timeit not_in_dense in dense_intbitset1 ## similar speed
10000000 loops, best of 3: 41.4 ns per loop

Serialising can be up to 30 times faster:

>>> ## serialization speed
>>> ## note: internally intbitset compress using zlib so we are
>>> ## going to also compress the equivalent set
>>> from zlib import compress, decompress
>>> from marshal import dumps, loads
>>> %timeit loads(decompress(compress(dumps(sparse_set1))))
100 loops, best of 3: 6.55 ms per loop
>>> %timeit intbitset(sparse_intbitset1.fastdump()) ## 15% faster
100 loops, best of 3: 5.63 ms per loop
>>> %timeit loads(decompress(compress(dumps(dense_set1))))
1 loops, best of 3: 565 ms per loop
>>> %timeit intbitset(dense_intbitset1.fastdump()) ## almost 30 times faster for dense sets
10 loops, best of 3: 20.9 ms per loop

Serialising can lead to 20 times smaller footprint:

>>> len(compress(dumps(sparse_set1)))
29349
>>> len(sparse_intbitset1.fastdump()) ## almost half the space
16166
>>> len(compress(dumps(dense_set1)))
1363026
>>> len(dense_intbitset1.fastdump()) ## 5% of the space for dense set
70332

Reference

Defines an intbitset data object to hold unordered sets of unsigned integers with ultra fast set operations, implemented via bit vectors and Python C extension to optimize speed and memory usage.

Emulates the Python built-in set class interface with some additional specific methods such as its own fast dump and load marshalling functions. Uses real bits to optimize memory usage, so may have issues with endianness if you transport serialized bitsets between various machine architectures.

Please note that no bigger than __maxelem__ elements can be added to an intbitset and, if CFG_INTBITSET_ENABLE_SANITY_CHECKS is disabled, you will receive unpredictable results.

Note to developers: If you make modification to this file you have to manually regenerate intbitset.c by running:

$ cython intbitset.pyx

and then commit generated intbitset.c.

class intbitset.intbitset

Defines an intbitset data object to hold unordered sets of unsigned integers with ultra fast set operations, implemented via bit vectors and Python C extension to optimize speed and memory usage.

Emulates the Python built-in set class interface with some additional specific methods such as its own fast dump and load marshalling functions. Uses real bits to optimize memory usage, so may have issues with endianness if you transport serialized bitsets between various machine architectures.

The constructor accept the following parameters:
rhs=0, int preallocate=-1, int trailing_bits=0, bint sanity_checks=CFG_INTBITSET_ENABLE_SANITY_CHECKS, int no_allocate=0:

where rhs can be:

  • int/long for creating allocating empty intbitset that will hold at least rhs elements, before being resized
  • intbitset for cloning
  • bytes for retrieving an intbitset that was dumped into a byte string
  • array for retrieving an intbitset that was dumped into a string stored in an array
  • sequence made of integers for copying all the elements from the sequence. If minsize is specified than it is initially allocated enough space to hold up to minsize integers, otherwise the biggest element of the sequence will be used.
  • sequence made of tuples: then the first element of each tuple is considered as an integer (as in the sequence made of integers).

The other arguments are:

  • preallocate is a suggested initial upper bound on the numbers that will be stored, by looking at rhs a sequence of number.
  • trailing_bits is 1, then the set will contain “all” the positive integers
  • no_allocate and sanity_checks are used internally and should never be set.
__and__

Return the intersection of two intbitsets as a new set. (i.e. all elements that are in both intbitsets.)

__cmp__
__contains__

x.__contains__(y) <==> y in x

__deepcopy__()
__delitem__

x.__delitem__(y) <==> del x[y]

__eq__

x.__eq__(y) <==> x==y

__ge__

x.__ge__(y) <==> x>=y

__getitem__

x.__getitem__(y) <==> x[y]

__gt__

x.__gt__(y) <==> x>y

__hash__
__iadd__

x.__iadd__(y) <==> x+=y

__iand__

Update a intbitset with the intersection of itself and another.

__ior__

Update a intbitset with the union of itself and another.

__isub__

Remove all elements of another set from this set.

__iter__
__ixor__

Update an intbitset with the symmetric difference of itself and another.

__le__

x.__le__(y) <==> x<=y

__len__
__lt__

x.__lt__(y) <==> x<y

__ne__

x.__ne__(y) <==> x!=y

__new__(S, ...) → a new object with type S, a subtype of T
__nonzero__

x.__nonzero__() <==> x != 0

__or__

Return the union of two intbitsets as a new set. (i.e. all elements that are in either intbitsets.)

__pyx_vtable__ = <capsule object NULL>
__rand__

x.__rand__(y) <==> y&x

__reduce__()

helper for pickle

__repr__
__ror__

x.__ror__(y) <==> y|x

__rsub__

x.__rsub__(y) <==> y-x

__rxor__

x.__rxor__(y) <==> y^x

__safe_for_unpickling__ = True
__setitem__

x.__setitem__(i, y) <==> x[i]=y

__str__
__sub__

Return the difference of two intbitsets as a new set. (i.e. all elements that are in this intbitset but not the other.)

__xor__

Return the symmetric difference of two sets as a new set. (i.e. all elements that are in exactly one of the sets.)

add()

Add an element to a set. This has no effect if the element is already present.

clear()
copy()

Return a shallow copy of a set.

difference()

Return a new intbitset with elements from the intbitset that are not in the others.

difference_update()

Update the intbitset, removing elements found in others.

discard()

Remove an element from a intbitset if it is a member. If the element is not a member, do nothing.

extract_finite_list()

Return a finite list of elements sufficient to be passed to intbitset constructor toghether with the proper value of trailing_bits in order to reproduce this intbitset. At least up_to integer are looked for when they are inside the intbitset but not necessarily needed to build the intbitset

fastdump()

Return a compressed string representation suitable to be saved somewhere.

fastload()

Load a compressed string representation produced by a previous call to the fastdump method into the current intbitset. The previous content will be replaced.

get_allocated()
get_size()
get_wordbitsize()
get_wordbytsize()
intersection()

Return a new intbitset with elements common to the intbitset and all others.

intersection_update()

Update the intbitset, keeping only elements found in it and all others.

is_infinite()

Return True if the intbitset is infinite. (i.e. trailing_bits=True was used in the constructor.)

isdisjoint()

Return True if two intbitsets have a null intersection.

issubset()

Report whether another set contains this set.

issuperset()

Report whether this set contains another set.

pop()

Remove and return an arbitrary set element.

Note: intbitset implementation of .pop() differs from the native set
implementation by guaranteeing returning always the largest element.
remove()

Remove an element from a set; it must be a member. If the element is not a member, raise a KeyError.

strbits()

Return a string of 0s and 1s representing the content in memory of the intbitset.

symmetric_difference

Return the symmetric difference of two sets as a new set. (i.e. all elements that are in exactly one of the sets.)

symmetric_difference_update

Update an intbitset with the symmetric difference of itself and another.

tolist()

Legacy method to retrieve a list of all the elements inside an intbitset.

union()

Return a new intbitset with elements from the intbitset and all others.

union_update()

Update the intbitset, adding elements from all others.

update()

Update the intbitset, adding elements from all others.

update_with_signs()

Given a dictionary rhs whose keys are integers, remove all the integers whose value are less than 0 and add every integer whose value is 0 or more

Indices and tables