paraphernalia.torch.clip module

Evaluate images with CLIP.

class CLIP(prompt, anti_prompt=None, detail=None, use_tiling=True, macro=0.5, chops=64, model='ViT-B/32', device=None)[source]

A CLIP-based perceptor that evaluates how well an image matches one or more target text prompts.

The underlying model is limited to (224, 224) resolution, so this class presents it with multiple perspectives on an image:

  • Macro: random crops of 90-100% of the image, used to counteract aliasing

  • Micro: small, near-pixel-perfect random crops, plus an optional tiling that lets the fine details of high-resolution images be processed.

A lot of internals are exposed via methods to facilitate debugging and experimentation.

Parameters
  • prompt (Union[str, List[str]]) – the text prompt to use in general

  • anti_prompt (Optional[Union[str, List[str]]]) – a description to avoid

  • detail (Optional[Union[str, List[str]]]) – a text prompt to use for micro-perception, defaults to “A detail from a picture of {prompt}”

  • use_tiling (bool) – if true, add a covering of near-pixel-perfect perceptors into the mix

  • macro (float) –

  • chops (int) – the number of augmentation operations, split 50-50 between macro and micro

  • model (str) – the name of the CLIP model to load

  • device (Optional[str]) – the torch device to run on

use_tiling

If true, adds a covering of near-pixel-perfect perceptors into the mix.

Type

bool

chops

The number of augmentation operations; these are split 50-50 between macro and micro.

Type

int

Initializes internal Module state, shared by both nn.Module and ScriptModule.
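As an illustration of the multi-view scheme described above, here is a minimal, self-contained sketch of how `chops` crops might be split between macro and micro views and resized to the model's 224×224 input. The crop policy is deliberately simplified and is not the class's actual implementation:

```python
import torch

# Toy sketch of the multi-view idea (not the real CLIP class): an image is
# presented to a 224x224 perceptor as several large "macro" crops plus
# smaller near-pixel-perfect "micro" crops.
SIZE = 224

def random_crop(img, crop):
    """Take one random (crop x crop) slice from a (c, h, w) tensor."""
    _, h, w = img.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    return img[:, top:top + crop, left:left + crop]

def views(img, chops=8, macro=0.5):
    """Split `chops` crops (by the `macro` fraction) between large and small views."""
    n_macro = int(chops * macro)
    out = []
    for i in range(chops):
        if i < n_macro:
            crop = int(0.9 * min(img.shape[1:]))  # macro: ~90-100% of the image
        else:
            crop = SIZE                           # micro: near-pixel-perfect
        v = random_crop(img, crop)
        out.append(torch.nn.functional.interpolate(v[None], size=SIZE)[0])
    return torch.stack(out)  # (chops, c, SIZE, SIZE)

img = torch.rand(3, 512, 512)
v = views(img)
print(v.shape)  # torch.Size([8, 3, 224, 224])
```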

encode_text(text_or_texts)[source]

Encode text.

Returns a detached tensor.

Parameters

text_or_texts (str) – the text or texts to encode

Return type

torch.Tensor

encode_image(batch)[source]

Encode a batch of images.

Does not detach, so gradients can flow back to the input.

Parameters

batch (torch.Tensor) –

Return type

torch.Tensor
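The asymmetry between the two methods above (text detached, image not) is what lets an image be optimised against fixed prompts. A toy demonstration with stand-in linear encoders (hypothetical; the real class wraps a CLIP model):

```python
import torch

# Stand-in encoders (hypothetical; not the real CLIP model).
text_encoder = torch.nn.Linear(10, 512)
image_encoder = torch.nn.Linear(10, 512)

# encode_text-style: prompts are constants, so their embedding is detached.
text_emb = text_encoder(torch.rand(1, 10)).detach()

# encode_image-style: not detached, so gradients can flow back to the image.
img = torch.rand(1, 10, requires_grad=True)
img_emb = image_encoder(img)

sim = torch.cosine_similarity(img_emb, text_emb).sum()
sim.backward()

print(text_emb.requires_grad)  # False
print(img.grad is not None)    # True
```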

get_macro(img)[source]

Get a set of high-level views on an image batch.

Parameters

img (Tensor) – A (b, c, h, w) image batch

Returns

an expanded (b, c, h, w) image batch

Return type

Tensor

get_micro(img)[source]

Get a set of detailed (near pixel-perfect) views on an image batch.

Parameters

img (Tensor) – A (b, c, h, w) image batch

Returns

an expanded (b, c, h, w) image batch

Return type

Tensor
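The two methods above turn a (b, c, h, w) batch into the "combined but contiguous" layout that get_similarity expects: all views of image i sit next to each other in the expanded batch. A simplified sketch using centre crops of varying size (the real methods use random macro/micro crops):

```python
import torch

# Sketch: expand a (b, c, h, w) batch into (b * t, c, size, size), with the
# t views of each image stored contiguously. Centre crops stand in for the
# real random crop policy.
def expand_views(img, t=4, size=224):
    b, c, h, w = img.shape
    out = []
    for i in range(b):
        for j in range(t):
            crop = size + j * 8          # a few different view sizes
            top = (h - crop) // 2
            left = (w - crop) // 2
            v = img[i:i + 1, :, top:top + crop, left:left + crop]
            out.append(torch.nn.functional.interpolate(v, size=size))
    return torch.cat(out)                # (b * t, c, size, size)

batch = torch.rand(2, 3, 256, 256)
views = expand_views(batch)
print(views.shape)  # torch.Size([8, 3, 224, 224])
```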

get_similarity(img, prompts, batch_size, match='all')[source]

Compute the average similarity between a combined-but-contiguous batch of images and a set of prompts.

Parameters
  • img (Tensor) – A combined-but-contiguous image batch with shape (batch_size * t, c, h, w)

  • prompts (Tensor) – A tensor of prompt embeddings with shape (n, 512)

  • batch_size (int) – The size of the original image batch

  • match (str) – Policy for handling multiple prompts: “any”, “all” or (in future) “one”

Returns

A tensor of average similarities with shape (batch_size,)

Return type

Tensor
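A hypothetical sketch of the averaging step: cosine similarities between normalised view and prompt embeddings are reshaped so the t views of each original image are averaged back to one score per image. The exact “any”/“all” semantics shown here are assumptions, not the class's verified behaviour:

```python
import torch

# Hypothetical implementation sketch of the averaging step.
def average_similarity(view_emb, prompt_emb, batch_size, match="all"):
    v = torch.nn.functional.normalize(view_emb, dim=-1)    # (batch_size * t, d)
    p = torch.nn.functional.normalize(prompt_emb, dim=-1)  # (n, d)
    sim = v @ p.T                                          # (batch_size * t, n)
    sim = sim.reshape(batch_size, -1, sim.shape[-1])       # (batch_size, t, n)
    if match == "all":
        return sim.mean(dim=(1, 2))                # average over views and prompts
    elif match == "any":
        return sim.max(dim=2).values.mean(dim=1)   # best prompt per view
    raise ValueError(match)

emb = torch.rand(6, 512)       # 2 images x 3 views
prompts = torch.rand(4, 512)   # 4 prompt embeddings
out = average_similarity(emb, prompts, batch_size=2)
print(out.shape)  # torch.Size([2])
```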

forward(img)[source]

Returns a similarity in (0, 1) for each image in the provided batch.

TODO:

- Enable micro/macro weighting beyond what we get naturally from chops
- Add some kind of masking
Parameters

img (Tensor) – A (b, c, h, w) image tensor

Returns

A vector of size b

Return type

Tensor
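A typical use of forward is as a loss for gradient ascent on the image itself. A toy loop in which a fixed random target embedding and a linear projection stand in for the CLIP perceptor (illustrative only):

```python
import torch

# Sketch: maximise a per-image similarity by optimising the image pixels.
# A fixed random target and linear "encoder" stand in for the perceptor.
torch.manual_seed(0)
target = torch.nn.functional.normalize(torch.rand(1, 512), dim=-1)
proj = torch.nn.Linear(3 * 32 * 32, 512)  # stand-in encoder

def forward(img):
    emb = torch.nn.functional.normalize(proj(img.flatten(1)), dim=-1)
    return (emb * target).sum(dim=1)      # one similarity per image

img = torch.rand(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
before = forward(img).item()
for _ in range(50):
    opt.zero_grad()
    loss = -forward(img).sum()  # negate: maximise similarity
    loss.backward()
    opt.step()
after = forward(img).item()
print(before, "->", after)  # similarity increases
```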