Kelvin Source

Unified Data Access

Legal service providers and their clients rely on a wide variety of content and file sharing systems to source material. On one hand, clients prefer "cloud" providers like Dropbox, Box, Google Drive, or OneDrive to share material. On the other hand, law firms and corporate legal departments have historically been slow to adopt "cloud", relying on local and networked file systems, SharePoint, or email.

Working across these different paradigms and systems often creates significant friction for data-driven tasks.

Transactional and litigation support groups often spend substantial time manually downloading files from some of these systems and uploading them to others. The process is slow and error-prone, and it often leads to late nights, missed deadlines, and lost documents.

Kelvin Source is a data driver for the Kelvin Legal Data OS that provides a unified interface to these systems, allowing users to access content and files from any of these systems through a single interface. Users can build higher-level workflows on top of this abstraction, allowing for faster deliveries and greater re-use internally.

Kelvin Source makes it simple and seamless to work with large sets of files across local, internal network, and cloud systems.

Common Use Cases

Organizations often use Kelvin to accelerate common ad-hoc tasks in transactions, e-Discovery, and investigations, such as:

  • Deduplication - Identify and remove duplicate files using fuzzy hashes, metadata, or other criteria
  • Identification - Identify unknown files using heuristics and machine learning models
  • Organization - Organize files into folders based on metadata, file type, or other criteria
  • Timelines - Create folder-based timelines of files based on metadata or other criteria
  • Deal Rooms - Quickly retrieve and archive deal room content from all major providers
  • Reusable Data Science - Write reusable workflows to automate common tasks across multiple systems
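
As a rough illustration of the deduplication use case, the sketch below groups files by an exact SHA-256 content hash using only the Python standard library. It is not Kelvin's implementation: it swaps in exact hashing for the fuzzy hashing mentioned above, and the function names (`file_digest`, `find_duplicates`, `dry_run_report`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by digest and keep groups with more than one file."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def dry_run_report(duplicates: dict[str, list[Path]]) -> list[str]:
    """Describe what would be kept and deleted, without touching any file."""
    lines = []
    for paths in duplicates.values():
        keep, *extras = paths
        lines.append(f"Keeping {keep}")
        lines.extend(f"Would delete {extra}" for extra in extras)
    return lines
```

Exact hashing only catches byte-identical copies. Near-duplicates, such as a re-saved or re-uploaded PDF, need a similarity-preserving hash instead.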

Features and Functionality

Kelvin Source currently supports the following types of data sources:

  • Memory Sources - In-Memory and Memory-Mapped Data
  • Filesystem Sources - File and Directory Sources
  • Archive Sources - ZIP and TAR Containers
  • Third-Party Sources - Common Local and Internet Sources
    • HTTP(S)
    • FTP
    • SFTP
    • WebDAV
    • Samba/CIFS
  • Third-Party Sources - Common Sharing Sources
    • Box.com
    • Dropbox
    • Google Drive
    • OneDrive
    • SharePoint

Status

Status: Early Access Preview

Kelvin Source is currently available through an Early Access Release program. Current and future functionality is summarized in the table below:

Source | Status | Authentication
In-Memory and Memory-Mapped | ✅ Done | N/A
Local Filesystem | ✅ Done | N/A
Archive/Containers (ZIP, TAR, RAR, etc.) | ✅ Done | N/A
Network Filesystems (CIFS/SMB, NFS) | ✅ Done | User Account, Service Account
Network Services (FTP, SFTP, HTTP(S)) | ✅ Done | Basic Auth, Public Key, Service Account
OneDrive (Office 365) | ✅ Done | OAuth 2.0, Service Account
SharePoint (Office 365) | ✅ Done | OAuth 2.0, Service Account
SharePoint (WebDAV) | ⚙️ In Progress | User Account, Service Account
Google Drive | ✅ Done | OAuth 2.0, Service Account
Dropbox | ✅ Done | OAuth 2.0
Box | ✅ Done | OAuth 2.0
Email (IMAP/POP3, Office 365, Exchange) | ⚙️ In Progress | User Account, Service Account, OAuth 2.0

Demos and Vignettes

The following Kelvin demos and vignettes demonstrate how Kelvin Source can be used:

Examples

Seamless Source Objects

Every Kelvin Source data source can be iterated in the same way, so the same code can process files regardless of where they are stored. Higher-level workflows built on this abstraction can be reused across local, network, and cloud systems.

The following example demonstrates how to use the Kelvin Source API to process files from common sources.

Example Source Usage

Source:

# Kelvin imports
from kelvin.source.source_object import SourceObject
from kelvin.source.filesystem_source import FilesystemSource
from kelvin.source.sftp_source import SftpSource
from kelvin.source.box_source import BoxSource

# Create source object
source_object = SourceObject(
    name="Acme_Agreement_001.pdf",
    data=b"%PDF-1.4...",
    metadata={"myproperty": "myvalue"}
)

# Create filesystem source
filesystem_source = FilesystemSource(
    "Downloads\\Acme_Dealroom_Final\\", recursive=True
)

# Create SFTP source
sftp_source = SftpSource(
    "sftp://wiley:coyote@share.acme.com/dealroom/tnt/",
    path_filter="*.pdf",
    recursive=True
)

# Create Box.com source
box_source = BoxSource("https://acme-corp.app.box.com/folder/123412341234")

# Iterate over all sources with identical code
for source in [source_object, filesystem_source, sftp_source, box_source]:
    # Iterate over files in source
    for source_file in source:
        print(f"File: {source_file}")
        # ... do something with source file ...
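
To make the idea of a shared iteration interface concrete, here is a minimal standard-library sketch of how two very different backends (an in-memory mapping and a ZIP archive) might be exposed through one protocol. None of these names come from Kelvin: `FileRecord`, `Source`, `MemorySource`, `ZipSource`, `build_zip`, and `list_names` are hypothetical and only illustrate the pattern.

```python
import io
import zipfile
from dataclasses import dataclass, field
from typing import Iterator, Protocol


@dataclass
class FileRecord:
    """Hypothetical stand-in for a source file: name, raw bytes, metadata."""
    name: str
    data: bytes
    metadata: dict = field(default_factory=dict)


class Source(Protocol):
    """Anything that yields FileRecord objects when iterated."""
    def __iter__(self) -> Iterator[FileRecord]: ...


class MemorySource:
    """Serves files from a dict of name -> bytes."""
    def __init__(self, files: dict[str, bytes]) -> None:
        self._files = files

    def __iter__(self) -> Iterator[FileRecord]:
        for name, data in self._files.items():
            yield FileRecord(name, data)


class ZipSource:
    """Serves the regular files stored inside a ZIP archive held in memory."""
    def __init__(self, archive_bytes: bytes) -> None:
        self._archive_bytes = archive_bytes

    def __iter__(self) -> Iterator[FileRecord]:
        with zipfile.ZipFile(io.BytesIO(self._archive_bytes)) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    yield FileRecord(
                        info.filename, archive.read(info), {"size": info.file_size}
                    )


def build_zip(files: dict[str, bytes]) -> bytes:
    """Helper for the example: build a small ZIP archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def list_names(sources: list[Source]) -> list[str]:
    """The same loop works for every backend that follows the protocol."""
    return [record.name for source in sources for record in source]
```

Because callers depend only on iteration, a new backend can be added without changing any workflow code written against the protocol.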

Deduplicate and organize files in a folder

# use CTPH fuzzy hashing to deduplicate files
$ python3 -m kelvin.source.commands.deduplicate_path \
    --dry-run \
    --fuzzy \
    --num-threads 8 Investigation_Evidence_Folder/

Duplicate files for hash 9ac8d8ac361e33569aafeaf5ab9731b48731a191696c65313e1660cdc320c8d2096c7b01c8f387d3f983958ee52e72bdf2b11c9036708e6379ddacbb199f573e8e26e2f1f2917732e52796e08701c85530fd4caba4f4:
                Keeping Investigation_Evidence_Folder/Folder1/Document1.pdf
                Would delete Investigation_Evidence_Folder/Folder2/Doc1-Web-Upload.PDF

# organize files and directories based on file type, date, name, size, etc.
$ python3 -m kelvin.source.commands.organize_files \
    --dry-run \
    --fuzzy \
    --num-threads 8 Investigation_Evidence_Folder/

INFO:__main__:Would copy Investigation_Evidence_Folder/Folder1/Document1.pdf to Organized_Evidence/PDF/D/Document1.pdf
INFO:__main__:Would copy Investigation_Evidence_Folder/Folder3/SubfolderA/Doc2.pdf to Organized_Evidence/PDF/D/Doc2.pdf
...
INFO:__main__:Organized 85 files into 9 directories.
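
The destination paths in the sample output above appear to follow an `<EXTENSION>/<first letter>/<file name>` layout. The sketch below reproduces that layout as an assumption inferred from the sample only; `plan_destination` and `dry_run` are hypothetical helpers and not part of the Kelvin command.

```python
from pathlib import PurePosixPath


def plan_destination(source_path: str, output_root: str = "Organized_Evidence") -> str:
    """Map a source path to <output_root>/<EXTENSION>/<INITIAL>/<name>."""
    path = PurePosixPath(source_path)
    # Upper-case the extension so "a.pdf" and "a.PDF" share one folder.
    extension = path.suffix.lstrip(".").upper() or "NO_EXTENSION"
    initial = path.name[0].upper()
    return str(PurePosixPath(output_root) / extension / initial / path.name)


def dry_run(source_paths: list[str]) -> list[str]:
    """Describe the planned copies without touching the filesystem."""
    return [f"Would copy {path} to {plan_destination(path)}" for path in source_paths]
```

Two source files with the same name map to the same destination under this scheme, so a real implementation would also need a collision policy such as renaming or skipping.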

Box.com Dealroom

Source:

# Kelvin imports
from kelvin.data.sources.box_source import BoxSource

# Setup source (OAuth 2.0 authentication flow by default)
box_source = BoxSource("https://acme-corp.app.box.com/folder/123412341234")

# Iterate over files in folder (recursively by default)
for box_file in box_source:
    print(f"File: {box_file}")
    print(f"Metadata: {box_file.metadata}")
    print(f"Data: {box_file.data[0:4]}...")

Output:

File: SourceObject(
  name=Acme_Agreement_001.pdf,
  size=77966,
  hash=767e80e8...
)

Metadata:
{'id': '12341234',
 'name': 'Acme_Agreement_001.pdf',
 'description': '...',
 'size': 77966,
 'created_at': '2022-12-10T08:58:14-08:00',
 'modified_at': '2022-12-10T08:58:14-08:00'
 }


Data: b'%PDF'...
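
Metadata like the `created_at` value shown above is what a folder-based timeline can be built from. The sketch below assumes each metadata dict carries a `name` and an ISO 8601 `created_at` string with a UTC offset, as in the sample output; `timeline_folder` and `build_timeline` are hypothetical helpers, not Kelvin functions.

```python
from datetime import datetime


def timeline_folder(metadata: dict, key: str = "created_at") -> str:
    """Return a YYYY/MM/DD folder for the timestamp stored under key.

    The date is taken in the timestamp's own UTC offset, not converted to UTC.
    """
    timestamp = datetime.fromisoformat(metadata[key])
    return f"{timestamp.year:04d}/{timestamp.month:02d}/{timestamp.day:02d}"


def build_timeline(records: list[dict]) -> dict[str, list[str]]:
    """Group file names into date folders, in chronological order."""
    ordered = sorted(records, key=lambda m: datetime.fromisoformat(m["created_at"]))
    timeline: dict[str, list[str]] = {}
    for metadata in ordered:
        timeline.setdefault(timeline_folder(metadata), []).append(metadata["name"])
    return timeline
```

Sorting compares timezone-aware timestamps, so files created in different offsets are still ordered correctly. Mixing timestamps with and without offsets would raise a `TypeError` and needs to be normalized first.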

Filter Word docs on OneDrive

Source:

# Kelvin imports
from kelvin.data.sources.onedrive_source import (
  OneDriveSource,
  CachingAzureCredential,
)

# Setup credential
# * OAuth 2.0
# * Service account
# * Any valid Azure credential mechanism
azure_credential = CachingAzureCredential(
  credential_data="...",
  token_data="...",
  credential_path="...",
)

# Setup source
onedrive_source = OneDriveSource(
  "onedrive:///01234ABCD01234ABCD01234ABCD",
  path_filter="*.doc*",
  credential=azure_credential
)

# Iterate over files in folder (recursively by default)
for onedrive_file in onedrive_source:
    print(f"File: {onedrive_file}")
    # Use Kelvin Convert and Kelvin NLP to analyze

Output:

File: SourceObject(
  name=Clean_Screen_Policy_v1.3.docx,
  size=24680,
  hash=123edc1a...
)
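
The `path_filter` values used in these examples (`"*.pdf"`, `"*.doc*"`) look like shell-style glob patterns. Assuming that is the intended semantics, which this page does not confirm, the sketch below shows how such a pattern behaves using Python's standard `fnmatch` module. `matches_filter` is a hypothetical helper, and the case-insensitive comparison is a design choice of this sketch, not documented Kelvin behavior.

```python
from fnmatch import fnmatchcase


def matches_filter(name: str, pattern: str) -> bool:
    """Case-insensitive shell-style match of a file name against a pattern."""
    return fnmatchcase(name.lower(), pattern.lower())


def filter_names(names: list[str], pattern: str) -> list[str]:
    """Keep only the names that match the pattern, preserving order."""
    return [name for name in names if matches_filter(name, pattern)]
```

Under glob semantics, `"*.doc*"` matches both `.doc` and `.docx` files, but it also matches any name that merely contains `.doc`, such as `archive.docs.zip`. A stricter filter would check the final extension explicitly.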