Kelvin Source

Unified Data Access

Legal service providers and their clients rely on a wide variety of content and file sharing systems to source material. On one hand, clients prefer "cloud" providers like Dropbox, Box, Google Drive, or OneDrive to share material. On the other hand, law firms and corporate legal departments have historically been slow to adopt "cloud", relying on local and networked file systems, SharePoint, or email.

Working across these different paradigms and systems often creates significant friction for data-driven tasks.

Transactional and litigation support groups must often dedicate substantial time under frustrating circumstances to manually download and upload files to and from these systems. This is a time-consuming and error-prone process that often leads to late nights, missed deadlines, and lost documents.

Kelvin Source is a data driver for the Kelvin Legal Data OS that provides a unified interface to these systems, so users can access content and files from any of them in the same way. Users can build higher-level workflows on top of this abstraction, allowing for faster delivery and greater internal re-use.

Kelvin Source makes it simple and seamless to work with large sets of files across local, internal network, and cloud systems.

Common Use Cases

Organizations often use Kelvin to accelerate common ad-hoc tasks in transactions, e-Discovery, and investigations, such as:

Deduplication - Identify and remove duplicate files using fuzzy hashes, metadata, or other criteria
Identification - Identify unknown files using heuristics and machine learning models
Organization - Organize files into folders based on metadata, file type, or other criteria
Timelines - Create folder-based timelines of files based on metadata or other criteria
Deal Rooms - Quickly retrieve and archive deal room content from all major providers
Reusable Data Science - Write reusable workflows to automate common tasks across multiple systems
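
As a rough illustration of the deduplication use case above, the sketch below groups files by SHA-256 content hash using only the Python standard library. It is not Kelvin's implementation: it finds exact duplicates only, whereas the fuzzy hashes mentioned above can also catch near-duplicates. The function name `find_exact_duplicates` and the demo file names are hypothetical.

```python
from __future__ import annotations

import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by SHA-256 digest; keep groups with 2+ files."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


# Demonstration on a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Folder1").mkdir()
    (root / "Folder2").mkdir()
    (root / "Folder1" / "Document1.pdf").write_bytes(b"%PDF-1.4 same bytes")
    (root / "Folder2" / "Doc1-Web-Upload.PDF").write_bytes(b"%PDF-1.4 same bytes")
    (root / "Folder2" / "Other.pdf").write_bytes(b"%PDF-1.4 different bytes")

    duplicates = find_exact_duplicates(root)
    for digest, paths in duplicates.items():
        # Keep the first path in sorted order; report the rest (dry run only).
        keep, *extras = paths
        print(f"Keeping {keep.relative_to(root)}")
        for extra in extras:
            print(f"Would delete {extra.relative_to(root)}")
```

Reading whole files into memory keeps the sketch short; for large evidence sets, hashing in fixed-size chunks would be the safer choice.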

Features and Functionality

Kelvin Source currently supports the following types of data sources:

Memory Sources - In-Memory and Memory-Mapped Data
Filesystem Sources - File and Directory Sources
Archive Sources - ZIP and TAR Containers
Third-Party Sources - Common Local and Internet Sources
- HTTP(S)
- FTP
- SFTP
- WebDAV
- Samba/CIFS
Third-Party Sources - Common Sharing Sources
- Box.com
- Dropbox
- Google Drive
- OneDrive
- SharePoint
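
To make the idea of interchangeable source types concrete, here is a minimal sketch, using only the Python standard library, of three sources (in-memory, directory, ZIP archive) that all yield the same `(name, data)` pairs. The function names are hypothetical and this is not the Kelvin Source API; it only illustrates why a shared iteration contract lets downstream code ignore where files come from.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_memory(items: Dict[str, bytes]) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, data) pairs from an in-memory mapping."""
    for name, data in items.items():
        yield name, data


def iter_directory(root: Path) -> Iterator[Tuple[str, bytes]]:
    """Yield (relative path, data) for every file under root, recursively."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path.read_bytes()


def iter_zip(archive: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, data) for every file inside a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield info.filename, zf.read(info)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.txt").write_bytes(b"local file")

    # Build a small ZIP archive in memory.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("exhibit_a.txt", b"zipped file")

    sources = [
        iter_memory({"memo.txt": b"in-memory file"}),
        iter_directory(root),
        iter_zip(buffer.getvalue()),
    ]

    # The processing loop is identical for every source type.
    seen = []
    for source in sources:
        for name, data in source:
            seen.append((name, len(data)))
            print(f"File: {name} ({len(data)} bytes)")
```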

Status

Status: Early Access Preview

Kelvin Source is currently available through an Early Access Release program. Current and planned functionality is summarized in the table below:

Source	Status	Authentication
In-Memory and Memory-Mapped	✅ Done	N/A
Local Filesystem	✅ Done	N/A
Archive/Containers (ZIP, TAR, RAR, etc.)	✅ Done	N/A
Network Filesystems (CIFS/SMB, NFS)	✅ Done	User Account, Service Account
Network Services (FTP, SFTP, HTTP(S))	✅ Done	Basic Auth, Public Key, Service Account
OneDrive (Office 365)	✅ Done	OAuth 2.0, Service Account
SharePoint (Office 365)	✅ Done	OAuth 2.0, Service Account
SharePoint (WebDAV)	⚙️ In Progress	User Account, Service Account
Google Drive	✅ Done	OAuth 2.0, Service Account
Dropbox	✅ Done	OAuth 2.0
Box	✅ Done	OAuth 2.0
Email (IMAP/POP3, Office 365, Exchange)	⚙️ In Progress	User Account, Service Account, OAuth 2.0

Demos and Vignettes

The following Kelvin demos and vignettes demonstrate how Kelvin Source can be used:

Synchronize a deal room from Box.com
Identify near-duplicate files in a OneDrive folder
Create a timeline organized by file type
Organize Word documents by author

Examples

Seamless Source Objects

Every Kelvin Source data source exposes the same interface, so the code that processes files does not need to change when the underlying system does. Higher-level workflows built on this abstraction can be reused across local, network, and cloud sources.

The following example demonstrates how to use the Kelvin Source API to process files from common sources.

Example Source Usage

Source:

# Kelvin imports
from kelvin.source.source_object import SourceObject
from kelvin.source.filesystem_source import FilesystemSource
from kelvin.source.sftp_source import SftpSource
from kelvin.source.box_source import BoxSource

# Create source object
source_object = SourceObject(
    name="Acme_Agreement_001.pdf",
    data=b"%PDF-1.4...",
    metadata={"myproperty": "myvalue"}
)

# Create filesystem source
filesystem_source = FilesystemSource(
    "Downloads\\Acme_Dealroom_Final\\", recursive=True
)

# Create SFTP source
sftp_source = SftpSource(
    "sftp://wiley:coyote@share.acme.com/dealroom/tnt/",
    path_filter="*.pdf",
    recursive=True
)

# Create Box.com source
box_source = BoxSource("https://acme-corp.app.box.com/folder/123412341234")

# Iterate over all sources with identical code
for source in [source_object, filesystem_source, sftp_source, box_source]:
    # Iterate over files in source
    for source_file in source:
        print(f"File: {source_file}")
        # ... do something with source file ...

Deduplicate and organize files in a folder

# use CTPH fuzzy hashing to deduplicate files
$ python3 -m kelvin.source.commands.deduplicate_path \
    --dry-run \
    --fuzzy \
    --num-threads 8 Investigation_Evidence_Folder/

Duplicate files for hash 9ac8d8ac361e33569aafeaf5ab9731b48731a191696c65313e1660cdc320c8d2096c7b01c8f387d3f983958ee52e72bdf2b11c9036708e6379ddacbb199f573e8e26e2f1f2917732e52796e08701c85530fd4caba4f4:
                Keeping Investigation_Evidence_Folder/Folder1/Document1.pdf
                Would delete Investigation_Evidence_Folder/Folder2/Doc1-Web-Upload.PDF

# organize files and directories based on file type, date, name, size, etc.
$ python3 -m kelvin.source.commands.organize_files \
    --dry-run \
    --fuzzy \
    --num-threads 8 Investigation_Evidence_Folder/

INFO:__main__:Would copy Investigation_Evidence_Folder/Folder1/Document1.pdf to Organized_Evidence/PDF/D/Document1.pdf
INFO:__main__:Would copy Investigation_Evidence_Folder/Folder3/SubfolderA/Doc2.pdf to Organized_Evidence/PDF/D/Doc2.pdf
...
INFO:__main__:Organized 85 files into 9 directories.
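
The dry-run output above suggests a target layout of file type, then first letter, then file name. The sketch below reproduces that mapping with the standard library. The layout rule is inferred from the two sample log lines only, and `plan_destination` is a hypothetical helper, not part of the Kelvin command.

```python
from pathlib import PurePosixPath


def plan_destination(source: str, target_root: str = "Organized_Evidence") -> str:
    """Map a source path to <target_root>/<TYPE>/<FIRST LETTER>/<file name>."""
    path = PurePosixPath(source)
    # Files without an extension go into an UNKNOWN bucket (an assumption).
    file_type = path.suffix.lstrip(".").upper() or "UNKNOWN"
    first_letter = path.name[0].upper()
    return str(PurePosixPath(target_root) / file_type / first_letter / path.name)


sources = [
    "Investigation_Evidence_Folder/Folder1/Document1.pdf",
    "Investigation_Evidence_Folder/Folder3/SubfolderA/Doc2.pdf",
]
for source in sources:
    # Dry run: only report what would happen; nothing is copied.
    print(f"Would copy {source} to {plan_destination(source)}")
```

Planning destinations as a pure function, separate from the copy step, is what makes a dry-run mode cheap to offer.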

Box.com Dealroom

Source:

# Kelvin imports
from kelvin.data.sources.box_source import BoxSource

# Setup source (OAuth 2.0 authentication flow by default)
box_source = BoxSource("https://acme-corp.app.box.com/folder/123412341234")

# Iterate over files in folder (recursively by default)
for box_file in box_source:
    print(f"File: {box_file}")
    print(f"Metadata: {box_file.metadata}")
    print(f"Data: {box_file.data[0:4]}...")

Output:

File: SourceObject(
  name=Acme_Agreement_001.pdf,
  size=77966,
  hash=767e80e8...
)

Metadata:
{'id': '12341234',
 'name': 'Acme_Agreement_001.pdf',
 'description': '...',
 'size': 77966,
 'created_at': '2022-12-10T08:58:14-08:00',
 'modified_at': '2022-12-10T08:58:14-08:00'
 }


Data: b'%PDF'...
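
The leading bytes shown above (`b'%PDF'`) are often enough to identify a file whose name or extension is missing or misleading. The following is a minimal sketch of that heuristic using a small table of well-known file signatures. The `identify` function and its labels are hypothetical and are not part of Kelvin, which may rely on richer heuristics and models.

```python
# Well-known leading byte signatures ("magic numbers") and a short label.
MAGIC_SIGNATURES = [
    (b"%PDF", "pdf"),
    (b"PK\x03\x04", "zip"),         # also the container for DOCX, XLSX, PPTX
    (b"\xd0\xcf\x11\xe0", "ole2"),  # legacy DOC, XLS, PPT, MSG
    (b"\x89PNG", "png"),
]


def identify(data: bytes) -> str:
    """Return a label for the first matching signature, else 'unknown'."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return "unknown"


print(identify(b"%PDF-1.4..."))
print(identify(b"PK\x03\x04..."))
print(identify(b"plain text"))
```

With source objects like the ones printed above, the same check could be applied to the first few bytes of each file's data.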

Filter Word docs on OneDrive

Source:

# Kelvin imports
from kelvin.data.sources.onedrive_source import (
  OneDriveSource,
  CachingAzureCredential,
)

# Setup credential
# * OAuth 2.0
# * Service account
# * Any valid Azure credential mechanism
azure_credential = CachingAzureCredential(
  credential_data="...",
  token_data="...",
  credential_path="...",
)

# Setup source
onedrive_source = OneDriveSource(
  "onedrive:///01234ABCD01234ABCD01234ABCD",
  path_filter="*.doc*",
  credential=azure_credential
)

# Iterate over files in folder (recursively by default)
for onedrive_file in onedrive_source:
    print(f"File: {onedrive_file}")
    # Use Kelvin Convert and Kelvin NLP to analyze

Output:

File: SourceObject(
  name=Clean_Screen_Policy_v1.3.docx,
  size=24680,
  hash=123edc1a...
)
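
For readers unsure what a glob pattern such as `*.doc*` selects, the sketch below applies it to a list of hypothetical file names with Python's standard `fnmatch` module. Whether `path_filter` follows exactly these shell-style semantics is an assumption; the sketch only shows how such a pattern behaves under `fnmatch` rules.

```python
from fnmatch import fnmatchcase

# Hypothetical file names for illustration.
names = [
    "Clean_Screen_Policy_v1.3.docx",
    "Legacy_Memo.doc",
    "Template.docm",
    "Acme_Agreement_001.pdf",
    "Budget.xlsx",
    "Report.docs.zip",
]

pattern = "*.doc*"
# fnmatchcase is case-sensitive on every platform, so lowercase the name first.
matches = [name for name in names if fnmatchcase(name.lower(), pattern)]

for name in matches:
    print(f"Match: {name}")
```

Note that `Report.docs.zip` also matches, because `*.doc*` only requires `.doc` to appear somewhere in the name; a stricter filter would test the final extension instead.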
