Kelvin Source
Unified Data Access
Legal service providers and their clients rely on a wide variety of content and file sharing systems to source material. On one hand, clients prefer "cloud" providers like Dropbox, Box, Google Drive, or OneDrive to share material. On the other hand, law firms and corporate legal departments have historically been slow to adopt "cloud", relying on local and networked file systems, SharePoint, or email.
Working across these different paradigms and systems often creates significant friction for data-driven tasks.
Transactional and litigation support groups must often dedicate substantial time under frustrating circumstances to manually download and upload files to and from these systems. This is a time-consuming and error-prone process that often leads to late nights, missed deadlines, and lost documents.
Kelvin Source is a data driver for the Kelvin Legal Data OS that provides a unified interface to these systems, allowing users to access content and files from any of these systems through a single interface. Users can build higher-level workflows on top of this abstraction, allowing for faster deliveries and greater re-use internally.
Kelvin Source makes it simple and seamless to work with large sets of files across local, internal network, and cloud systems.
Common Use Cases
Organizations often use Kelvin to accelerate common ad-hoc tasks in transactions, e-Discovery, and investigations, such as:
- Deduplication - Identify and remove duplicate files using fuzzy hashes, metadata, or other criteria
- Identification - Identify unknown files using heuristics and machine learning models
- Organization - Organize files into folders based on metadata, file type, or other criteria
- Timelines - Create folder-based timelines of files based on metadata or other criteria
- Deal Rooms - Quickly retrieve and archive deal room content from all major providers
- Reusable Data Science - Write reusable workflows to automate common tasks across multiple systems
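To make the deduplication use case concrete, here is a minimal sketch that groups files by an exact SHA-256 content hash. This is a deliberately simpler technique than the fuzzy hashing mentioned above, and it does not use the Kelvin API: the `find_exact_duplicates` function and the `sample_files` data are hypothetical, introduced only for illustration.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by the SHA-256 digest of their content.

    Only digests shared by two or more files are returned.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}


# Hypothetical in-memory sample data (not produced by Kelvin).
sample_files = {
    "Folder1/Document1.pdf": b"%PDF-1.4 same bytes",
    "Folder2/Doc1-Web-Upload.PDF": b"%PDF-1.4 same bytes",
    "Folder3/Doc2.pdf": b"%PDF-1.4 different bytes",
}

for digest, names in find_exact_duplicates(sample_files).items():
    # Keep the first name in sorted order; report the rest as duplicates.
    keep, *extras = sorted(names)
    print(f"Keeping {keep}; duplicates: {extras}")
```

Exact hashing only catches byte-identical copies. Near-duplicates, such as a re-saved or re-uploaded PDF, need a similarity-preserving hash instead.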
Features and Functionality
Kelvin Source currently supports the following types of data sources:
- Memory Sources - In-Memory and Memory-Mapped Data
- Filesystem Sources - File and Directory Sources
- Archive Sources - ZIP and TAR Containers
- Third-Party Sources - Common Local and Internet Sources
  - HTTP(S)
  - FTP
  - SFTP
  - WebDAV
  - Samba/CIFS
- Third-Party Sources - Common Sharing Sources
  - Box.com
  - Dropbox
  - Google Drive
  - OneDrive
  - SharePoint
Status
Status: Early Access Preview
Kelvin Source is currently available through an Early Access Release program. Current and future functionality is summarized in the table below:
Source | Status | Authentication |
---|---|---|
In-Memory and Memory-Mapped | ✅ Done | N/A |
Local Filesystem | ✅ Done | N/A |
Archive/Containers (ZIP, TAR, RAR, etc.) | ✅ Done | N/A |
Network Filesystems (CIFS/SMB, NFS) | ✅ Done | User Account, Service Account |
Network Services (FTP, SFTP, HTTP(S)) | ✅ Done | Basic Auth, Public Key, Service Account |
OneDrive (Office 365) | ✅ Done | OAuth 2.0, Service Account |
SharePoint (Office 365) | ✅ Done | OAuth 2.0, Service Account |
SharePoint (WebDAV) | ⚙️ In Progress | User Account, Service Account |
Google Drive | ✅ Done | OAuth 2.0, Service Account |
Dropbox | ✅ Done | OAuth 2.0 |
Box | ✅ Done | OAuth 2.0 |
Email (IMAP/POP3, Office 365, Exchange) | ⚙️ In Progress | User Account, Service Account, OAuth 2.0 |
Demos and Vignettes
The following Kelvin demos and vignettes demonstrate how Kelvin Source can be used:
- Synchronize a deal room from Box.com
- Identify near-duplicate files in a OneDrive folder
- Create a timeline organized by file type
- Organize Word documents by author
Examples
Seamless Source Objects
Kelvin Source provides a unified interface to all data sources, so the same code can read content and files from any supported system. Higher-level workflows built on this abstraction can be reused across sources without modification.
The following example demonstrates how to use the Kelvin Source API to process files from common sources.
Example Source Usage
Source:
# Kelvin imports
from kelvin.source.source_object import SourceObject
from kelvin.source.filesystem_source import FilesystemSource
from kelvin.source.sftp_source import SftpSource
from kelvin.source.box_source import BoxSource

# Create source object
source_object = SourceObject(
    name="Acme_Agreement_001.pdf",
    data=b"%PDF-1.4...",
    metadata={"myproperty": "myvalue"}
)

# Create filesystem source
filesystem_source = FilesystemSource(
    "Downloads\\Acme_Dealroom_Final\\", recursive=True
)

# Create SFTP source
sftp_source = SftpSource(
    "sftp://wiley:coyote@share.acme.com/dealroom/tnt/",
    path_filter="*.pdf",
    recursive=True
)

# Create Box.com source
box_source = BoxSource("https://acme-corp.app.box.com/folder/123412341234")

# Iterate over all sources with identical code
for source in [source_object, filesystem_source, sftp_source, box_source]:
    # Iterate over files in source
    for source_file in source:
        print(f"File: {source_file}")
        # ... do something with source file ...
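The value of a shared iteration interface is that downstream logic never needs to know where a file came from. The sketch below illustrates the idea with plain Python iterables. `StubFile` and `count_by_extension` are hypothetical stand-ins written for this illustration and are not Kelvin classes; the only assumption carried over from the example above is that each source yields objects with a `name`.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable


@dataclass
class StubFile:
    """Hypothetical stand-in for a source file; not a Kelvin class."""
    name: str
    data: bytes


def count_by_extension(sources: Iterable[Iterable[StubFile]]) -> Counter:
    """Tally lower-cased file extensions across any number of iterable sources."""
    counts: Counter = Counter()
    for source in sources:
        for source_file in source:
            suffix = PurePosixPath(source_file.name).suffix.lower() or "(none)"
            counts[suffix] += 1
    return counts


# Two stand-in "sources": a list and a generator are consumed the same way.
local_files = [StubFile("Acme_Agreement_001.pdf", b"%PDF"), StubFile("notes.txt", b"hi")]
remote_files = (StubFile(f"exhibit_{i}.PDF", b"%PDF") for i in range(2))

print(count_by_extension([local_files, remote_files]))
```

Because the function depends only on iteration and a `name` attribute, the same workflow could in principle be pointed at any source that exposes that shape.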
Deduplicate and organize files in a folder
# use CTPH fuzzy hashing to deduplicate files
$ python3 -m kelvin.source.commands.deduplicate_path \
    --dry-run \
    --fuzzy \
    --num-threads 8 Investigation_Evidence_Folder/
Duplicate files for hash 9ac8d8ac361e33569aafeaf5ab9731b48731a191696c65313e1660cdc320c8d2096c7b01c8f387d3f983958ee52e72bdf2b11c9036708e6379ddacbb199f573e8e26e2f1f2917732e52796e08701c85530fd4caba4f4:
    Keeping Investigation_Evidence_Folder/Folder1/Document1.pdf
    Would delete Investigation_Evidence_Folder/Folder2/Doc1-Web-Upload.PDF

# organize files and directories based on file type, date, name, size, etc.
$ python3 -m kelvin.source.commands.organize_files \
    --dry-run \
    --fuzzy \
    --num-threads 8 Investigation_Evidence_Folder/
INFO:__main__:Would copy Investigation_Evidence_Folder/Folder1/Document1.pdf to Organized_Evidence/PDF/D/Document1.pdf
INFO:__main__:Would copy Investigation_Evidence_Folder/Folder3/SubfolderA/Doc2.pdf to Organized_Evidence/PDF/D/Doc2.pdf
...
INFO:__main__:Organized 85 files into 9 directories.
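The two "Would copy" lines above suggest a destination layout of file type, then first letter, then file name. The following sketch reproduces that layout under that assumption, which is inferred from the sample output only and is not a confirmed description of how `organize_files` works. The `plan_destination` helper and the `NO_EXTENSION` fallback folder are hypothetical.

```python
from pathlib import PurePosixPath


def plan_destination(path: str, root: str = "Organized_Evidence") -> PurePosixPath:
    """Return <root>/<EXTENSION>/<FIRST LETTER>/<file name> for a source path."""
    source = PurePosixPath(path)
    # Upper-case the extension without its dot; fall back when there is none.
    extension = source.suffix.lstrip(".").upper() or "NO_EXTENSION"
    first_letter = source.name[0].upper()
    return PurePosixPath(root) / extension / first_letter / source.name


for original in [
    "Investigation_Evidence_Folder/Folder1/Document1.pdf",
    "Investigation_Evidence_Folder/Folder3/SubfolderA/Doc2.pdf",
]:
    # Dry run: report the planned copy without touching the filesystem.
    print(f"Would copy {original} to {plan_destination(original)}")
```

Computing the plan separately from performing the copy is what makes a dry-run mode cheap to support.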
Box.com Dealroom
Source:
# Kelvin imports
from kelvin.data.sources.box_source import BoxSource
# Setup source (OAuth 2.0 authentication flow by default)
box_source = BoxSource("https://acme-corp.app.box.com/folder/123412341234")
# Iterate over files in folder (recursively by default)
for box_file in box_source:
    print(f"File: {box_file}")
    print(f"Metadata: {box_file.metadata}")
    print(f"Data: {box_file.data[0:4]}...")
Output:
File: SourceObject(
    name=Acme_Agreement_001.pdf,
    size=77966,
    hash=767e80e8...
)
Metadata:
{'id': '12341234',
 'name': 'Acme_Agreement_001.pdf',
 'description': '...',
 'size': 77966,
 'created_at': '2022-12-10T08:58:14-08:00',
 'modified_at': '2022-12-10T08:58:14-08:00'
}
Data: b'%PDF'...
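Metadata like the `created_at` field above is enough to drive a folder-based timeline. As a rough sketch, the hypothetical `timeline_path` helper below turns an ISO 8601 timestamp into a year/month/day folder path using only the Python standard library. The `Timeline` root folder and the folder layout are assumptions made for illustration, not Kelvin behavior.

```python
from datetime import datetime
from pathlib import PurePosixPath


def timeline_path(metadata: dict, root: str = "Timeline") -> PurePosixPath:
    """Build <root>/<YYYY>/<MM>/<DD>/<name> from ISO 8601 'created_at' metadata."""
    # The date is taken in the timestamp's own UTC offset, not converted to UTC.
    created = datetime.fromisoformat(metadata["created_at"])
    return (
        PurePosixPath(root)
        / f"{created:%Y}"
        / f"{created:%m}"
        / f"{created:%d}"
        / metadata["name"]
    )


# Values copied from the sample metadata shown above.
sample_metadata = {
    "name": "Acme_Agreement_001.pdf",
    "created_at": "2022-12-10T08:58:14-08:00",
}
print(timeline_path(sample_metadata))
```

If files come from several time zones, normalizing to UTC before bucketing would keep the folders consistent.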
Filter Word docs on OneDrive
Source:
# Kelvin imports
from kelvin.data.sources.onedrive_source import (
    OneDriveSource,
    CachingAzureCredential,
)

# Setup credential
# * OAuth 2.0
# * Service account
# * Any valid Azure credential mechanism
azure_credential = CachingAzureCredential(
    credential_data="...",
    token_data="...",
    credential_path="...",
)

# Setup source
onedrive_source = OneDriveSource(
    "onedrive:///01234ABCD01234ABCD01234ABCD",
    path_filter="*.doc*",
    credential=azure_credential
)

# Iterate over files in folder (recursively by default)
for onedrive_file in onedrive_source:
    print(f"File: {onedrive_file}")
    # Use Kelvin Convert and Kelvin NLP to analyze
Output:
File: SourceObject(
    name=Clean_Screen_Policy_v1.3.docx,
    size=24680,
    hash=123edc1a...
)
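To see what a pattern like `*.doc*` selects, the sketch below applies shell-style glob matching from the Python standard library to a few hypothetical file names. It assumes that `path_filter` follows similar glob semantics, which this document does not confirm.

```python
from fnmatch import fnmatchcase

# Hypothetical file names for illustration only.
names = [
    "Clean_Screen_Policy_v1.3.docx",
    "Legacy_Policy.doc",
    "Policy_Template.dotx",
    "Policy_Scan.pdf",
    "Policy.docm",
]

# fnmatchcase compares case-sensitively on every platform.
matches = [name for name in names if fnmatchcase(name, "*.doc*")]
print(matches)
```

Under these semantics the pattern matches `.doc`, `.docx`, and `.docm` files, but it would also match a name such as `notes.document.txt`, so a stricter pattern may be preferable when precision matters.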