

sdc-dot-cvp-ingest

US Department of Transportation (USDOT) Intelligent Transportation Systems Secure Data Commons (ITS SDC). Connected Vehicle Pilots (CVP) tools to support data ingest into the Data Lake.

The Secure Data Commons (SDC) is a cloud-based analytics platform that gives traffic engineers, researchers, and data scientists access to various transportation-related datasets. The SDC platform is a prototype created as part of a U.S. Department of Transportation (USDOT) research project. The objective of this prototype is to provide a secure platform that enables USDOT and the broader transportation sector to share and collaborate on research, tools, algorithms, analysis, and more around sensitive datasets using modern, commercially available tools, without the need to install tools or software locally. The SDC enables collaborative but controlled integration and analysis of research data at the moderate sensitivity level (PII and CBI).

The SDC platform allows users to conduct analyses and to develop and test new tools and software products. It is not intended to be an alternative to any local jurisdiction's traffic management center or local data repository. The existing SDC provides users with the following data, tools, and features:

  • Data: The SDC currently ingests several datasets. Additional datasets will be added to the environment over time.
  • Tools: The environment provides access to open-source and commercial tools, including Python, RStudio, Microsoft R, SQL Workbench, Power BI, Jupyter Notebook, and others. These tools are available on a virtual machine in the system, enabling data analytics in the cloud.
  • Functionality: Users can access and analyze data within the environment, save their work to a virtual machine, and publish processes and results to share with others.

The SDC platform supports two major roles:

  • Data Providers: These are entities that provide data hosted on the SDC platform. The data provider establishes the data protection needs and acceptable use terms for the data analysts.
  • Data Analysts: These are entities that conduct analysis of the datasets hosted in the SDC system. Note that analysts can bring their own data and tools into the SDC system.

Table of Contents

I. Release Notes

II. Usage Example

III. Configuration

IV. Installation

V. Design and Architecture

VI. Unit Tests

VII. File Manifest

VIII. Development Setup

IX. Release History

X. Contact Information

XI. Contributing

XII. Known Bugs

XIII. Credits and Acknowledgment

XIV. CODE.GOV Registration Info

August 13, 2020. SDC sdc-cvp-ingest Release 1.0

  • Import/reconcile additional manually created resources with Terraform
  • Configuration for Kinesis Firehose Delivery Streams is uniform
  • Update tags and resource descriptions to match naming conventions

August 7, 2020. SDC sdc-cvp-ingest Release 1.0

  • Import/reconcile manually created resources with Terraform
  • Configuration for Lambdas is uniform
  • Update tags to match the proper team

The following diagram represents a high-level overview of the SDC Platform:

SDC System Diagram

Looking from the bottom up, the ITS ODE service performs near-real-time data ingest via Kinesis Firehose, while data ingest through the S3 ingest buckets is done either with automated scripts or manually.

AWS S3 Data Ingest Repository

There are two methods of ingesting datasets into the SDC: near-real-time ingest through a Kinesis Firehose endpoint, and data ingest through an S3 ingest bucket.

For a Kinesis Firehose ingest, data files are copied directly into the Data Lake's S3-based message repository according to the Firehose configuration. For an S3 ingest, data files are uploaded into an S3 ingest bucket and moved into the Data Lake by a Lambda function.
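
For illustration only, a data provider pushing records to the Firehose endpoint might use a boto3 call along these lines; the delivery stream name shown here is a hypothetical placeholder, not the actual SDC stream name:

# Minimal sketch of the near-real-time ingest path. Assumes AWS credentials with
# permission to write to the delivery stream are already configured, and
# "cvp-ingest-stream" is a hypothetical name used only for illustration.
import json
import boto3

firehose = boto3.client("firehose")

record = {"device_id": "obu-001", "timestamp": "2020-08-13T12:00:00Z", "speed": 42.0}

firehose.put_record(
    DeliveryStreamName="cvp-ingest-stream",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)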

This repository contains the Lambda function implementation for the S3 data ingest flow, as well as unit tests and corresponding scripts to exercise this function.
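
The S3 ingest flow can be pictured as an S3-triggered Lambda that copies each uploaded file into the Data Lake. The sketch below illustrates that pattern only; it is not the handler implemented in this repository, and the DATA_LAKE_BUCKET environment variable is a hypothetical placeholder:

# Rough sketch of an S3-triggered ingest Lambda; not the actual implementation
# in this repository. DATA_LAKE_BUCKET is a hypothetical environment variable.
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    data_lake_bucket = os.environ["DATA_LAKE_BUCKET"]
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the uploaded file into the Data Lake, then remove it from the ingest bucket.
        s3.copy_object(
            Bucket=data_lake_bucket,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
        s3.delete_object(Bucket=source_bucket, Key=key)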

Installing Python on build box

You can run the following on the ECS build box to install a different version of Python (e.g., 3.7.9):

# assuming you are SSHed as ec2-user
sudo su
cd ~/

# make sure libffi is installed
yum install libffi-devel

# install python 3.7
curl -O https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz
tar -xzf Python-3.7.9.tgz
cd Python-3.7.9
./configure --enable-optimizations
make altinstall

# python3.7 can now be invoked directly

For any queries, you can reach out to [email protected]

Thank you to the Department of Transportation for funding the development of this project.

Agency: DOT

Short Description: US Department of Transportation (USDOT) Intelligent Transportation Systems Secure Data Commons (ITS SDC). Connected Vehicle Pilots (CVP) tools to support data ingest into the Data Lake.

Status: Beta

Tags: transportation, connected vehicles, intelligent transportation systems

Labor Hours:

Contact Name: [email protected]
