---
title: LinkOrgs Online
emoji: 🤝
colorFrom: blue
colorTo: yellow
sdk: docker
pinned: false
license: mit
short_description: An online interface to LinkOrgs models.
---

## Overview

LinkOrgs Online is a user-friendly web application hosted on Hugging Face Spaces that provides an interactive interface to the [LinkOrgs](https://github.com/cjerzak/LinkOrgs-software) package. LinkOrgs is an open-source tool for linking and embedding organizational names, leveraging half a billion open-collaborated records from LinkedIn to assist in record linkage tasks.

Built with R Shiny and Docker, this space allows users to upload organizational names (via CSV or text input), generate high-dimensional embeddings using machine learning models, perform optional PCA dimensionality reduction, and download results. It's ideal for social scientists, data analysts, and researchers dealing with entity resolution for organizations (e.g., matching "MIT" to "Massachusetts Institute of Technology" across datasets).

For technical details on the underlying methods, see the [project summary](https://connorjerzak.com/linkorgs-summary/) or the paper: Libgober, B., & Jerzak, C. T. (2024). "Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records." *Political Science Research and Methods*.

## Features

- **Input Options**: Upload a CSV file or paste organization names directly (one per line).
- **Column Detection**: Automatic guessing of the organization name column in CSVs, with manual override.
- **Embedding Generation**: Uses LinkOrgs ML models (versions v1–v4) to produce embeddings (typically 1024 dimensions).
- **Progress Tracking**: Real-time updates and confirmations for large inputs (>1000 rows).
- **Output Previews**: Interactive data tables for input, embeddings, and PCA results.
- **Dimensionality Reduction**: Optional PCA to 2D or 10D, with scatter plots and CSV downloads.
- **Exports**: Download full embeddings or PCA results as CSVs, with options to include original data.
- **UI Enhancements**: Modern Bootstrap theme, tooltips, help modals, and accessibility features.
- **Limitations Handling**: Caps processing at a user-defined maximum number of rows (default 5000); warns about computation time.

## Setup and Requirements

This space runs in a Docker environment with R and Python (via reticulate) pre-configured. Key dependencies:

- **R Packages**: shiny, bslib, DT, shinyWidgets (loaded in the app).
- **Python Integration**: Uses reticulate to interface with LinkOrgs (requires a conda environment like "LinkOrgs_env").
- **LinkOrgs Package**: Must be installed in the Python environment; the app guides installation if missing.
- **Hardware Notes**: First-time use may download models (internet required). Processing large datasets can take 1–10 minutes; avoid navigating away during computation.

For local development or customization:

1. Clone the repository from Hugging Face.
2. Ensure Docker is installed.
3. Run the app locally with `shiny::runApp()` after setting up the conda environment (see code for `configure_python()` function).

Check the [Hugging Face Spaces config reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment details. To see how to use the package directly in R, check out the LinkOrgs [GitHub](https://github.com/cjerzak/LinkOrgs-software).

## Usage Guide

### Step 1: Input Data

- **CSV Upload**: Select "CSV upload", upload your file, and choose the column with organization names (auto-detected).
- **Text Paste**: Select "Text paste", enter names (one per line), or load examples.
- Set advanced options: maximum rows, whether to include original names in the output, and ML model version (default v4).

### Step 2: Generate Embeddings

- Click "Process Names".
- For inputs >1000 rows, confirm to proceed.
- Monitor progress; results appear in the "Generate embeddings" card.
- Preview the first 5 rows and download the full CSV (columns: original data + `emb_001` to `emb_N`).

### Step 3: Optional PCA

- Once embeddings are ready, click "Compute PCA".
- View a 2D scatter plot.
- Download 2D or 10D PCA CSVs (columns: `PC1`, `PC2`, ... + optional name column).

### Tips

- Embeddings are numeric vectors capturing semantic similarities between organization names, trained on LinkedIn data.
- Use v4 for best performance in most cases.
- For very large datasets, process in batches to avoid timeouts.

## Code Structure

The app is a single-file R Shiny application (`app.R`). Key sections:

- **Python Configuration**: Handles conda environments for cross-platform compatibility (e.g., Docker).
- **Helpers**: Functions for column guessing, text parsing, column renaming, and notifications.
- **UI**: Sidebar for inputs/options; main panels for previews, results, and PCA.
- **Server Logic**: Reactive data handling, embedding generation via `LinkOrgs::LinkOrgs()`, PCA with `prcomp()`, and downloads.

Full code is available in the repository. Contributions welcome via pull requests!

## Underlying Methodology

LinkOrgs uses LinkedIn's vast network for organizational record linkage:

- **ML Approach**: Treats LinkedIn as a training set for name embeddings (e.g., 1024D vectors).
- **Graph-Theoretic Methods**: Models name references as networks for clustering.
- **Hybrid**: Combines ML and graphs for robust matching.

Training data and extensions (e.g., for personal names) are available via the GitHub repo. See the paper for benchmarks showing superior performance in real-world tasks.

## License and Citations

- **License**: MIT.
- **Citation**: Use the provided [.bib file](https://connorjerzak.com/wp-content/uploads/2024/07/LinkOrgsBib.txt) or cite the paper directly. See [arxiv.org/abs/2302.02533](https://arxiv.org/abs/2302.02533) for the arXiv version.
- **Acknowledgements**: Built by Connor T. Jerzak et al. Hosted on Hugging Face.

If you encounter issues, check console logs or open an issue on GitHub. For more, visit [the project page](https://github.com/cjerzak/LinkOrgs-software).
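
## Example: Comparing Exported Embeddings

The Usage Guide describes an export whose embedding columns run from `emb_001` to `emb_N`. As a rough illustration of what could be done with such a file outside the app, the sketch below reads a tiny inline CSV and finds the name whose embedding is closest to a query name. Everything specific here is an assumption made for the example: the `name` column header, the 3-dimensional toy vectors, the helper functions `load_embeddings`, `cosine_similarity`, and `best_match`, and the choice of cosine similarity itself, which is a common way to compare embeddings but is not confirmed to be the matching rule LinkOrgs uses.

```python
import csv
import io
import math

# Toy stand-in for a downloaded embeddings CSV. The `name` column and the
# 3-dimensional vectors are illustrative assumptions; a real export would
# carry many more `emb_` columns and real model output.
SAMPLE_CSV = """name,emb_001,emb_002,emb_003
MIT,0.9,0.1,0.0
Massachusetts Institute of Technology,0.8,0.2,0.1
Acme Bakery,0.0,0.1,0.9
"""


def load_embeddings(handle, name_col="name"):
    """Return {name: [floats]} from every column whose header starts with 'emb_'."""
    reader = csv.DictReader(handle)
    # Keep the columns in file order so all vectors share the same layout.
    emb_cols = [c for c in reader.fieldnames if c.startswith("emb_")]
    return {row[name_col]: [float(row[c]) for c in emb_cols] for row in reader}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(query, embeddings):
    """Return (name, similarity) for the entry closest to `query`, excluding itself."""
    target = embeddings[query]
    scores = {
        name: cosine_similarity(target, vec)
        for name, vec in embeddings.items()
        if name != query
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


embeddings = load_embeddings(io.StringIO(SAMPLE_CSV))
match, score = best_match("MIT", embeddings)
print(f"Closest to 'MIT': {match} (cosine similarity {score:.3f})")
```

To try this on a real download, replace `io.StringIO(SAMPLE_CSV)` with an open file handle and set `name_col` to whatever the name column is called in your export. For linking two datasets, the same comparison would be run across names from both files, typically with a similarity threshold chosen by inspecting known matches.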