registry/.github/scripts/check_registry_site_health.sh
Michael Smith 376664ca04
chore: migrate all Coder modules to Registry repo (#4)
Addresses part of https://github.com/coder/internal/issues/532 (but doesn't fully close it out).

This is a huge PR, but chunking it up seemed pointless, since we're largely copying over existing files. I'm going to try commenting on the main areas I think are worth paying attention to

## Changes made
- Migrated over all Coder modules from coder/modules, and put them in their correct user namespaces.
- Added README.md files to all newly-created user namespaces
- Updated all image paths for every image used, to make sure they don't break (I switched the paths programmatically, and then manually verified every README to guarantee this).
- Added README.md files for each contributor who previously made a module
- Updated our `tsconfig.json` file to use modern libraries when run from the server, and made a custom `tsconfig.json` just for the `windows-rdp` module (which has more restrictive browser concerns)
- Added CI step to run all the module tests we currently have

## Notes
- There were a lot of Bash script files that weren't carried over in this PR, partly because I don't know Bash well enough to know (1) whether they're still needed, or (2) modify them to account for the new file structure. Those can be brought over later.
- We had a `lint.ts` file that provided some light validation of some of the modules. After going through it, there were so many bugs and issues with the code that I legitimately think that it barely provided a safety net at all. I got rid of it entirely, with the intention of adding the functionality that was originally intended to the current validation logic (to be handled in a separate PR).
- I changed how we set up the `.images` directory, because it felt like it would be chaos if a bunch of users try to throw all their images in one giant directory, with no guidelines on how to do it. I instead made it so that images should be scoped by namespace, which felt a lot more manageable. The `.icons` directory is still at the top level, because realistically, there are only going to be so many types of icons referenced, so it's fine for those to be shared.
- I don't think the `maintainer_github` and `contributor_github` fields make sense anymore, but those can be stripped out once we've updated the Registry site build step to use the new Registry repo
   - My gut instinct is to use the user namespace to determine the main owner of the module, and then add a `contributors` string list to indicate which other users have contributed meaningfully to it. We can then add validation to make sure that every value in that list exists as another namespace in the repo

## What still needs to be migrated (in separate PRs)
These are the main files of interest that still probably need to be copied over from the `/modules` repo:
- `new.sh`
- `terraform_validate.sh`
- `update-version.sh`

They're probably going to require enough changes that it's worth handling them in a separate PR.
2025-04-17 09:49:54 -04:00

205 lines
7.5 KiB
Bash

#!/usr/bin/env bash
set -o pipefail
set -u
VERBOSE="${VERBOSE:-0}"
if [[ "${VERBOSE}" -ne "0" ]]; then
set -x
fi
# List of required environment variables
required_vars=(
"INSTATUS_API_KEY"
"INSTATUS_PAGE_ID"
"INSTATUS_COMPONENT_ID"
"VERCEL_API_KEY"
)
# Check if each required variable is set
for var in "${required_vars[@]}"; do
if [[ -z "${!var:-}" ]]; then
echo "Error: Environment variable '$var' is not set."
exit 1
fi
done
REGISTRY_BASE_URL="${REGISTRY_BASE_URL:-https://registry.coder.com}"
status=0
declare -a modules=()
declare -a failures=()
# Collect all module directories containing a main.tf file
for path in $(find . -maxdepth 2 -not -path '*/.*' -type f -name main.tf | cut -d '/' -f 2 | sort -u); do
modules+=("${path}")
done
echo "Checking modules: ${modules[*]}"
# Function to update the component status on Instatus
update_component_status() {
local component_status=$1
# see https://instatus.com/help/api/components
(curl -X PUT "https://api.instatus.com/v1/$INSTATUS_PAGE_ID/components/$INSTATUS_COMPONENT_ID" \
-H "Authorization: Bearer $INSTATUS_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"status\": \"$component_status\"}")
}
# Function to create an incident
create_incident() {
local incident_name="Degraded Service"
local message="The following modules are experiencing issues:\n"
for i in "${!failures[@]}"; do
message+="$((i + 1)). ${failures[$i]}\n"
done
component_status="PARTIALOUTAGE"
if ((${#failures[@]} == ${#modules[@]})); then
component_status="MAJOROUTAGE"
fi
# see https://instatus.com/help/api/incidents
incident_id=$(curl -s -X POST "https://api.instatus.com/v1/$INSTATUS_PAGE_ID/incidents" \
-H "Authorization: Bearer $INSTATUS_API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"name\": \"$incident_name\",
\"message\": \"$message\",
\"components\": [\"$INSTATUS_COMPONENT_ID\"],
\"status\": \"INVESTIGATING\",
\"notify\": true,
\"statuses\": [
{
\"id\": \"$INSTATUS_COMPONENT_ID\",
\"status\": \"PARTIALOUTAGE\"
}
]
}" | jq -r '.id')
echo "Created incident with ID: $incident_id"
}
# Function to check for existing unresolved incidents
check_existing_incident() {
# Fetch the latest incidents with status not equal to "RESOLVED"
local unresolved_incidents=$(curl -s -X GET "https://api.instatus.com/v1/$INSTATUS_PAGE_ID/incidents" \
-H "Authorization: Bearer $INSTATUS_API_KEY" \
-H "Content-Type: application/json" | jq -r '.incidents[] | select(.status != "RESOLVED") | .id')
if [[ -n "$unresolved_incidents" ]]; then
echo "Unresolved incidents found: $unresolved_incidents"
return 0 # Indicate that there are unresolved incidents
else
echo "No unresolved incidents found."
return 1 # Indicate that no unresolved incidents exist
fi
}
force_redeploy_registry() {
# These are not secret values; safe to just expose directly in script
local VERCEL_TEAM_SLUG="codercom"
local VERCEL_TEAM_ID="team_tGkWfhEGGelkkqUUm9nXq17r"
local VERCEL_APP="registry"
local latest_res
latest_res=$(
curl "https://api.vercel.com/v6/deployments?app=$VERCEL_APP&limit=1&slug=$VERCEL_TEAM_SLUG&teamId=$VERCEL_TEAM_ID&target=production&state=BUILDING,INITIALIZING,QUEUED,READY" \
--fail \
--silent \
--header "Authorization: Bearer $VERCEL_API_KEY" \
--header "Content-Type: application/json"
)
# If we have zero deployments, something is VERY wrong. Make the whole
# script exit with a non-zero status code
local latest_id
latest_id=$(echo "${latest_res}" | jq -r '.deployments[0].uid')
if [[ "${latest_id}" = "null" ]]; then
echo "Unable to pull any previous deployments for redeployment"
echo "Please redeploy the latest deployment manually in Vercel."
echo "https://vercel.com/codercom/registry/deployments"
exit 1
fi
local latest_date_ts_seconds
latest_date_ts_seconds=$(echo "${latest_res}" | jq -r '.deployments[0].createdAt/1000|floor')
local current_date_ts_seconds
current_date_ts_seconds="$(date +%s)"
local max_redeploy_interval_seconds=7200 # 2 hours
if ((current_date_ts_seconds - latest_date_ts_seconds < max_redeploy_interval_seconds)); then
echo "The registry was deployed less than 2 hours ago."
echo "Not automatically re-deploying the regitstry."
echo "A human reading this message should decide if a redeployment is necessary."
echo "Please check the Vercel dashboard for more information."
echo "https://vercel.com/codercom/registry/deployments"
exit 1
fi
local latest_deployment_state
latest_deployment_state="$(echo "${latest_res}" | jq -r '.deployments[0].state')"
if [[ "${latest_deployment_state}" != "READY" ]]; then
echo "Last deployment was not in READY state. Skipping redeployment."
echo "A human reading this message should decide if a redeployment is necessary."
echo "Please check the Vercel dashboard for more information."
echo "https://vercel.com/codercom/registry/deployments"
exit 1
fi
echo "============================================================="
echo "!!! Redeploying registry with deployment ID: ${latest_id} !!!"
echo "============================================================="
if ! curl -X POST "https://api.vercel.com/v13/deployments?forceNew=1&skipAutoDetectionConfirmation=1&slug=$VERCEL_TEAM_SLUG&teamId=$VERCEL_TEAM_ID" \
--fail \
--header "Authorization: Bearer $VERCEL_API_KEY" \
--header "Content-Type: application/json" \
--data-raw "{ \"deploymentId\": \"${latest_id}\", \"name\": \"${VERCEL_APP}\", \"target\": \"production\" }"; then
echo "DEPLOYMENT FAILED! Please check the Vercel dashboard for more information."
echo "https://vercel.com/codercom/registry/deployments"
exit 1
fi
}
# Check each module's accessibility
for module in "${modules[@]}"; do
# Trim leading/trailing whitespace from module name
module=$(echo "${module}" | xargs)
url="${REGISTRY_BASE_URL}/modules/${module}"
printf "=== Checking module %s at %s\n" "${module}" "${url}"
status_code=$(curl --output /dev/null --head --silent --fail --location "${url}" --retry 3 --write-out "%{http_code}")
if ((status_code != 200)); then
printf "==> FAIL(%s)\n" "${status_code}"
status=1
failures+=("${module}")
else
printf "==> OK(%s)\n" "${status_code}"
fi
done
# Determine overall status and update Instatus component
if ((status == 0)); then
echo "All modules are operational."
# set to
update_component_status "OPERATIONAL"
else
echo "The following modules have issues: ${failures[*]}"
# check if all modules are down
if ((${#failures[@]} == ${#modules[@]})); then
update_component_status "MAJOROUTAGE"
else
update_component_status "PARTIALOUTAGE"
fi
# Check if there is an existing incident before creating a new one
if ! check_existing_incident; then
create_incident
fi
# If a module is down, force a reployment to try getting things back online
# ASAP
# EDIT: registry.coder.com is no longer hosted on vercel
#force_redeploy_registry
fi
exit "${status}"