Optimize imports for extremely large history sizes for #168 (#171)

* Swap to using iterators for uploading to avoid storing all chunks in memory

* Chunk uploads for reuploading

* Revert "Swap to using iterators for uploading to avoid storing all chunks in memory"

This reverts commit 632ecc5c81.

* Make hishtory install work even if there is zero shell history on the device

* Skip DD integration for m1 mac since it seems to fail for mysterious beta-related reasons

* Log OpenAI error to debug log for #167

* Release v0.269

* Add explicit handling for 429 error code from OpenAI

* Release v0.270

* Fix handling of new lines in commands for #163 (#170)

* Fix handling of new lines in commands for #163

* Move code for table from lib.go to query.go

* Update goldens

* Release v0.271

* Properly silence which output to fix #166

* Release v0.272

* Add || true to fully fix #166

* Release v0.273

* Improve install.py script to attempt to detect when /tmp/ is noexec (#172)

* Improve install.py script to attempt to detect when /tmp/ is noexec

* Add test to install from python script at HEAD

* Remove incorrect duplicated line

* Delete the tmp hishtory-client download since it may be dropped in CWD rather than /tmp/

* Add basic smoke test to provide test coverage for other distros (#174)

* Fix quotes on container names

* More tweaks for smoke testing

* Skip setting the hostname for smoke tests since we don't need it

* Dependencies for smoke testing

* Add cgo deps

* Install killall command

* Add two more distros for smoke testing

* Add smoke tests for arch

* Update distro-smoke-test.yml

* Remove sudo since the arch container runs as root

* Drop sudo for OpenSUSE

* Update install commands for OpenSUSE and Arch

* More tweaks to install commands

* Update arch install command

* Remove OpenSUSE since their package repos are currently returning 500 errors

* Add another dep for arch

* Move up os.remove so that the file is removed even if it fails to execute

* Move function to start of python file to make it more idiomatic

* Update go action to enable caching of dependencies

* Run integration tests in parallel to speed up testing (#175)

* Remove a few direct DB insertions to prepare for parallel tests

* Revert "Remove a few direct DB insertions to prepare for parallel tests"

This reverts commit f8a3552ad8.

* Add rudimentary experiment of splitting tests into two chunks to make them faster

* Add missing tag

* Remove code that enforces that all goldens are used, since it is incompatible with how tests are currently split into chunks

* Lay out the framework for checking goldens being used across all test runs

* Fix missing brace

* Revert "Remove code that enforces that all goldens are used, since it is incompatible with how tests are currently split into chunks"

This reverts commit 06cc3eedbc.

* Add initial work towards checking that all goldens are used

* Delete incorrect and unreferenced matrix

* Upgrade actions/upload-artifact to see if that makes the download in the next job work

* Alternatively, try downloading the artifact by name

* Update golden checker to read all the golden artifacts

* Swap to using glob to enumerate all golden files, rather than hardcoding them

* Remove debugging commands

* Remove goldens that are actually used

* Remove another golden that is actually used

* Add more comprehensive support for test sharding

* Fix references to test shards and increase shard count

* Shard the fuzz test

* Add debug prints

* Mark additional tests for sharding

* Fix logic error that broke test sharding

* Remove debug print

* Fix incorrect logic with skipping the fuzz test

* Move sharding functions to testutils and add some comments

* Upgrade all setup-go actions to enable caching of deps

* Remove goldens that don't exist

* Remove new line

* Reduce delay

* Correct stage name

* Remove incorrect skip code from the first version of sharding

* Remove unused import

* Reduce number of test shards to match GitHub's limit of 5 concurrent macos jobs

* Use cask for installing homebrew to speed up github actions

* More cleanup for unused goldens

* Swap away from brew cask since it appears to be slower

* Add sync server to status -v #176 so that self-hosted users can easily confirm they're using the self-hosted server (#178)

* Release v0.274

* Make bash support lenient with empty history lines, which seems to happen for the first command or two of new installs

* Remove unnecessary sub-shell, since we just need a truthy value here

* Release v0.275

* Add web UI for querying history from the browser (#180)

As requested in #176 and #147 

* Add initail version of a web UI for querying history from the browser

* Rename webui command

* Add basic test for the web UI

* Add README for the web UI

* Add basic auth for the web server

* Add status code when panic-ing

* Release v0.276

* Add ability to disable auth and force specific creds for the web UI

* Add cleaning for integration test devices to remove DB entries

* Wire through the shell name into AI suggestions so that we can get more precise AI suggestions for the current shell

* Add support for control-A and control-E shortcuts similar to GNU readline

* Allow register new device when exceed user limit when user already exist (#181)

* Add basic readline-like support for using control-left and control-right to scroll horizontally by one word at a time

* Release v0.277

* Improve word boundary algorithm to ignore previous spaces so that control+arrow-keys will skip over repeated spaces

* Update colored golden

* Update test golden

* Update golden

* Disable colored output tests

* Add updated goldens

* Delete temporarily unused goldens

* Delete an unused file

* Bump github.com/jackc/pgx/v4 from 4.14.1 to 4.18.2 (#189)

Bumps [github.com/jackc/pgx/v4](https://github.com/jackc/pgx) from 4.14.1 to 4.18.2.
- [Changelog](https://github.com/jackc/pgx/blob/v4.18.2/CHANGELOG.md)
- [Commits](https://github.com/jackc/pgx/compare/v4.14.1...v4.18.2)

---
updated-dependencies:
- dependency-name: github.com/jackc/pgx/v4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump google.golang.org/protobuf from 1.28.1 to 1.33.0 (#191)

Bumps google.golang.org/protobuf from 1.28.1 to 1.33.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Upgrade SLSA releaser due to github.com/slsa-framework/slsa-github-generator/issues/3350

* Release v0.278

* Update slsa-verifier to attempt to fix SLSA breakage

* Release v0.279

* Release v0.280

* Add better error message for SLSA failures

* Disable validation so we can push out a working binary even though SLSA is broken

* Release v0.281

* Fully disable validation to allow an emergency release due to SLSA breakage

* Release v0.282

* Update cosign too to fix slsa breakage from https://blog.sigstore.dev/tuf-root-update/

* Release v0.283

* Release v0.284

* Fix go.mod version after cosign upgrade

* Update go.sum after cosign update

* Release v0.285

* Re-enable SLSA verification now that we've updated the SLSA version throughout the repo

* Release v0.286

* Disable validation with local build since it seems to fail for some reason

* Add SLSA validation with current binary built by SLSA

* Set up tmate session to debug slsa releaser

* Add SLSA failure warning for versions broken by SLSA

* Remove tmate session for debugging

* Release v0.287

* Bump gopkg.in/go-jose/go-jose.v2 from 2.6.1 to 2.6.3 (#197)

Bumps gopkg.in/go-jose/go-jose.v2 from 2.6.1 to 2.6.3.

---
updated-dependencies:
- dependency-name: gopkg.in/go-jose/go-jose.v2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add support for horizontal scrolling of all columns for #188 (#195)

* Bump github.com/docker/docker (#193)

Bumps [github.com/docker/docker](https://github.com/docker/docker) from 24.0.7+incompatible to 24.0.9+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v24.0.7...v24.0.9)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump github.com/sigstore/rekor from 1.0.0 to 1.2.0 (#91)

Bumps [github.com/sigstore/rekor](https://github.com/sigstore/rekor) from 1.0.0 to 1.2.0.
- [Release notes](https://github.com/sigstore/rekor/releases)
- [Changelog](https://github.com/sigstore/rekor/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sigstore/rekor/compare/v1.0.0...v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/sigstore/rekor
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add ability to configure custom OpenAI API endpoint for #186 (#194)

* Add ability to configure custom OpenAI API endpoint for #186

* Ensure the AiCompletionEndpoint field is always initialized

* Release v0.288

* Enable colored golden tests for linux (#184)

* Enable golden tests for linux and ensure all goldens get saved as outputs

* Swap in OS specific goldens

* Update colored goldens to take into account OS version, since different macos versions have different behavior here

* Update goldens

* Re-enable golden tests

* Add missing golden

* Empty commit

* Remove linux kernel version from OS name

* Remove minor version numbers from os versions for golden files for tests

* Continue-on-error for the DD setup since it will also fail if colima fails

* Add test for horizontal scrolling other columns for #188

* Add support for forcing init without prompting via --force flag for #198

* Clean up: Remove duplicated code by calling existing utility function

* Add mouse scrolling support for #200

* Revert "Add mouse scrolling support for #200" since it breaks the ability to highlight text

This reverts commit 7d9bb6654d.

* Release v0.289

* Add benchmarking for searching for #202

* Add index of start time so that queries with a LIMIT clause can avoid a full table scan (for #202)

* Release v0.290

* Add --port flag for the web UI for #203

* Add additional test for smoke tests to cover syncing

* Move extra delay to a separate job to avoid wasting GH action quota by sleeping in duplicated jobs

* Release v0.291

* Revert "Add additional test for smoke tests to cover syncing"

This reverts commit 514d95ba4e.

* Fix double-syncing error where devices receive entries from themselves #202 (#204)

* Fix double-syncing error where devices receive entries from themselves

* Fix incorrect error message

* Add TODO

* Update TestESubmitThenQuery after making query more efficient

* Update TestDeletionRequests and remove unnecessary asserts

* Swap server_test.go to using require

* Fix incorrect require due to typo

* Slow down gif per feedback in #199

* Update bubbletea to include 2b46020ca0725219da1a7d7969fa85c486181258 since it seems to help fix #185

* Fix test broken by 7ae9f15b by making sure input is sent and processed as separate events

* Fix test broken by 7ae9f15b by making sure input is sent and processed as separate events

* Fix test broken by 7ae9f15b by making sure input is sent and processed as separate events and updating the golden to reflect this

* Release v0.292

* Bump golang.org/x/net from 0.22.0 to 0.23.0 (#206)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.22.0 to 0.23.0.
- [Commits](https://github.com/golang/net/compare/v0.22.0...v0.23.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Remove darwin-21 goldens since they're no longer used now that GH upgraded their macos image (#210)

* Update backend to avoid persisting entries to be read by devices that have been uninstalled

* Add support for custom key bindings for #190 (#209)

* Add support for custom key bindings for #190

* Add tests for configuring custom key bindings

* Simplify key bindings test

* Add docs on custom key bindings + error message for unhandled actions

* Fix condition added in d6a60214a2 to also apply to rows with the go 'empty' value and not just null

* Add support for enabling/disabling syncing post-install

* Release v0.293

* fix: close file (#213)

* Release v0.294

* Move docs on custom key bindings to a more logical location

* Fix duplicate pre-saving issue reported in #215

* Revert "Fix duplicate pre-saving issue reported in #215"

This reverts commit 336b331687.

* Fix duplicate pre-saving issue reported in #215 (#217)

* Release v0.295

* Add full fix for #215 along with a test to reproduce the issue (#218)

* Release v0.296

* Add ability to skip config modifications for #212 (#216)

* Add ability to skip config modifications

* Update golden names to fork on OS

* Remove incorrect newline in golden

* Add README documentation for default-filter

* Update title for section

* Release v0.297

* Add basic fix for #225 by escaping tab characters before rendering

This is a tricky bug to fix because the width of a tab character varies depending on context. This means that when we're trying to build a table and calculating the width of columns for budgeting, we can't actually know the width of a tab without knowing exactly what characters come before it. This is in theory doable, but it leads to some really complex code that I'd rather not adopt.

* Release v0.298

* Bump github.com/hashicorp/go-retryablehttp from 0.7.2 to 0.7.7 (#223)

Bumps [github.com/hashicorp/go-retryablehttp](https://github.com/hashicorp/go-retryablehttp) from 0.7.2 to 0.7.7.
- [Changelog](https://github.com/hashicorp/go-retryablehttp/blob/main/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/go-retryablehttp/compare/v0.7.2...v0.7.7)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/go-retryablehttp
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add additional fallback method for retrieving the CWD to further improve the situation for #226

* Explicitly install openssl to see if it fixes smoke test errors on arch

* Add integration test for #226

* Release v0.299

* Update macos version for signer since GH dropped support for macos 11

* Release v0.300

* Swap to macos-latest to see if GH has more quota for that tag

* Release v0.301

* Release v0.302

* Upgrade to setup-go@v4 for automatic caching support

* Revert "Remove OpenSUSE since their package repos are currently returning 500 errors"

This reverts commit 62700605d7.

* Install git and tar for opensuse smoke tests

* Link /bin/sh for opensuse smoke tests

* Remove opensuse smoke tests

* use http.DefaultClient (#232)

* Add new short name for "ExitCode" - "$?" (#228)

* Add more short column name alternatives similar to #228

* add forceComapctMode config entry (#237)

* Add docs in readme to call out shorter column names as added in #228

* Change compact-mode setting that was added in #237 to respect the convention of taking in an argument

* Add config-get compact-mode command (as needed by #237)

* Move checking of forced compact mode into helper functions to ensure it is checked everywhere (follow up to #237)

* Add test for forced compact mode (from #237)

* ai: add some new env variables to control OpenAI requests (#231)

Co-authored-by: David Dworken <david@daviddworken.com>

* Update incorrect docs on ClientConfig struct

* Add ability for the client to configure the model via an environment variable

* Bump github.com/docker/docker (#236)

Bumps [github.com/docker/docker](https://github.com/docker/docker) from 24.0.9+incompatible to 25.0.6+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v24.0.9...v25.0.6)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* run "make fmt" (#233)

* Add make fmt to pre-commit

* Fix import

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Nguyễn Hoàng Đức <lazyc97@tutanota.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: guangwu <guoguangwu@magic-shield.com>
Co-authored-by: Pavel Griaznov <grbitt@gmail.com>
This commit is contained in:
David Dworken 2024-08-25 18:07:47 -07:00 committed by GitHub
parent bf47575e6f
commit 8cd1bae710
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
3 changed files with 63 additions and 29 deletions

2
.gitignore vendored
View File

@ -11,4 +11,4 @@ package-lock.json
.prettierrc
# VS Code settings
.vscode/
.vscode/

View File

@ -46,7 +46,7 @@ var installCmd = &cobra.Command{
if os.Getenv("HISHTORY_SKIP_INIT_IMPORT") == "" {
db, err := hctx.OpenLocalSqliteDb()
lib.CheckFatalError(err)
count, err := countStoredEntries(db)
count, err := lib.CountStoredEntries(db)
lib.CheckFatalError(err)
if count < 10 {
fmt.Println("Importing existing shell history...")
@ -70,7 +70,7 @@ var initCmd = &cobra.Command{
Run: func(cmd *cobra.Command, args []string) {
db, err := hctx.OpenLocalSqliteDb()
lib.CheckFatalError(err)
count, err := countStoredEntries(db)
count, err := lib.CountStoredEntries(db)
lib.CheckFatalError(err)
if count > 0 && !(*forceInit) {
fmt.Printf("Your current hishtory profile has saved history entries, are you sure you want to run `init` and reset?\nNote: This won't clear any imported history entries from your existing shell\n[y/N]")
@ -133,13 +133,6 @@ var uninstallCmd = &cobra.Command{
},
}
func countStoredEntries(db *gorm.DB) (int64, error) {
return lib.RetryingDbFunctionWithResult(func() (int64, error) {
var count int64
return count, db.Model(&data.HistoryEntry{}).Count(&count).Error
})
}
func warnIfUnsupportedBashVersion() error {
_, err := exec.LookPath("bash")
if err != nil {

View File

@ -612,32 +612,59 @@ func Reupload(ctx context.Context) error {
if config.IsOffline {
return nil
}
entries, err := Search(ctx, hctx.GetDb(ctx), "", 0)
numEntries, err := CountStoredEntries(hctx.GetDb(ctx))
if err != nil {
return fmt.Errorf("failed to reupload due to failed search: %w", err)
return fmt.Errorf("failed to upload history entries due to error in counting entries: %v", err)
}
var bar *progressbar.ProgressBar
if len(entries) > NUM_IMPORTED_ENTRIES_SLOW {
if numEntries > int64(NUM_IMPORTED_ENTRIES_SLOW) {
fmt.Println("Persisting history entries")
bar = progressbar.Default(int64(len(entries)))
bar = progressbar.Default(int64(numEntries))
defer bar.Finish()
}
chunkSize := 500
chunks := shared.Chunks(entries, chunkSize)
return shared.ForEach(chunks, 10, func(chunk []*data.HistoryEntry) error {
jsonValue, err := EncryptAndMarshal(config, chunk)
// This number is a balance between speed and memory usage. If we make it too high, then
// it will mean we use a ton of memory (since we retrieve all of those entries). But if
// we make it too low, then it will have to do repeated SQL queries with OFFSETs, which
// are inherently slow.
searchChunkSize := 300_000
currentOffset := 0
for {
entries, err := SearchWithOffset(ctx, hctx.GetDb(ctx), "", searchChunkSize, currentOffset)
if err != nil {
return fmt.Errorf("failed to reupload due to failed encryption: %w", err)
return fmt.Errorf("failed to reupload due to failed search: %w", err)
}
_, err = ApiPost(ctx, "/api/v1/submit?source_device_id="+config.DeviceId, "application/json", jsonValue)
if len(entries) == 0 {
if currentOffset == 0 && numEntries != 0 {
return fmt.Errorf("found no entries for reuploading, something went wrong")
} else {
return nil
}
}
currentOffset += searchChunkSize
// This number is a balance between speed, and ensuring that we don't send too much data
// in a single request (since large individual requests are extremely slow). From benchmarking,
// it is apparent that this value seems to work quite well.
uploadChunkSize := 500
chunks := shared.Chunks(entries, uploadChunkSize)
err = shared.ForEach(chunks, 10, func(chunk []*data.HistoryEntry) error {
jsonValue, err := EncryptAndMarshal(config, chunk)
if err != nil {
return fmt.Errorf("failed to reupload due to failed encryption: %w", err)
}
_, err = ApiPost(ctx, "/api/v1/submit?source_device_id="+config.DeviceId, "application/json", jsonValue)
if err != nil {
return fmt.Errorf("failed to reupload due to failed POST: %w", err)
}
if bar != nil {
_ = bar.Add(uploadChunkSize)
}
return nil
})
if err != nil {
return fmt.Errorf("failed to reupload due to failed POST: %w", err)
return err
}
if bar != nil {
_ = bar.Add(chunkSize)
}
return nil
})
}
}
func RetrieveAdditionalEntriesFromRemote(ctx context.Context, queryReason string) error {
@ -775,12 +802,16 @@ func MakeWhereQueryFromSearch(ctx context.Context, db *gorm.DB, query string) (*
}
func Search(ctx context.Context, db *gorm.DB, query string, limit int) ([]*data.HistoryEntry, error) {
return retryingSearch(ctx, db, query, limit, 0)
return SearchWithOffset(ctx, db, query, limit, 0)
}
func SearchWithOffset(ctx context.Context, db *gorm.DB, query string, limit, offset int) ([]*data.HistoryEntry, error) {
return retryingSearch(ctx, db, query, limit, offset, 0)
}
const SEARCH_RETRY_COUNT = 3
func retryingSearch(ctx context.Context, db *gorm.DB, query string, limit, currentRetryNum int) ([]*data.HistoryEntry, error) {
func retryingSearch(ctx context.Context, db *gorm.DB, query string, limit, offset, currentRetryNum int) ([]*data.HistoryEntry, error) {
if ctx == nil && query != "" {
return nil, fmt.Errorf("lib.Search called with a nil context and a non-empty query (this should never happen)")
}
@ -798,13 +829,16 @@ func retryingSearch(ctx context.Context, db *gorm.DB, query string, limit, curre
if limit > 0 {
tx = tx.Limit(limit)
}
if offset > 0 {
tx = tx.Offset(offset)
}
var historyEntries []*data.HistoryEntry
result := tx.Find(&historyEntries)
if result.Error != nil {
if strings.Contains(result.Error.Error(), SQLITE_LOCKED_ERR_MSG) && currentRetryNum < SEARCH_RETRY_COUNT {
hctx.GetLogger().Infof("Ignoring err=%v and retrying search query, cnt=%d", result.Error, currentRetryNum)
time.Sleep(time.Duration(currentRetryNum*rand.Intn(50)) * time.Millisecond)
return retryingSearch(ctx, db, query, limit, currentRetryNum+1)
return retryingSearch(ctx, db, query, limit, offset, currentRetryNum+1)
}
return nil, fmt.Errorf("DB query error: %w", result.Error)
}
@ -1012,3 +1046,10 @@ func SendDeletionRequest(ctx context.Context, deletionRequest shared.DeletionReq
}
return nil
}
func CountStoredEntries(db *gorm.DB) (int64, error) {
return RetryingDbFunctionWithResult(func() (int64, error) {
var count int64
return count, db.Model(&data.HistoryEntry{}).Count(&count).Error
})
}