Commit Graph

13 Commits

Author SHA1 Message Date
Jack Wright
23ba613b00
Polars AWS S3 support (#14648)
# Description

Provides Amazon S3 support.

- Utilizes your existing AWS cli configuration. 
- Supports AWS SSO
- Supports
[gimme-aws-creds](https://github.com/Nike-Inc/gimme-aws-creds).
- respects the settings of AWS_PROFILE environment variable for
selecting profile config
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION environment
variables for configuring without an AWS config

Usage:
```nushell
polars open s3://bucket/and/path.parquet
```

Supports:
- CSV
- Parquet
- NDJSON / json lines
- Arrow

Doesn't support:
- eager dataframes
-  Avro
- JSON
2024-12-25 06:15:50 -06:00
Jack Wright
c535c24d03
catch unwrap on panics with polars collect (#13850)
# Description
This resurrects the work from #12866 and fixes #12732. 

Polars panics for a plethora or reasons. While handling panics is
generally frowned upon, in cases like with `polars collect` a panic
cause a lot of work to be lost. Often you might have multiple dataframes
in memory and you are trying one operation and lose all state.

While it possible the panic can leave things a strange state, it is
pretty unlikely as part of a polars pipeline. Most of the time polars
objects are not manipulating dataframes in memory mutability, but rather
creating a new dataframe the operations being applied. This is always
the case with a lazy pipeline. After the collect call, the original
dataframes are intact still and I haven't observed any side effects.
2024-09-15 07:21:02 -05:00
Jack Wright
8d60c0d35d
Migrating polars commands away from macros, removed custom DataFrame comparison. (#13829)
# Description
This PR:
- Removes the lazy_command, expr_command macros and migrates the
commands that were utilizing them.
- Removes the custom logic in DataFrameValues::is_equals to use the
polars DataFrame version of PartialEq
- Adds examples to commands that previously did not have examples or had
inadequate ones.

NOTE: A lot of examples now have a `polars sort` at the end. This is
needed due to the comparison in the result. The new polars version of
equals cares about the ordering. I removed the custom equals logic as it
causes comparisons to lock up when comparing dataframes that contain a
row that contains a list. I discovered this issue when adding examples
to `polars implode`
2024-09-11 10:33:05 -07:00
Jack Wright
f531cc2058
Polars command reorg (#13798)
# Description
House keeping. Restructures polars modules as discussed in:
https://docs.google.com/spreadsheets/d/1gyA58i_yTXKCJ5DbO_RxBNAlK6S7C1M22ppKwVLZltc/edit?usp=sharing
2024-09-06 13:46:37 -07:00
Jack Wright
8316a1597e
Polars: Check to see if the cache is empty before enabling GC. More logging (#13286)
There was a bug where anytime the plugin cache remove was called, the
plugin gc was turned back on. This probably happened when I added the
reference counter logic.
2024-07-03 06:44:26 -05:00
Jack Wright
1f1f581357
Converted perf function to be a macro. Utilized the perf macro within the polars plugin. (#13224)
In this pull request, I converted the `perf` function within `nu_utils`
to a macro. This change facilitates easier usage within plugins by
allowing the use of `env_logger` and setting `RUST_LOG=nu_plugin_polars`
(or another plugin). Without this conversion, the `RUST_LOG` variable
would need to be set to `RUST_LOG=nu_utils::utils`, which is less
intuitive and impossible to narrow the perf results to one plugin.
2024-06-27 18:56:56 -05:00
Devyn Cairns
91d44f15c1
Allow plugins to report their own version and store it in the registry (#12883)
# Description

This allows plugins to report their version (and potentially other
metadata in the future). The version is shown in `plugin list` and in
`version`.

The metadata is stored in the registry file, and reflects whatever was
retrieved on `plugin add`, not necessarily the running binary. This can
help you to diagnose if there's some kind of mismatch with what you
expect. We could potentially use this functionality to show a warning or
error if a plugin being run does not have the same version as what was
in the cache file, suggesting `plugin add` be run again, but I haven't
done that at this point.

It is optional, and it requires the plugin author to make some code
changes if they want to provide it, since I can't automatically
determine the version of the calling crate or anything tricky like that
to do it.

Example:

```
> plugin list | select name version is_running pid
╭───┬────────────────┬─────────┬────────────┬─────╮
│ # │      name      │ version │ is_running │ pid │
├───┼────────────────┼─────────┼────────────┼─────┤
│ 0 │ example        │ 0.93.1  │ false      │     │
│ 1 │ gstat          │ 0.93.1  │ false      │     │
│ 2 │ inc            │ 0.93.1  │ false      │     │
│ 3 │ python_example │ 0.1.0   │ false      │     │
╰───┴────────────────┴─────────┴────────────┴─────╯
```

cc @maxim-uvarov (he asked for it)

# User-Facing Changes

- `plugin list` gets a `version` column
- `version` shows plugin versions when available
- plugin authors *should* add `fn metadata()` to their `impl Plugin`,
but don't have to

# Tests + Formatting

Tested the low level stuff and also the `plugin list` column.

# After Submitting
- [ ] update plugin guide docs
- [ ] update plugin protocol docs (`Metadata` call & response)
- [ ] update plugin template (`fn metadata()` should be easy)
- [ ] release notes
2024-06-21 06:27:09 -05:00
Jack Wright
20834c9d47
Added the ability to turn on performance debugging through and env var for the polars plugin (#13191)
This allows performance debugging to be turned on by setting:

```nushell
$env.POLARS_PLUGIN_PERF = "true"
```

Furthermore, this improves the other plugin debugging by allowing the
env variable for debugging to be set at any time versus having to be
available when nushell is launched:

```nushell
$env.POLARS_PLUGIN_DEBUG = "true"
```

This plugin introduces a `perf` function that will output timing
results. This works very similar to the perf function available in
nu_utils::utils::perf. This version prints everything to std error to
not break the plugin stream and uses the engine interface to see if the
env variable is configured.

This pull requests uses this `perf` function when:
* opening csv files as dataframes
* opening json lines files as dataframes

This will hopefully help provide some more fine grained information on
how long it takes polars to open different dataframes. The `perf` can
also be utilized later for other dataframes use cases.
2024-06-20 16:37:38 -07:00
Jack Wright
a60381a932
Added commands for working with the plugin cache. (#12576)
# Description
This pull request provides three new commands:
`polars store-ls` - moved from `polars ls`. It provides the list of all
object stored in the plugin cache
`polars store-rm` - deletes a cached object
`polars store-get` - gets an object from the cache. 

The addition of `polars store-get` required adding a reference_count to
cached entries. `polars get` is the only command that will increment
this value. `polars rm` will remove the value despite it's count. Calls
to PolarsPlugin::custom_value_dropped will decrement the value.

The prefix store- was chosen due to there already being a `polars cache`
command. These commands were not made sub-commands as there isn't a way
to display help for sub commands in plugins (e.g. `polars store`
displaying help) and I felt the store- seemed fine anyways.

The output of `polars store-ls` now shows the reference count for each
object.

# User-Facing Changes
polars ls has now moved to polars store-ls

---------

Co-authored-by: Jack Wright <jack.wright@disqo.com>
2024-04-21 19:43:43 -05:00
Jack Wright
5f818eaefe
Ensure that lazy frames converted via to-lazy are not converted back to eager frames later in the pipeline. (#12525)
# Description
@maxim-uvarov discovered the following error:
```
> [[a b]; [6 2] [1 4] [4 1]] | polars into-lazy | polars sort-by a | polars unique --subset [a]
Error:   × Error using as series
   ╭─[entry #1:1:68]
 1 │ [[a b]; [6 2] [1 4] [4 1]] | polars into-lazy | polars sort-by a | polars unique --subset [a]
   ·                                                                    ──────┬──────
   ·                                                                          ╰── dataframe has more than one column
   ╰────
 ```
 
During investigation, I discovered the root cause was that the lazy frame was incorrectly converted back to a eager dataframe. In order to keep this from happening, I explicitly set that the dataframe did not come from an eager frame. This causes the conversion logic to not attempt to convert the dataframe later in the pipeline.

---------

Co-authored-by: Jack Wright <jack.wright@disqo.com>
2024-04-15 18:29:42 -05:00
Ian Manske
211d9c685c
Fix clippy lint (#12504)
Just fixes a clippy lint.
2024-04-13 16:19:32 +00:00
Jack Wright
b9c2f9ee56
displaying span information, creation time, and size with polars ls (#12472)
# Description
`polars ls` is already different that `dfr ls`. Currently it just shows
the cache key, columns, rows, and type. I have added:
- creation time
- size
- span contents
-  span start and end

<img width="1471" alt="Screenshot 2024-04-10 at 17 27 06"
src="https://github.com/nushell/nushell/assets/56345/545918b7-7c96-4c25-bc01-b9e2b659a408">

# Tests + Formatting
Done

Co-authored-by: Jack Wright <jack.wright@disqo.com>
2024-04-12 09:23:46 -05:00
Jack Wright
efc1cfa939
Move dataframes support to a plugin (#12220)
WIP

This PR covers migration crates/nu-cmd-dataframes to a new plugin
./crates/nu_plugin_polars

## TODO List

Other:
- [X] Fix examples
- [x] Fix Plugin Test Harness
- [X] Move Cache to Mutex<BTreeMap>
- [X] Logic for disabling/enabling plugin GC based off whether items are
cached.
- [x] NuExpression custom values
- [X] Optimize caching (don't cache every object creation). 
- [x] Fix dataframe operations (in NuDataFrameCustomValue::operations)
- [x] Added plugin_debug! macro that for checking an env variable
POLARS_PLUGIN_DEBUG

Fix duplicated commands:
- [x] There are two polars median commands, one for lazy and one for
expr.. there should only be one that works for both. I temporarily
called on polars expr-median (inside expressions_macros.rs)
- [x] polars quantile (lazy, and expr). the expr one is temporarily
expr-median
- [x] polars is-in (renamed one series-is-in)

Commands:
- [x] AppendDF
- [x] CastDF
- [X] ColumnsDF
- [x] DataTypes
- [x] Summary
- [x] DropDF
- [x] DropDuplicates
- [x] DropNulls
- [x] Dummies
- [x] FilterWith
- [X] FirstDF
- [x] GetDF
- [x] LastDF
- [X] ListDF
- [x] MeltDF
- [X] OpenDataFrame
- [x] QueryDf
- [x] RenameDF
- [x] SampleDF
- [x] SchemaDF
- [x] ShapeDF
- [x] SliceDF
- [x] TakeDF
- [X] ToArrow
- [x] ToAvro
- [X] ToCSV
- [X] ToDataFrame
- [X] ToNu
- [x] ToParquet
- [x] ToJsonLines
- [x] WithColumn
- [x] ExprAlias
- [x] ExprArgWhere
- [x] ExprCol
- [x] ExprConcatStr
- [x] ExprCount
- [x] ExprLit
- [x] ExprWhen
- [x] ExprOtherwise
- [x] ExprQuantile
- [x] ExprList
- [x] ExprAggGroups
- [x] ExprCount
- [x] ExprIsIn
- [x] ExprNot
- [x] ExprMax
- [x] ExprMin
- [x] ExprSum
- [x] ExprMean
- [x] ExprMedian
- [x] ExprStd
- [x] ExprVar
- [x] ExprDatePart
- [X] LazyAggregate
- [x] LazyCache
- [X] LazyCollect
- [x] LazyFetch
- [x] LazyFillNA
- [x] LazyFillNull
- [x] LazyFilter
- [x] LazyJoin
- [x] LazyQuantile
- [x] LazyMedian
- [x] LazyReverse
- [x] LazySelect
- [x] LazySortBy
- [x] ToLazyFrame
- [x] ToLazyGroupBy
- [x] LazyExplode
- [x] LazyFlatten
- [x] AllFalse
- [x] AllTrue
- [x] ArgMax
- [x] ArgMin
- [x] ArgSort
- [x] ArgTrue
- [x] ArgUnique
- [x] AsDate
- [x] AsDateTime
- [x] Concatenate
- [x] Contains
- [x] Cumulative
- [x] GetDay
- [x] GetHour
- [x] GetMinute
- [x] GetMonth
- [x] GetNanosecond
- [x] GetOrdinal
- [x] GetSecond
- [x] GetWeek
- [x] GetWeekDay
- [x] GetYear
- [x] IsDuplicated
- [x] IsIn
- [x] IsNotNull
- [x] IsNull
- [x] IsUnique
- [x] NNull
- [x] NUnique
- [x] NotSeries
- [x] Replace
- [x] ReplaceAll
- [x] Rolling
- [x] SetSeries
- [x] SetWithIndex
- [x] Shift
- [x] StrLengths
- [x] StrSlice
- [x] StrFTime
- [x] ToLowerCase
- [x] ToUpperCase
- [x] Unique
- [x] ValueCount

---------

Co-authored-by: Jack Wright <jack.wright@disqo.com>
2024-04-09 19:31:43 -05:00